[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670455#action_12670455 ]
Michael McCandless commented on LUCENE-1536:
--------------------------------------------

Some ideas / further things to explore:

* Deletions, top-level filters, and a BooleanQuery that "factors" into a top-level AND really should all be handled by the same code.

* Even an AND'd filter on a sub-clause of a BooleanQuery can be pushed down to the TermDocs under that tree.

* That common code should send a top-level filter down to the lowest level, used via the random-access API, if the filter supports random access (not all do) and it's not super sparse.

* I think one thing slowing down trunk is the lack of a Scorer.skipToButNotNext API. We now ask the filter for its next(), which gives us a filterDocID, and then we call Scorer.skipTo(filterDocID). If the scorer does not match that filterDocID, it internally does next(), which for an expensive scorer is a lot of likely wasted work: it advances to a docID that the filter may not accept. If we had a "skipToButNotNext" API we could avoid that wasted work. I'm curious what gains this change alone would provide.

* I'm thinking (but haven't tested this) that if the filter is relatively sparse compared to the other iterators, it'd be better to convert it to a sparse representation (e.g. SortedVIntList) and drive the search by iterating through the filter, after fixing the above skipTo issue. Maybe a "low iterator" access.

* We may need a "filter optimizer" utility class somewhere, somehow. For filters you do not plan to re-use, you would not bother with this. But for filters that will be re-used, you should 1) convert them to a sparse or non-sparse representation depending on their density, 2) maybe invert them and make them sparse if they are close to 100% density, and 3) maybe factor deletions into the filter so there is only a single top-level filter to apply.

* I'm not yet sure how to make this change cleanly in the APIs...
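As a rough illustration of the "filter optimizer" idea above, here is a minimal sketch of density-based representation choice. This is not Lucene code: the class name, method names, and the density thresholds are all illustrative assumptions, and a real implementation would work on Lucene's DocIdSet types rather than java.util.BitSet.

```java
import java.util.BitSet;

// Hypothetical sketch of density-driven filter representation choice.
// Thresholds (1% / 99%) are placeholders, not values from this issue.
public class FilterOptimizerSketch {

    // Very sparse filters may be cheaper to drive by iteration over a
    // sparse representation (e.g. something like SortedVIntList).
    static boolean preferSparse(BitSet bits, int maxDoc) {
        double density = (double) bits.cardinality() / maxDoc;
        return density < 0.01;  // assumed threshold
    }

    // Close to 100% density: invert the filter and store the (now sparse)
    // complement, checking "not excluded" instead of "included".
    static boolean preferInvertedSparse(BitSet bits, int maxDoc) {
        double density = (double) bits.cardinality() / maxDoc;
        return density > 0.99;  // assumed threshold
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;

        BitSet sparse = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc += 1000) sparse.set(doc); // ~0.1% density

        BitSet dense = new BitSet(maxDoc);
        dense.set(0, maxDoc);
        dense.clear(12345); // nearly 100% density

        System.out.println(preferSparse(sparse, maxDoc));        // true
        System.out.println(preferSparse(dense, maxDoc));         // false
        System.out.println(preferInvertedSparse(dense, maxDoc)); // true
    }
}
```

Deletions could then be AND'd into whichever representation wins, so only one top-level filter remains to apply.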
> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1536.patch
>
>
> I ran some performance tests, comparing applying a filter via the
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to an iterator was a very sizable performance
> hit.
> Some notes on the test:
> * Index is the first 2M docs of Wikipedia. Test machine is Mac OS X
>   10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
> * I test across multiple queries. 1-X means an OR query, e.g. 1-4
>   means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, i.e. 1 AND 2
>   AND 3 AND 4. "u s" means "united states" (phrase search).
> * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>   95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>   100 (filter=null, control)).
> * Method high means I use the random-access filter API in
>   IndexSearcher's main loop. Method low means I use the random-access
>   filter API down in SegmentTermDocs (just like deleted docs today).
> * Baseline (QPS) is current trunk, where the filter is applied as an
>   iterator up "high" (i.e. in IndexSearcher's search loop).
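The two filter-application strategies the benchmark compares could be sketched roughly as follows. This is an illustrative toy, not Lucene code: the class and method names are invented, int arrays stand in for the scorer's and filter's doc-ID streams, and the iterator branch approximates the skipTo()-based leapfrogging the comment describes.

```java
import java.util.BitSet;

// Sketch of random-access vs. iterator-style filter application.
public class FilterApplicationSketch {

    // Random-access style: the query drives the loop, and each candidate
    // doc is checked against the filter bits directly (like deleted docs
    // today). One O(1) bit check per candidate.
    static int countHitsRandomAccess(int[] queryDocs, BitSet filterBits) {
        int hits = 0;
        for (int doc : queryDocs) {
            if (filterBits.get(doc)) hits++;
        }
        return hits;
    }

    // Iterator style: query and filter doc-ID streams leapfrog each other
    // via next()/skipTo(). When the scorer is asked to skipTo() a filter
    // doc it does not match, it internally advances past it -- for an
    // expensive scorer, likely wasted work the comment above describes.
    static int countHitsIterator(int[] queryDocs, int[] filterDocs) {
        int hits = 0, qi = 0, fi = 0;
        while (qi < queryDocs.length && fi < filterDocs.length) {
            if (queryDocs[qi] == filterDocs[fi]) {
                hits++; qi++; fi++;
            } else if (queryDocs[qi] < filterDocs[fi]) {
                qi++;  // scorer.skipTo(filterDocID)
            } else {
                fi++;  // filter.skipTo(scorerDocID)
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] queryDocs = {2, 5, 8, 11, 17};
        int[] filterDocs = {5, 11, 40};
        BitSet bits = new BitSet();
        for (int doc : filterDocs) bits.set(doc);
        System.out.println(countHitsRandomAccess(queryDocs, bits)); // 2
        System.out.println(countHitsIterator(queryDocs, filterDocs)); // 2
    }
}
```

Both strategies accept the same docs; the benchmark's question is purely which one is cheaper at a given filter density and query cost.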