[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated LUCENE-1536:
----------------------------------

    Attachment: changes-yonik-uwe.patch
                LUCENE-1536.patch
                LUCENE-1536-rewrite.patch

Attached you will find a new patch, LUCENE-1536.patch, incorporating Yonik's changes plus some minor improvements:
- Changed the Javadocs of DocIdSet.bits() to explain what you should and should not do.
- Added another early-exit condition in FilteredQuery#Weight.scorer(): as we already get the first matching doc of the filter iterator before looking at bits() or creating the query scorer, we should exit early if that first matching doc is DocIdSetIterator.NO_MORE_DOCS. This saves us from creating the query scorer at all.
- Removed Robert's safety TODO in SolrIndexSearcher; it no longer disables random access completely. After Yonik's changes, all places in Solr that are not random-access safe are disabled, e.g. SolrIndexSearcher.FilterImpl (not sure what this class does; maybe it should also implement bits()?) - we should do that in a Solr-specific optimization issue.

Another cool thing with filters is ANDing filters together without ChainedFilter (this approach is very effective with random access, as it does not allocate additional BitSets). If you want to AND together several filters and apply them to a Query, do the following:

{code:java}
IS.search(new FilteredQuery(query, filter2), filter1, ...);
{code}

You can chain in even more filters by adding more FilteredQueries. What this does: IS will automatically create another FilteredQuery to apply the filter and get the Weight of the top-level FilteredQuery. The scorer of this one will be top-level: it gets the filter and, if it is random access, executes the filter with acceptDocs == liveDocs. The resulting bits of this filter are passed as acceptDocs to Weight.scorer() of the second FilteredQuery. That one passes the acceptDocs (which are already filtered) on to its own Filter and, if that is again random access, passes the result as acceptDocs to the inner Query's scorer.
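The acceptDocs chaining described above can be sketched with a small self-contained model. Note this is illustrative code, not Lucene itself: the simplified Bits interface is modeled on org.apache.lucene.util.Bits, and applyFilter/fromArray are hypothetical helpers standing in for what a random-access Filter and Weight.scorer() do with acceptDocs.

```java
// Minimal model of acceptDocs chaining: each "filter" intersects its own
// random-access bits with the acceptDocs handed down from the level above,
// so the innermost scorer sees a single pre-ANDed Bits view and no
// wrapper scorers are needed.
public class AcceptDocsChaining {

    /** Simplified random-access view, modeled on org.apache.lucene.util.Bits. */
    interface Bits {
        boolean get(int doc);
        int length();
    }

    /** Models a random-access filter: ANDs its bits with the incoming acceptDocs. */
    static Bits applyFilter(final Bits filterBits, final Bits acceptDocs) {
        if (acceptDocs == null) {
            return filterBits; // top level: nothing has been filtered out yet
        }
        return new Bits() {
            public boolean get(int doc) { return filterBits.get(doc) && acceptDocs.get(doc); }
            public int length() { return filterBits.length(); }
        };
    }

    /** Convenience factory for fixed bit patterns. */
    static Bits fromArray(final boolean[] bits) {
        return new Bits() {
            public boolean get(int doc) { return bits[doc]; }
            public int length() { return bits.length; }
        };
    }

    public static void main(String[] args) {
        Bits liveDocs = fromArray(new boolean[] { true, true, true, true, true });
        Bits filter1  = fromArray(new boolean[] { true, false, true, true, false });
        Bits filter2  = fromArray(new boolean[] { true, true, false, true, false });

        // Outer FilteredQuery: filter1 runs with acceptDocs == liveDocs...
        Bits afterFilter1 = applyFilter(filter1, liveDocs);
        // ...and its result becomes the acceptDocs of the inner FilteredQuery.
        Bits acceptDocs = applyFilter(filter2, afterFilter1);

        // The inner query's scorer now only ever sees docs 0 and 3.
        for (int doc = 0; doc < acceptDocs.length(); doc++) {
            System.out.println(doc + " -> " + acceptDocs.get(doc));
        }
    }
}
```

Because every level only wraps a Bits view rather than a scorer, the composition is free of per-document iterator overhead - exactly why no additional BitSets need to be allocated.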
Finally, the top-level IS executes scorer.score(Collector), which is in fact the inner Query's scorer (no wrappers!) with all filtering applied through acceptDocs. This is incredibly cool :-)

One thing about large patches in an issue: if you are working on an issue, have local changes in your checkout, have posted a patch, and somebody else then posts an updated patch, it is often nice to see the diff between the two patches. I wanted to see what Yonik changed, but a 140 KB patch is not easy to handle. The trick is "interdiff" from the patchutils package: you can call "interdiff LUCENE-1536-original.patch LUCENE-1536-yonik.patch" and you get a patch containing only the changes applied by Yonik. This patch can even be applied to your local, already-patched checkout. The changes-yonik-uwe.patch was generated that way and shows what I changed in my last patch relative to Yonik's original.

> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch,
> LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch,
> LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch,
> LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> changes-yonik-uwe.patch, luceneutil.patch
>
>
> I ran some performance tests, comparing applying a filter via the
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to an iterator was a very sizable performance
> hit.
> Some notes on the test:
> * Index is the first 2M docs of Wikipedia. Test machine is Mac OS X
>   10.5.6, quad-core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
> * I test across multiple queries. 1-X means an OR query, e.g. 1-4
>   means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, i.e. 1 AND 2
>   AND 3 AND 4. "u s" means "united states" (phrase search).
> * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>   95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>   100 (filter=null, control)).
> * Method "high" means I use the random-access filter API in
>   IndexSearcher's main loop. Method "low" means I use the random-access
>   filter API down in SegmentTermDocs (just like deleted docs today).
> * Baseline (QPS) is current trunk, where the filter is applied as an
>   iterator up "high" (i.e. in IndexSearcher's search loop).