[ https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-5938:
---------------------------------
    Attachment: low_freq.tasks
                LUCENE-5938.patch

OK, I did something slightly different. It happens that all queries in the tasks file match a pretty large number of documents, which favors FixedBitSet. So now I've configured a threshold: FixedBitSet is used when more than maxDoc / 16384 docs match and SparseFixedBitSet is used otherwise. Since SparseFixedBitSet is much faster than FixedBitSet for such low densities, the cost to start by creating a SparseFixedBitSet and then upgrading to a FixedBitSet is negligible compared to starting with a FixedBitSet from the beginning (see http://people.apache.org/~jpountz/doc_id_sets2.html).

So now the benchmark looks better for those queries that match many documents:

{noformat}
IntNRQ      7.10  (6.3%)     6.57  (9.6%)   -7.4% ( -21% -    9%)
Prefix3   110.36 (14.8%)   109.88  (9.5%)   -0.4% ( -21% -   28%)
Wildcard   62.83 (14.5%)    66.93  (9.5%)    6.5% ( -15% -   35%)
{noformat}

I don't think the improvement with {{Wildcard}} is noise, I can reproduce it easily. I think the reason is that since the default is filter rewrite now, we don't have to compute the terms intersection twice, which is costly with wildcard queries.

I also wanted to see what happens with queries that match fewer documents compared to boolean rewrite, so I generated a set of wildcard queries that are expanded to a couple of terms and don't match too many documents (see tasks file attached):

{noformat}
Wildcard   99.90  (9.0%)   294.66 (30.6%)  194.9% ( 142% -  257%)
{noformat}

For such queries, the new default rewrite method looks much better.
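To make the threshold described above concrete, here is a minimal, self-contained sketch of the "start sparse, upgrade to dense" idea. It is not the code from the attached patch: the class name {{UpgradingDocIdSetBuilder}} and its methods are hypothetical, and {{java.util.TreeSet}} and {{java.util.BitSet}} are only stand-ins for SparseFixedBitSet and FixedBitSet. The one detail taken from the comment is the cut-over point, more than maxDoc / 16384 matching docs.

```java
import java.util.BitSet;
import java.util.TreeSet;

/**
 * Hypothetical sketch, not Lucene's API: collects doc IDs in a sparse
 * structure and switches to a dense bit set once more than
 * maxDoc / 16384 distinct docs have been added.
 */
final class UpgradingDocIdSetBuilder {
    private final int maxDoc;
    private final int threshold;                        // maxDoc / 16384
    private TreeSet<Integer> sparse = new TreeSet<>();  // stand-in for SparseFixedBitSet
    private BitSet dense;                               // stand-in for FixedBitSet

    UpgradingDocIdSetBuilder(int maxDoc) {
        this.maxDoc = maxDoc;
        this.threshold = maxDoc >>> 14;                 // 16384 == 1 << 14
    }

    /** Random write access: docs may be added in any order. */
    void add(int doc) {
        if (doc < 0 || doc >= maxDoc) {
            throw new IndexOutOfBoundsException("doc=" + doc + ", maxDoc=" + maxDoc);
        }
        if (dense != null) {
            dense.set(doc);
            return;
        }
        sparse.add(doc);
        if (sparse.size() > threshold) {
            upgrade();
        }
    }

    /** Copies the sparse content into a dense bit set, then drops the sparse one. */
    private void upgrade() {
        dense = new BitSet(maxDoc);
        for (int doc : sparse) {
            dense.set(doc);
        }
        sparse = null;
    }

    boolean contains(int doc) {
        return dense != null ? dense.get(doc) : sparse.contains(doc);
    }

    int cardinality() {
        return dense != null ? dense.cardinality() : sparse.size();
    }

    boolean isDense() {
        return dense != null;
    }

    int threshold() {
        return threshold;
    }
}
```

With maxDoc = 1 << 20 the threshold is 64, so the builder stays sparse for the first 64 distinct docs and upgrades when the 65th is added. The upgrade copies at most threshold + 1 entries, which is consistent with the observation above that the cost of upgrading is small next to allocating a dense set up front.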
> New DocIdSet implementation with random write access
> ----------------------------------------------------
>
>                 Key: LUCENE-5938
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5938
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>         Attachments: LUCENE-5938.patch, LUCENE-5938.patch, LUCENE-5938.patch,
>                      low_freq.tasks
>
>
> We have a great cost API that is supposed to help make decisions about how to
> best execute queries. However, due to the fact that several of our filter
> implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets,
> either we use the cost API and make bad decisions, or need to fall back to
> heuristics which are not as good such as
> RandomAccessFilterStrategy.useRandomAccess which decides that random access
> should be used if the first doc in the set is less than 100.
> On the other hand, we also have some nice compressed and cacheable DocIdSet
> implementation but we cannot make use of them because TermsFilter requires a
> DocIdSet that has random write access, and FixedBitSet is the only DocIdSet
> that we have that supports random access.
> I think it would be nice to replace FixedBitSet in those filters with another
> DocIdSet that would also support random write access but would have a better
> cost?