On Mon, Apr 3, 2017 at 6:25 PM, Adrien Grand <jpou...@gmail.com> wrote:
> Large boolean queries can cause a lot of random access as each sub clause
> is advanced one after the other. Even in the case that everything fits in
> the filesystem cache, the fact that the heap needs to be rebalanced after
> each document makes queries on many clauses slow. This is why we have
> TermInSetQuery (TermsQuery on 6.x): it has a more disk-friendly access
> pattern (1 seek per term per segment) and scales better with the number of
> terms. Unfortunately it does not only come with benefits and its main
> drawback is that it is always evaluated against the entire index. So if you
> intersect a very selective query (on an id field for instance) with a large
> TermInSetQuery, the TermInSetQuery will dominate the execution time for
> sure.

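To make the trade-off in the quoted text concrete, here is a toy sketch in plain Java. It is not Lucene code: sorted int arrays stand in for postings lists, and the class and method names (ClauseCostSketch, heapOps, sequentialFill) are made up for illustration. It contrasts merging many clauses through a heap, where work is done on the heap for every posting, with reading each list start to finish into a bitset, which is roughly the access pattern described for TermInSetQuery.

```java
import java.util.BitSet;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model only: int arrays stand in for postings lists. Not Lucene's implementation. */
final class ClauseCostSketch {

    /** Cursor over one sorted postings list. */
    private static final class Cursor {
        final int[] docs;
        int pos;
        Cursor(int[] docs) { this.docs = docs; }
        int doc() { return docs[pos]; }
    }

    /**
     * Heap-style disjunction: always advance the clause with the smallest doc id.
     * Records matches in 'out' and returns the number of heap operations
     * (polls plus re-inserts) that were needed.
     */
    static long heapOps(List<int[]> postings, BitSet out) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
        for (int[] p : postings) {
            if (p.length > 0) heap.add(new Cursor(p));
        }
        long ops = 0;
        while (!heap.isEmpty()) {
            Cursor top = heap.poll();          // smallest current doc id
            ops++;
            out.set(top.doc());
            top.pos++;
            if (top.pos < top.docs.length) {   // clause not exhausted: back on the heap
                heap.add(top);
                ops++;
            }
        }
        return ops;
    }

    /**
     * Term-set style: read each list sequentially into a bitset.
     * Returns the number of lists opened (one "seek" per non-empty list).
     * Note that every posting is read, whatever the rest of the query looks like.
     */
    static int sequentialFill(List<int[]> postings, BitSet out) {
        int seeks = 0;
        for (int[] p : postings) {
            if (p.length == 0) continue;
            seeks++;
            for (int d : p) out.set(d);
        }
        return seeks;
    }

    public static void main(String[] args) {
        List<int[]> postings =
            List.of(new int[]{1, 4, 9}, new int[]{2, 4}, new int[]{3, 9, 12});
        BitSet viaHeap = new BitSet();
        BitSet viaFill = new BitSet();
        System.out.println("heap operations: " + heapOps(postings, viaHeap));
        System.out.println("sequential seeks: " + sequentialFill(postings, viaFill));
        System.out.println("same matches: " + viaHeap.equals(viaFill));
    }
}
```

Both methods produce the same set of matching documents. The difference is where the work goes: the heap version pays per posting, while the sequential version pays one seek per list but always reads every list in full, which is why it cannot benefit from being intersected with a very selective clause.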
One such case we do have is searching on file digests, where all the
values are spread across the entire index, and the common prefixes don't
allow much of a win from things like automata. For those, though,
TermsQuery might still work.

The problem is more things like word lists, where one "word" might
analyse to multiple terms, making it a phrase query, which prevents using
TermsQuery. Collapsing it to some kind of conditional multi-phrase
query... yeah, I have no idea whether there is any sensible way to do it.

TX
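For what it's worth, the word-list problem can be sketched like this. This is plain Java with a made-up stand-in analyzer (toyAnalyze lowercases and splits on non-alphanumerics), not a real Lucene analysis chain, and the class name WordListSplitSketch is hypothetical. It only shows the partitioning step: entries that analyse to a single term could go into one term set, while entries that analyse to several terms are left over as phrases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Toy partition of a word list. The analyzer here is an assumption, not Lucene's. */
final class WordListSplitSketch {
    /** Entries that analysed to exactly one term. */
    final Set<String> singleTerms = new TreeSet<>();
    /** Entries that analysed to several terms, kept in order. */
    final List<List<String>> phrases = new ArrayList<>();

    /** Hypothetical stand-in for analysis: lowercase, split on non-alphanumerics. */
    static List<String> toyAnalyze(String entry) {
        List<String> tokens = new ArrayList<>();
        for (String t : entry.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Splits the word list by how many terms each entry analyses to. */
    static WordListSplitSketch split(List<String> wordList) {
        WordListSplitSketch result = new WordListSplitSketch();
        for (String entry : wordList) {
            List<String> tokens = toyAnalyze(entry);
            if (tokens.size() == 1) {
                result.singleTerms.add(tokens.get(0));
            } else if (tokens.size() > 1) {
                result.phrases.add(tokens);
            }
            // Entries that analyse to nothing are dropped.
        }
        return result;
    }

    public static void main(String[] args) {
        WordListSplitSketch r = split(List.of("Lucene", "wi-fi", "New York", "solr"));
        System.out.println("single terms: " + r.singleTerms);
        System.out.println("phrases: " + r.phrases);
    }
}
```

The single-term half is the part a TermsQuery could cover. The phrase half is the part that still needs one phrase clause per entry, so a large word list with many multi-term entries ends up back at the large boolean query this thread started with.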