eks dev wrote:
I agree we should warn about this in the javadocs... can you work
up a patch?
I'll give it a try, no promise when, changing job, moving...
I plan to run some tests to figure out the performance tradeoffs
here.
We switched to iterator access for a toplevel filter, as of
LUCENE-584, but from
LUCENE-1476 it's looking like except for fairly sparse filters,
random access is
much faster.
have a look at https://issues.apache.org/jira/browse/LUCENE-1436
this should be important for deletions case
OK.
I'll just keep dumping my thinking about it, maybe something
meaningful comes out, unfortunately not enough time to think deeper
or try it now..
No problem; keep it coming!
as long as we look at single term queries, high deletion density
cases should be faster with random access (or anything else) at
TermDocs level because we will be just propagating decision higher
up instead of "killing" document at TermDocs level. Cases with more
terms, disjunctions, are getting interesting, starting to feel speed-
up proportional to the number of intersectiong documents.
for Query (A OR B) we need to check if(deleted) condition #A + #B
times if we do it at TermDocs level, in filter case we need to do it
only
#(A\B) + #(B\A) + #(A AND B) and this number is smaller or equal
(worst case) than #A + #B
this is exactly the case that makes performance headaches.
Right! This is why I'm going to test the full matrix (different
queries X different sparseness filters).
We have two competing issues, constant time factor on skipTo() vs
get() and algorithmic enhancement due to saved checks. Balance
depends on Query and skipTo()/get() performance diff.
Maybe thinking along the "Filter with both options" lines, random
(optional support) and iterator?
Right: I'm guessing we eventually need a simplistic query optimizer
that would choose how to apply a top-level AND'd filtered (or AND NOT,
eg for deletions). If the filter is very sparse, it's probably best
to use iterator especially if filter is already iterator-friendly, eg
SortedVIntList.
At the end of a day, Filter works at API level with DocIdSet, not
DocIdSetIterator.... that would remove constant factor, the question
is this possible to add optional DocIdSet.get(int ) on current API
and use it for some specialized cases like this one.
Either add that random-access API, or make a RandomAccessDocIdSet,
or... something. Not sure yet. Ideally BooleanQuery (which we should
consolidate deletions & top-level filters under) should somehow ask
the filter if it's random access, and then drive the matching
accordingly.
also, math for conjunctions looks much better in filters
I'll test conjunctions too.
Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org