eks dev wrote:

I agree we should warn about this in the javadocs... can you work up a patch?

I'll give it a try, no promise when, changing job, moving...

I plan to run some tests to figure out the performance tradeoffs here.

We switched to iterator access for a toplevel filter, as of LUCENE-584, but from LUCENE-1476 it's looking like except for fairly sparse filters, random access is
much faster.

have a look at  https://issues.apache.org/jira/browse/LUCENE-1436
this should be important for deletions case

OK.

I'll just keep dumping my thinking about it, maybe something meaningful comes out, unfortunately not enough time to think deeper or try it now..

No problem; keep it coming!

as long as we look at single term queries, high deletion density cases should be faster with random access (or anything else) at TermDocs level because we will be just propagating decision higher up instead of "killing" document at TermDocs level. Cases with more terms, disjunctions, are getting interesting, starting to feel speed- up proportional to the number of intersectiong documents.

for Query (A OR B) we need to check if(deleted) condition #A + #B times if we do it at TermDocs level, in filter case we need to do it only

#(A\B) + #(B\A) + #(A AND B) and this number is smaller or equal (worst case) than #A + #B

this is exactly the case that makes performance headaches.

Right! This is why I'm going to test the full matrix (different queries X different sparseness filters).

We have two competing issues, constant time factor on skipTo() vs get() and algorithmic enhancement due to saved checks. Balance depends on Query and skipTo()/get() performance diff.

Maybe thinking along the "Filter with both options" lines, random (optional support) and iterator?

Right: I'm guessing we eventually need a simplistic query optimizer that would choose how to apply a top-level AND'd filtered (or AND NOT, eg for deletions). If the filter is very sparse, it's probably best to use iterator especially if filter is already iterator-friendly, eg SortedVIntList.

At the end of a day, Filter works at API level with DocIdSet, not DocIdSetIterator.... that would remove constant factor, the question is this possible to add optional DocIdSet.get(int ) on current API and use it for some specialized cases like this one.

Either add that random-access API, or make a RandomAccessDocIdSet, or... something. Not sure yet. Ideally BooleanQuery (which we should consolidate deletions & top-level filters under) should somehow ask the filter if it's random access, and then drive the matching accordingly.

also, math for conjunctions looks much better in filters

I'll test conjunctions too.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to