> I agree we should warn about this in the javadocs... can you work up a patch?

I'll give it a try, no promise when, changing job, moving...

> I plan to run some tests to figure out the performance tradeoffs here.
> 
> We switched to iterator access for a toplevel filter, as of LUCENE-584, but
> from LUCENE-1476 it's looking like except for fairly sparse filters, random
> access is much faster.

have a look at https://issues.apache.org/jira/browse/LUCENE-1436
this should be important for the deletions case

I'll just keep dumping my thinking about it, maybe something meaningful comes 
out, unfortunately not enough time to think deeper or try it now.. 

as long as we look at single term queries, high deletion density cases should
be faster with random access (or anything else) at TermDocs level, because with
a filter we would only be propagating the decision higher up instead of
"killing" the document at TermDocs level. Cases with more terms, disjunctions,
are getting interesting: the speed-up starts to look proportional to the
number of intersecting documents.

for Query (A OR B) we need to check the if(deleted) condition #A + #B times if
we do it at TermDocs level; in the filter case we need to do it only

#(A\B) + #(B\A) + #(A AND B) = #(A OR B) times, and this number is smaller than
or equal to (worst case) #A + #B

this is exactly the case that makes performance headaches.
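
A small sketch of that count, using java.util.BitSet as a stand-in for the postings of two terms (the class name, method names and doc ids are made up for illustration, this is not Lucene code):

```java
import java.util.BitSet;

public class DisjunctionDeleteChecks {
    // TermDocs level: every posting of every term is checked, so #A + #B checks.
    static int checksAtTermDocsLevel(BitSet a, BitSet b) {
        return a.cardinality() + b.cardinality();
    }

    // Filter level: every candidate hit is checked once, so #(A OR B) checks.
    static int checksAtFilterLevel(BitSet a, BitSet b) {
        BitSet union = (BitSet) a.clone();
        union.or(b);
        return union.cardinality();
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        BitSet b = new BitSet();
        for (int doc : new int[] {1, 3, 5, 7}) a.set(doc); // hypothetical postings of term A
        for (int doc : new int[] {3, 4, 5, 9}) b.set(doc); // hypothetical postings of term B
        System.out.println(checksAtTermDocsLevel(a, b)); // 8
        System.out.println(checksAtFilterLevel(a, b));   // 6
    }
}
```

The saving is exactly #(A AND B), the documents that both terms hit.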

We have two competing issues, constant time factor on skipTo() vs get() and 
algorithmic enhancement due to saved checks. Balance depends on Query and 
skipTo()/get() performance diff.

Maybe thinking along the "Filter with both options" lines, random access
(optional support) and iterator? At the end of the day, Filter works at API
level with DocIdSet, not DocIdSetIterator... that would remove the constant
factor. The question is whether it is possible to add an optional
DocIdSet.get(int) to the current API and use it for some specialized cases
like this one.
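
Roughly what I have in mind, as a self-contained sketch. None of these types are the real Lucene classes: AccessibleDocSet, DocIterator, supportsRandomAccess() and accepts() are hypothetical names, only there to show "iterator always, get(int) when the implementation can offer it":

```java
import java.util.BitSet;

public class OptionalRandomAccessSketch {
    /** Hypothetical stand-in for an iterator over matching doc ids. */
    interface DocIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        /** Returns the first doc id >= target, or NO_MORE_DOCS. */
        int advance(int target);
    }

    /** Hypothetical doc id set: iteration is mandatory, random access is optional. */
    interface AccessibleDocSet {
        DocIterator iterator();
        default boolean supportsRandomAccess() { return false; }
        default boolean get(int doc) { throw new UnsupportedOperationException(); }
    }

    /** Bit set backed implementation that can offer both access styles. */
    static final class BitSetDocSet implements AccessibleDocSet {
        private final BitSet bits;
        BitSetDocSet(BitSet bits) { this.bits = bits; }
        @Override public DocIterator iterator() {
            return target -> {
                int next = bits.nextSetBit(target);
                return next < 0 ? DocIterator.NO_MORE_DOCS : next;
            };
        }
        @Override public boolean supportsRandomAccess() { return true; }
        @Override public boolean get(int doc) { return bits.get(doc); }
    }

    /** Uses get() when the set offers it, otherwise falls back to the iterator. */
    static boolean accepts(AccessibleDocSet set, int doc) {
        if (set.supportsRandomAccess()) return set.get(doc);
        return set.iterator().advance(doc) == doc;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(2);
        bits.set(5);
        AccessibleDocSet fast = new BitSetDocSet(bits);
        AccessibleDocSet iterOnly = () -> new BitSetDocSet(bits).iterator();
        System.out.println(accepts(fast, 5));     // true, via get()
        System.out.println(accepts(iterOnly, 5)); // true, via advance()
        System.out.println(accepts(iterOnly, 3)); // false
    }
}
```

The caller decides per set which path to take, so sparse sets can stay iterator-only and dense ones (like deletions) can be probed directly.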

also, math for conjunctions looks much better in filters
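
Same kind of toy count for (A AND B), again with made-up names and doc ids. It assumes every posting visited at TermDocs level gets a deleted check and ignores what skipTo() saves, so the first number is only an upper bound:

```java
import java.util.BitSet;

public class ConjunctionDeleteChecks {
    // TermDocs level, no skipping assumed: at most #A + #B checks.
    static int upperBoundAtTermDocsLevel(BitSet a, BitSet b) {
        return a.cardinality() + b.cardinality();
    }

    // Filter level: only documents matching both terms are checked, #(A AND B).
    static int checksAtFilterLevel(BitSet a, BitSet b) {
        BitSet both = (BitSet) a.clone();
        both.and(b);
        return both.cardinality();
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        BitSet b = new BitSet();
        for (int doc : new int[] {1, 3, 5, 7}) a.set(doc); // hypothetical postings of term A
        for (int doc : new int[] {3, 4, 5, 9}) b.set(doc); // hypothetical postings of term B
        System.out.println(upperBoundAtTermDocsLevel(a, b)); // 8
        System.out.println(checksAtFilterLevel(a, b));       // 2
    }
}
```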

sorry for the noise, all said here is no more than thinking aloud and probably 
does not make much sense.

cheers, eks

  

----- Original Message ----
> From: Michael McCandless <luc...@mikemccandless.com>
> To: java-dev@lucene.apache.org
> Sent: Tuesday, 3 February, 2009 18:28:14
> Subject: Re: [jira] Created: (LUCENE-1533) Deleted documents as a Filter or 
> top level Query
> 
> 
> eks dev wrote:
> 
> > Thanks for confirming it.
> > 
> > That is good to know and I am sure there are good reasons for it
> > (performance). Anyhow, sounds like good mouse trap that probably deserves
> > a few comments in javadoc.
> > 
> > - From the fact that term exists in term dictionary one cannot conclude
> > that there are actual documents containing it (people using external IDs
> > and taking shortcut in checking if document exists in Index by checking
> > existence in term dictionary; Spell checkers that index terms from
> > index)...
> > 
> > - Stats are stale and change in time (I have seen comments about it 
> > somewhere)
> 
> I agree we should warn about this in the javadocs... can you work up a patch?
> 
> > As a luxury option (this all is really not a big deal), maybe an idea
> > would be to have some sort of lightweight optimize
> > "refreshStatsAndLexicon()" that just brings stats and term dict into
> > consistent state, without touching postings / stored fields and other
> > heavy things?
> 
> That's a neat idea.  We can't do this today (the terms dict is "write once"
> per segment), but with a small change to allow terms dict to be rewritten to
> a different generation file (like how deletes are handled) we could do this.
> Not sure how much it'd be used though (I don't remember users complaining
> about this on the lists, I think).
> 
> > Having this clarified, back to the original question, I am now 95% sure
> > "Deleted Docs as Filters" will be faster (for cases with more than one
> > term/Clause in Query) or equally fast for single term queries. 5%
> > uncertainty comes from skipTo() vs get(int i) performance diff. Imo, this
> > can be visible only for single term Queries in high density case, maybe
> > not even there...
> 
> I plan to run some tests to figure out the performance tradeoffs here.
> 
> We switched to iterator access for a toplevel filter, as of LUCENE-584, but
> from LUCENE-1476 it's looking like except for fairly sparse filters, random
> access is much faster.
> 
> So I plan to test applying a filter at the top-level w/ iterator (= trunk,
> baseline), applying filter at top-level w/ random-access, applying filter
> way at the bottom w/ random access (in SegmentTermDocs, just like deleted
> docs are done today), across different queries and different filter
> sparseness.
> 
> Mike
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org





