Re: [jira] Created: (LUCENE-1533) Deleted documents as a Filter or top level Query

Michael McCandless Tue, 03 Feb 2009 09:28:54 -0800


eks dev wrote:

Thanks for confirming it.
That is good to know and I am sure there are good reasons for it (performance). Anyhow, sounds like a good mouse trap that probably deserves a few comments in javadoc.
- From the fact that a term exists in the term dictionary one cannot conclude that there are actual documents containing it (people using external IDs and taking a shortcut in checking if a document exists in the index by checking existence in the term dictionary; spell checkers that index terms from the index)...
- Stats are stale and change in time (I have seen comments about it somewhere)
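
To make the trap above concrete, here is a minimal sketch of a toy segment in plain Java. It is not Lucene's API: the class ToySegment and its methods (addPosting, delete, termExists, docFreq, liveDocCount) are hypothetical names invented for illustration. It only assumes what the thread describes, namely that the terms dict is written once and deletes are tracked separately.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Toy model only, NOT Lucene's API: all names here are hypothetical.
// The term dictionary is written once; deletes live in a separate bit set.
class ToySegment {
    // term -> sorted doc ids (postings); never rewritten after indexing
    private final Map<String, int[]> termDict = new TreeMap<>();
    // one bit per deleted doc id
    private final BitSet deleted = new BitSet();

    void addPosting(String term, int... docIds) {
        termDict.put(term, docIds.clone());
    }

    void delete(int docId) {
        // Only the deletes bit set changes; termDict is left untouched.
        deleted.set(docId);
    }

    // The shortcut some applications take: is the term in the dictionary?
    boolean termExists(String term) {
        return termDict.containsKey(term);
    }

    // Stat as recorded at write time; it ignores later deletes.
    int docFreq(String term) {
        int[] postings = termDict.get(term);
        return postings == null ? 0 : postings.length;
    }

    // What the application actually wants: documents that are still live.
    int liveDocCount(String term) {
        int[] postings = termDict.get(term);
        if (postings == null) {
            return 0;
        }
        int live = 0;
        for (int doc : postings) {
            if (!deleted.get(doc)) {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        ToySegment seg = new ToySegment();
        seg.addPosting("id:42", 7);   // external ID indexed on doc 7
        seg.delete(7);                // doc 7 is deleted afterwards
        System.out.println(seg.termExists("id:42"));   // true
        System.out.println(seg.docFreq("id:42"));      // 1
        System.out.println(seg.liveDocCount("id:42")); // 0
    }
}
```

In this toy, the existence check and the stored doc frequency both still report the deleted document; only walking the postings against the deletes gives the live count.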

I agree we should warn about this in the javadocs... can you work up a patch?

As a luxury option (this all is really not a big deal), maybe an idea would be to have some sort of lightweight optimize "refreshStatsAndLexicon()" that just brings stats and term dict into a consistent state, without touching postings / stored fields and other heavy things?

That's a neat idea. We can't do this today (the terms dict is "write once" per segment), but with a small change to allow the terms dict to be rewritten to a different generation file (like how deletes are handled) we could do this. Not sure how much it'd be used though (I don't remember users complaining about this on the lists, I think).

Having this clarified, back to the original question, I am now 95% sure "Deleted Docs as Filters" will be faster (for cases with more than one term/Clause in Query) or equally fast for single term queries. 5% uncertainty comes from skipTo() vs get(int i) performance diff. Imo, this can be visible only for single term Queries in the high density case, maybe not even there...


I plan to run some tests to figure out the performance tradeoffs here.

We switched to iterator access for a top-level filter, as of LUCENE-584, but from LUCENE-1476 it's looking like, except for fairly sparse filters, random access is much faster.

So I plan to test applying a filter at the top-level w/ iterator (= trunk, baseline), applying filter at top-level w/ random-access, applying filter way at the bottom w/ random access (in SegmentTermDocs, just like deleted docs are done today), across different queries and different filter sparseness.
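
For readers unfamiliar with the two access patterns being compared, here is a rough sketch in plain Java. It does not use Lucene's classes: FilterAccessSketch and both method names are hypothetical, java.util.BitSet stands in for the filter, and a sorted int[] stands in for a term's postings. It shows only the shape of each approach, not where in the search stack the filter is applied, and says nothing about their real relative speed.

```java
import java.util.BitSet;

// Illustration only, NOT Lucene code: class and method names are hypothetical.
class FilterAccessSketch {

    // Iterator style: postings and filter are both advanced, each side
    // leapfrogging to the other's current doc (similar in spirit to skipTo()).
    static int countWithIterator(int[] postings, BitSet filter) {
        int count = 0;
        int i = 0;                       // position in the sorted postings
        int f = filter.nextSetBit(0);    // current filter doc, -1 when exhausted
        while (i < postings.length && f >= 0) {
            int doc = postings[i];
            if (doc == f) {
                count++;
                i++;
                f = filter.nextSetBit(f + 1);
            } else if (doc < f) {
                i++;                           // postings side catches up
            } else {
                f = filter.nextSetBit(doc);    // filter side skips to doc
            }
        }
        return count;
    }

    // Random-access style: walk the postings and test each doc with get(int),
    // the way a deleted-docs bit vector is consulted.
    static int countWithRandomAccess(int[] postings, BitSet filter) {
        int count = 0;
        for (int doc : postings) {
            if (filter.get(doc)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] postings = {1, 3, 5, 8, 13};
        BitSet filter = new BitSet();
        filter.set(3);
        filter.set(4);
        filter.set(8);
        filter.set(20);
        System.out.println(countWithIterator(postings, filter));     // 2
        System.out.println(countWithRandomAccess(postings, filter)); // 2
    }
}
```

Both methods accept the same documents; they differ only in how the filter is consulted, which is the variable the planned tests would isolate across filter densities.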


Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
