eks dev wrote:

Thanks for confirming it.

That is good to know, and I am sure there are good reasons for it (performance). Anyhow, it sounds like a good mouse trap that probably deserves a few comments in the javadoc.

- From the fact that a term exists in the term dictionary, one cannot conclude that there are actual documents containing it (this affects people who use external IDs and take a shortcut by checking the term dictionary to see whether a document exists in the index, and spell checkers that index terms from the index)...

- Stats are stale and change over time (I have seen comments about it somewhere)

I agree we should warn about this in the javadocs... can you work up a patch?
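
To make the first pitfall concrete, here is a minimal sketch using a toy in-memory segment. It is not Lucene's API: ToySegment, addDocument, delete, docFreq and liveDocCount are hypothetical names invented for illustration. It only assumes what was confirmed above, namely that a delete marks the document but leaves the term dictionary and its stats untouched.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only (hypothetical names, not Lucene classes).
class ToySegment {
    // term -> doc IDs; written at index time and never rewritten on delete
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // deletions are recorded separately as marked bits
    private final BitSet deleted = new BitSet();
    private int maxDoc = 0;

    // Adds a document; assumes each term is listed at most once per document.
    int addDocument(String... terms) {
        int docId = maxDoc++;
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
        }
        return docId;
    }

    // Marks the document as deleted without touching the postings.
    void delete(int docId) {
        deleted.set(docId);
    }

    // Dictionary-level statistic: ignores deletions, so it can be stale.
    int docFreq(String term) {
        List<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Walks the postings and skips deleted documents.
    int liveDocCount(String term) {
        List<Integer> docs = postings.get(term);
        if (docs == null) {
            return 0;
        }
        int live = 0;
        for (int doc : docs) {
            if (!deleted.get(doc)) {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment();
        int doc = segment.addDocument("id:42");
        segment.delete(doc);
        System.out.println("docFreq = " + segment.docFreq("id:42"));
        System.out.println("live    = " + segment.liveDocCount("id:42"));
    }
}
```

In this toy, a "does this external ID exist?" check based on docFreq alone keeps answering yes after the delete; only walking the postings while honoring deletions gives the right answer.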

As a luxury option (this all is really not a big deal), maybe an idea would be to have some sort of lightweight optimize, "refreshStatsAndLexicon()", that just brings the stats and term dict into a consistent state, without touching postings / stored fields and other heavy things?

That's a neat idea. We can't do this today (the terms dict is "write once" per segment), but with a small change to allow terms dict to be rewritten to a different generation file (like how deletes are handled) we could do this. Not sure how much it'd be used though (I don't remember users complaining about this on the lists, I think).

Having this clarified, back to the original question: I am now 95% sure "Deleted Docs as Filters" will be faster for queries with more than one term/clause, and equally fast for single-term queries. The 5% uncertainty comes from the skipTo() vs get(int i) performance difference. IMO, this can be visible only for single-term queries in the high-density case, and maybe not even there...

I plan to run some tests to figure out the performance tradeoffs here.

We switched to iterator access for a top-level filter as of LUCENE-584, but from LUCENE-1476 it's looking like random access is much faster, except for fairly sparse filters.

So I plan to test three setups across different queries and different filter sparseness: (1) applying a filter at the top level w/ iterator (= trunk, baseline), (2) applying a filter at the top level w/ random access, and (3) applying a filter way at the bottom w/ random access (in SegmentTermDocs, just like deleted docs are done today).
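
For readers following along, here is a rough sketch of the two access patterns being compared. It uses java.util.BitSet as a stand-in for the filter and is not the Lucene code under test: FilterAccessSketch and its methods are hypothetical names, a set bit is assumed to mean "document accepted", and the candidate doc IDs are assumed to be sorted in ascending order.

```java
import java.util.BitSet;

// Toy comparison of iterator-style vs random-access filtering (not Lucene code).
class FilterAccessSketch {

    // Iterator style: keep a cursor into the filter and advance it to each
    // candidate doc, similar in spirit to calling skipTo(doc).
    static int countWithIterator(int[] sortedDocs, BitSet filter) {
        int count = 0;
        int cursor = filter.nextSetBit(0);
        for (int doc : sortedDocs) {
            if (cursor >= 0 && cursor < doc) {
                cursor = filter.nextSetBit(doc); // jump to first accepted doc >= doc
            }
            if (cursor < 0) {
                break; // filter exhausted, no further matches possible
            }
            if (cursor == doc) {
                count++;
            }
        }
        return count;
    }

    // Random-access style: test each candidate directly, similar in spirit
    // to calling get(int i) on a bit set of accepted docs.
    static int countWithRandomAccess(int[] sortedDocs, BitSet filter) {
        int count = 0;
        for (int doc : sortedDocs) {
            if (filter.get(doc)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] docs = {1, 3, 5, 8, 13};
        BitSet filter = new BitSet();
        filter.set(3);
        filter.set(4);
        filter.set(8);
        filter.set(20);
        System.out.println(countWithIterator(docs, filter));
        System.out.println(countWithRandomAccess(docs, filter));
    }
}
```

Both methods return the same count; they differ only in how the filter is consulted, which is the cost the planned tests are meant to measure. The sketch says nothing about which one is faster in practice.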

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
