eks dev wrote:
> Thanks for confirming it.
>
> That is good to know, and I am sure there are good reasons for it
> (performance). Anyhow, it sounds like a good mouse trap that probably
> deserves a few comments in the javadoc:
>
> - From the fact that a term exists in the term dictionary, one cannot
>   conclude that there are actual documents containing it (people using
>   external IDs and taking a shortcut by checking whether a document
>   exists in the index via the term dictionary; spell checkers that
>   index terms from the index)...
> - Stats are stale and change over time (I have seen comments about it
>   somewhere)
I agree we should warn about this in the javadocs... can you work up a
patch?
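To make the trap concrete, here is a minimal, self-contained sketch. It is
a toy model, not Lucene code: ToySegment, addTerm, delete, docFreq and
hasLiveDoc are hypothetical names chosen for illustration. The only
assumptions are the ones stated in this thread, namely that the terms dict
is written once per segment while deletions are recorded separately.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Toy model of a single segment (hypothetical, not Lucene's classes). */
public class ToySegment {
    // term -> sorted doc ids; written once, never updated on delete
    private final Map<String, int[]> postings = new HashMap<>();
    // deletions are tracked on the side, one bit per doc id
    private final BitSet deleted = new BitSet();

    public void addTerm(String term, int... docIds) {
        postings.put(term, docIds);
    }

    public void delete(int docId) {
        deleted.set(docId);
    }

    /** Statistic as stored with the term: it still counts deleted docs. */
    public int docFreq(String term) {
        int[] docs = postings.get(term);
        return docs == null ? 0 : docs.length;
    }

    /** Walks the postings and skips deleted docs, so it reflects live docs only. */
    public boolean hasLiveDoc(String term) {
        int[] docs = postings.get(term);
        if (docs == null) {
            return false;
        }
        for (int doc : docs) {
            if (!deleted.get(doc)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ToySegment seg = new ToySegment();
        seg.addTerm("id:42", 7);   // external ID indexed as a term in doc 7
        seg.delete(7);             // doc 7 is deleted, the term stays in the dict
        System.out.println("docFreq    = " + seg.docFreq("id:42"));    // 1 (stale)
        System.out.println("hasLiveDoc = " + seg.hasLiveDoc("id:42")); // false
    }
}
```

In this model, an "does this ID exist?" check based on the term dictionary
alone answers yes for a deleted document, while walking the postings and
honoring deletions answers no.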
> As a luxury option (none of this is really a big deal), maybe an
> idea would be to have some sort of lightweight optimize,
> "refreshStatsAndLexicon()", that just brings the stats and term dict
> into a consistent state without touching postings, stored fields, and
> other heavy things?
That's a neat idea. We can't do this today (the terms dict is "write
once" per segment), but with a small change to allow terms dict to be
rewritten to a different generation file (like how deletes are
handled) we could do this. Not sure how much it'd be used though (I
don't remember users complaining about this on the lists, I think).
> Having this clarified, back to the original question: I am now 95%
> sure "Deleted Docs as Filters" will be faster (for cases with more
> than one term/clause in the query) or equally fast for single-term
> queries. The 5% uncertainty comes from the skipTo() vs. get(int i)
> performance difference. IMO, this can be visible only for single-term
> queries in the high-density case, maybe not even there...
I plan to run some tests to figure out the performance tradeoffs here.
We switched to iterator access for a top-level filter as of
LUCENE-584, but from LUCENE-1476 it's looking like, except for fairly
sparse filters, random access is much faster.
So I plan to test applying a filter at the top level w/ iterator (=
trunk, baseline), applying a filter at the top level w/ random access,
and applying a filter way at the bottom w/ random access (in
SegmentTermDocs, just like deleted docs are done today), across
different queries and different filter sparseness.
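For readers following along, the two access styles can be sketched as
below. This is only an illustration under simple assumptions, not the
benchmark code and not Lucene's implementation: FilterAccessSketch and
both method names are hypothetical, the postings are a plain sorted int
array, and the filter is a java.util.BitSet.

```java
import java.util.BitSet;

/** Toy comparison of iterator vs. random-access filtering (hypothetical names). */
public class FilterAccessSketch {

    /** Iterator style: leapfrog between the postings and the filter's set bits. */
    public static int countWithIterator(int[] postings, BitSet filter) {
        int count = 0;
        int i = 0;
        int f = filter.nextSetBit(0);          // -1 when the filter is empty
        while (i < postings.length && f >= 0) {
            int doc = postings[i];
            if (doc == f) {                    // both sides agree: a hit
                count++;
                i++;
                f = filter.nextSetBit(f + 1);
            } else if (doc < f) {              // postings are behind: advance them
                i++;
            } else {                           // filter is behind: skip it to doc
                f = filter.nextSetBit(doc);
            }
        }
        return count;
    }

    /** Random-access style: one bit test per posting, like a deleted-docs check. */
    public static int countWithRandomAccess(int[] postings, BitSet filter) {
        int count = 0;
        for (int doc : postings) {
            if (filter.get(doc)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] postings = {1, 3, 5, 8, 13};
        BitSet filter = new BitSet();
        filter.set(3);
        filter.set(4);
        filter.set(8);
        filter.set(20);
        System.out.println(countWithIterator(postings, filter));     // 2
        System.out.println(countWithRandomAccess(postings, filter)); // 2
    }
}
```

Both methods return the same hits; the difference is in the work done per
document. The iterator version pays for advancing two cursors but can jump
over long gaps when the filter is sparse, while the random-access version
does a single bit test per posting regardless of filter density.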
Mike