Hi Jim,

On Sun, Aug 2, 2009 at 1:32 AM, <oh...@cox.net> wrote:
> I first noticed the problem that I'm seeing while working on this latter app. 
>  Basically, what I noticed was that while I was adding 13 documents to the 
> index, when I listed the "path" terms, there were only 12 of them.

Field text (the whole "path" in your case) and terms (the tokens of
the field text) are different.

The StandardAnalyzer breaks up words like this...
Field text = "/a/b/c.txt"
Tokens = {"a","b","c","txt"}

So this 1 field of 1 document becomes 4 terms / tokens. (Strictly
speaking, a "token" is what the analyzer emits while analysing the
text and a "term" is the unique field/text entry that ends up in the
index, but for counting purposes here they amount to the same thing.)
Therefore, you're going to have more terms than documents initially,
but as the overlap in term usage increases this changes.

For instance, these 3 paths
"/a/b/c/d.txt", "/b/c/d/a.txt" and "/c/d/a/b.txt" still only amount to
5 distinct terms ("a", "b", "c", "d", "txt") between them, since they
share the same tokens.

In fact, StandardAnalyzer goes a bit further than that and removes
"stop-words" such as "a" (or "an", "the"), as it's designed for
general text searching, which would bring that down to 4.
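
If it's useful, here's roughly how you can see for yourself what
StandardAnalyzer makes of those paths -- just a sketch, mind: the
constructor and the token-attribute API have changed between Lucene
releases (older versions want a Version constant and TermAttribute
instead of CharTermAttribute), so adjust for whatever release you're
on. "path" is just your field name from above.

import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShowPathTokens {
    public static void main(String[] args) throws Exception {
        String[] paths = { "/a/b/c/d.txt", "/b/c/d/a.txt", "/c/d/a/b.txt" };

        // Recent releases take no arguments here; older ones want a
        // Version (and optionally a stop-word set).
        Analyzer analyzer = new StandardAnalyzer();

        Set<String> distinct = new TreeSet<String>();
        for (String path : paths) {
            TokenStream ts = analyzer.tokenStream("path", new StringReader(path));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(path + " -> " + term.toString());
                distinct.add(term.toString());
            }
            ts.end();
            ts.close();
        }
        analyzer.close();

        // Far fewer distinct terms than tokens emitted, because the paths
        // share tokens; stop-word removal and how the tokenizer treats the
        // "." decide the exact contents of the set.
        System.out.println("distinct terms: " + distinct);
    }
}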

That said, I think you have a point with the next part of your question...

> So then, I reviewed the index using Luke, and what I saw with that was that 
> there were indeed only 12 "path" terms (under "Term Count" on the left), but, 
> when I clicked the "Show Top Terms" in Luke, there were 13 terms listed by 
> Luke.

Yes, I just checked this and it does seem to be a bug in Luke: it
always shows 1 less in "Term Count" than it should. Well spotted.
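
For what it's worth, you can double-check the count without Luke by
enumerating the field's terms straight out of the index. Roughly (this
uses the old 2.x/3.x TermEnum API, which is probably what you're on;
4.0 and later go through Terms/TermsEnum instead, and the index
directory below is made up):

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CountPathTerms {
    public static void main(String[] args) throws Exception {
        // FSDirectory.open(File) is 2.9+; older 2.x uses FSDirectory.getDirectory(...)
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        IndexReader reader = IndexReader.open(dir);

        // Position the enumeration at the first "path" term, then walk
        // forward until we run off the end of that field.
        TermEnum terms = reader.terms(new Term("path", ""));
        int count = 0;
        try {
            do {
                Term t = terms.term();
                if (t == null || !"path".equals(t.field())) {
                    break;
                }
                count++;
                System.out.println(t.text());
            } while (terms.next());
        } finally {
            terms.close();
        }

        System.out.println(count + " distinct \"path\" terms");
        reader.close();
        dir.close();
    }
}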

Cheers,
Phil

