> Also, I don't understand why the encode/decode functions have a range of
> 7x10^9 to 2x10^-9, when it seems to me the most common values (with boosts set to 1.0) are somewhere between 0 and 1.0. When would somebody have a monster huge value like 7x10^9? Even with a huge index-time boost of 20.0 or so, why would encode/decode need a range as huge as the current implementation's?
I have often asked myself the same thing; I have just tried to avoid depending on the field norms where possible. For instance, if you keep your own array of how long each document's field is, you can boost documents however you want in your HitCollector by looking up the value in that array using the docId. That is the approach we have generally taken in our application.

You can find out how many terms are in each field by creating an array of length maxDoc and then iterating over all of the TermPositions for that field, remembering the maximum position you saw for each document (see the first sketch below).

This array is also useful for implementing exact phrase matching. Suppose someone wants documents that match *exactly* "Nissan Altima": you would do a phrase search for "Nissan Altima" and then just ignore any result that does not have exactly two terms in that field. For example, "Nissan Altima Standard" would match that query, but your array would tell you it has 3 terms, when you only care about results that have 2.

To do this you have to implement your own HitCollector object and use it instead of the "Hits" interface. To get an idea of how, you can look at the HitCollector that the Hits object itself uses (see the second sketch below).
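To make the term-counting idea concrete, here is a rough, untested sketch against the old TermEnum/TermPositions API; the class name FieldLengths and the method name build are just mine, not anything in Lucene:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermPositions;

public class FieldLengths {

    // Returns an array indexed by docId holding the number of terms in the
    // given field for each document, computed as (largest position seen + 1).
    public static int[] build(IndexReader reader, String field) throws IOException {
        int[] lengths = new int[reader.maxDoc()];
        TermEnum terms = reader.terms(new Term(field, ""));
        TermPositions positions = reader.termPositions();
        try {
            // Walk every term of the field and every posting of each term.
            while (terms.term() != null && terms.term().field().equals(field)) {
                positions.seek(terms.term());
                while (positions.next()) {
                    int doc = positions.doc();
                    for (int i = 0; i < positions.freq(); i++) {
                        int pos = positions.nextPosition();
                        if (pos + 1 > lengths[doc]) {
                            lengths[doc] = pos + 1; // positions are 0-based
                        }
                    }
                }
                if (!terms.next()) {
                    break;
                }
            }
        } finally {
            positions.close();
            terms.close();
        }
        return lengths;
    }
}

You only need to build this array once per index (or once per reopen), so the cost of the full scan is paid up front rather than at query time.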
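And here is roughly what the exact-match filtering could look like with a custom HitCollector instead of Hits. Again just a sketch: FieldLengths is the helper above, and the field name "model" and the lowercased terms are made-up examples that assume your analyzer lowercases at index time.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;

public class ExactPhraseSearch {

    public static List findExactNissanAltima(IndexReader reader) throws IOException {
        // Per-document term counts from the earlier sketch.
        final int[] lengths = FieldLengths.build(reader, "model");
        final List exactDocs = new ArrayList();

        // Ordinary phrase query; it will also match longer fields.
        PhraseQuery query = new PhraseQuery();
        query.add(new Term("model", "nissan"));
        query.add(new Term("model", "altima"));

        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                // "Nissan Altima Standard" matches the phrase but has 3 terms;
                // keep only documents whose field holds exactly the 2 phrase terms.
                if (lengths[doc] == 2) {
                    exactDocs.add(new Integer(doc));
                }
            }
        });
        searcher.close();
        return exactDocs;
    }
}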