OK, this & Andrzej's logic makes sense -- let's add it as an option,
but leave the default to the current approach of counting all tokens
towards length norm.
Mike
Nadav Har'El wrote:
On Sun, Oct 12, 2008, Michael McCandless wrote about "Re:
Similarity.lengthNorm and positionIncrement=0":
I agree we should make this possible. A field should not be
"penalized" just because many of its terms had synonyms.
I guess it won't do any harm to make this an option, but we need to do some
careful thinking before making this the default, or even encouraging it.
If we recall the rationale of length normalization, it is not to "penalize"
long documents, in the sense that users are less likely to want to see long
documents. Rather, the idea is that a long document contains more words -
more unique words and more repetitions of each word - so long documents are
more likely to match any query, and more likely to have higher scores for
each query. If you don't do length normalization, (almost) no matter what
search you perform, you'll get the longest documents back, rather than the
really best-matching documents. This is why length normalization is
necessary.
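To make the point above concrete, here is a small toy sketch in plain Java (no
Lucene dependency). It assumes a score of sqrt(termFreq), optionally multiplied
by a length norm of 1/sqrt(numTerms), which are the formulas DefaultSimilarity
uses for tf and lengthNorm. It ignores idf, boosts and the lossy one-byte norm
encoding, and the class name and document sizes are made up for illustration.

```java
public class LengthNormDemo {
    // Length norm in the style of DefaultSimilarity: 1 / sqrt(number of terms).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Toy score: sqrt(term frequency), optionally scaled by the length norm.
    // Assumption: idf, boosts and norm quantization are left out on purpose.
    static double score(int termFreq, int numTerms, boolean normalize) {
        double tf = Math.sqrt(termFreq);
        return normalize ? tf * lengthNorm(numTerms) : tf;
    }

    public static void main(String[] args) {
        // Hypothetical short document: 100 terms, query term occurs 4 times.
        // Hypothetical long document: 10000 terms, query term occurs 9 times.
        double shortRaw = score(4, 100, false);
        double longRaw = score(9, 10000, false);
        double shortNormed = score(4, 100, true);
        double longNormed = score(9, 10000, true);

        // Without normalization the long document ranks first.
        System.out.println("raw:        long > short? " + (longRaw > shortRaw));
        // With normalization the denser short document ranks first.
        System.out.println("normalized: short > long? " + (shortNormed > longNormed));
    }
}
```

The long document wins the unnormalized comparison only because it has more
room for repetitions; once the norm is applied, the short document, where the
term is much denser, comes out ahead.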
Now, if we do synonym expansion during indexing, the document *really*
becomes longer - it now (possibly) contains more unique words and more
repetitions thereof. So it actually makes sense, I think, to count these
synonyms as well, and not try to avoid it.
But you're right - if we're not talking about real synonyms, but rather
variants which will *never* be used in the same query (ASCII vs. accented
in your case), it does make sense not to count them twice, so it might
indeed be useful to have this proposed behavior as an option.
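As a rough illustration of what such an option would change, here is a sketch
in plain Java (no Lucene classes). It represents a field only by the position
increments of its tokens, and the flag name discountOverlaps, the class name and
the sample field are hypothetical, not an existing API. With the flag off, every
token counts towards the length (the current behavior); with it on, tokens with
positionIncrement=0 are skipped.

```java
public class DiscountOverlapsSketch {
    // Count the tokens that contribute to the field length.
    // discountOverlaps is a hypothetical flag: when true, tokens stacked on the
    // previous position (positionIncrement == 0) are not counted.
    static int countTerms(int[] positionIncrements, boolean discountOverlaps) {
        int numTerms = 0;
        for (int increment : positionIncrements) {
            if (discountOverlaps && increment == 0) {
                continue; // variant at the same position, e.g. the ASCII form
            }
            numTerms++;
        }
        return numTerms;
    }

    // Length norm in the style of DefaultSimilarity: 1 / sqrt(number of terms).
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        // Hypothetical field: 4 words, each followed by a stacked variant.
        int[] increments = {1, 0, 1, 0, 1, 0, 1, 0};

        int all = countTerms(increments, false);       // 8
        int discounted = countTerms(increments, true); // 4

        System.out.println("count all tokens:  " + all + " -> norm " + lengthNorm(all));
        System.out.println("discount overlaps: " + discounted + " -> norm " + lengthNorm(discounted));
    }
}
```

In this made-up case the field's length doubles when the variants are counted,
so its norm drops by a factor of sqrt(2) even though a reader would see the
same four words.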
Anyway, this is just my opinion (not backed by any hard research or
experimentation), so it might be wrong.
--
Nadav Har'El                        | Monday, Oct 13 2008, 14 Tishri 5769
IBM Haifa Research Lab              |-----------------------------------------
http://nadav.harel.org.il           |Windows-2000/Professional isn't.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]