On Sun, Oct 12, 2008, Michael McCandless wrote about "Re: Similarity.lengthNorm and positionIncrement=0":
> I agree we should make this possible. A field should not be
> "penalized" just because many of its terms had synonyms.

I guess it won't do any harm to make this an option, but we need to do some careful thinking before making this the default, or even encouraging it.

If we recall the rationale of length normalization, it is not to "penalize" long documents, in the sense that users are less likely to want to see long documents. Rather, the idea is that a long document contains more words - more unique words and more repetitions of each word - so long documents are more likely to match any query, and more likely to have higher scores for each query. If you don't do length normalization, then (almost) no matter what search you perform, you'll get the longest documents back, rather than the really best-matching documents. This is why length normalization is necessary.

Now, if we do synonym expansion during indexing, the document *really* becomes longer - it now (possibly) contains more unique words and more repetitions thereof. So it actually makes sense, I think, to count these synonyms too, and not try to avoid it.

But you're right - if we're not talking about real synonyms, but rather variants which will *never* be used in the same query (ASCII vs. accented in your case), it does make sense not to count them twice, so it might indeed be useful to have this proposed behavior as an option.

Anyway, this is just my opinion (not backed by any hard research or experimentation), so it might be wrong.

-- 
Nadav Har'El                        | Monday, Oct 13 2008, 14 Tishri 5769
IBM Haifa Research Lab              |-----------------------------------------
                                    |Windows-2000/Professional isn't.
http://nadav.harel.org.il           |
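
To make the trade-off above concrete, here is a minimal sketch in plain Python (not Lucene code). It assumes the norm has the form 1/sqrt(number of tokens), and it models a stacked variant as a token with position increment 0. The names `field_length`, `length_norm`, and `count_overlaps` are hypothetical and exist only for this illustration.

```python
import math

# Each token is (term, position_increment). An increment of 0 means the token
# was stacked on the previous position, e.g. a synonym or accent-folded variant.
tokens = [("cafe", 1), ("café", 0), ("in", 1), ("paris", 1)]

def field_length(tokens, count_overlaps):
    """Number of tokens that contribute to the length norm.

    count_overlaps is a hypothetical switch for this sketch, not a Lucene API.
    """
    if count_overlaps:
        return len(tokens)
    # Skip tokens that share a position with the previous token.
    return sum(1 for _term, increment in tokens if increment > 0)

def length_norm(num_tokens):
    """Assumed 1/sqrt(n) normalization; returns 0.0 for an empty field."""
    return 1.0 / math.sqrt(num_tokens) if num_tokens > 0 else 0.0

for count_overlaps in (True, False):
    n = field_length(tokens, count_overlaps)
    print(f"count_overlaps={count_overlaps}: length={n}, norm={length_norm(n):.3f}")
```

With the stacked variant counted, the field is treated as four tokens long and gets a smaller norm. With it skipped, the field counts as three tokens, so it is no longer scored lower merely for having been expanded. Which of the two is right depends on whether the variants can ever co-occur in a query, as discussed above.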