Re: Similarity.lengthNorm and positionIncrement=0

Michael McCandless Sun, 12 Oct 2008 11:26:47 -0700

I agree we should make this possible. A field should not be "penalized" just because many of its terms had synonyms.

In your proposed method addition to Similarity, below, numOverlappingTokens would count the number of tokens that had positionIncrement==0? And then that default impl is fully backwards compatible since it falls back to the current approach of counting the overlapping tokens when computing lengthNorm?

Maybe in 3.0 we should then switch it to not count overlapping tokens by default.


Mike

Andrzej Bialecki wrote:

Hi all,
I'm using analyzers that insert several tokens at the same position (positionIncrement=0), and I noticed that the calculation of lengthNorm takes into account all tokens, regardless of their position.
Example:
        - input string: "tree houses"
        - analyzed:     tree, houses|house
        - lengthNorm(field, 3)

        - input string: "tree house"
        - analyzed:     tree, house
        - lengthNorm(field, 2)
This, however, leads to some counter-intuitive results: for a query "tree" the second document will have a higher score, i.e. the first document will be penalized for the additional terms at the same positions.
Arguably this should not happen, i.e. additional terms inserted at the same positions should be treated as an artificial construct equivalent in length to a single token, and not intended to increase the length of the field, but rather to increase the probability of a successful match.
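To put the penalty in numbers, here is a minimal sketch. It assumes the default 1/sqrt(numTokens) formula of DefaultSimilarity; the class name LengthNormExample is made up for illustration and is not part of Lucene.

```java
public class LengthNormExample {

    // Assumption: same formula as DefaultSimilarity.lengthNorm,
    // i.e. 1 / sqrt(number of tokens in the field).
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        // "tree houses" -> tree, houses|house : 3 tokens counted
        float withOverlap = lengthNorm(3);
        // "tree house" -> tree, house : 2 tokens counted
        float plain = lengthNorm(2);

        System.out.println("tree houses: " + withOverlap);
        System.out.println("tree house:  " + plain);
    }
}
```

The first field gets a norm of roughly 0.577 and the second roughly 0.707, so for the query "tree" the document with the extra stemmed token scores lower, even though both fields cover two positions.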
[Side-note: The actual use case is more complicated, because it involves using accent-stripping filters that insert additional pure-ASCII tokens, and using different analyzers at index and query time. Users are allowed to make queries using either accented or ASCII input, and they should get comparable scores from documents with pure ASCII fields (no additional tokens) and from accented fields (many additional tokens with ascii|accented|stemmed variants).]
On the other hand, if someone were to submit a query 'house OR houses', using an analyzer that doesn't perform stemming, the first document should have a higher score than the second (and this is already ensured by the fact that two terms match instead of one), but this score should be mitigated by the increased length to reflect the fact that there are more terms in total in this field ...
Current behavior can be changed by changing DocInverterPerField so that it increments fieldState.length only for tokens with positionIncrement > 0. This could be controlled by an option - IMHO conceptually this option belongs to Similarity, and should be specific to a field, so perhaps a new method in Similarity like this would do:
        public float lengthNorm(String fieldName,
                 int numTokens, int numOverlappingTokens) {
                // Default: ignore the overlap count, so existing behavior is unchanged.
                return lengthNorm(fieldName, numTokens);
        }
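
A self-contained sketch of how this could fit together, without depending on Lucene. Everything here is hypothetical: the class OverlapAwareNorm, the helper countTokens (a stand-in for the counting that DocInverterPerField would do), and the variant lengthNormIgnoringOverlaps, which shows what an overriding Similarity might return. The two-argument lengthNorm assumes the default 1/sqrt(numTokens) formula.

```java
public class OverlapAwareNorm {

    // Hypothetical stand-in for the indexing-side counting: returns
    // {numTokens, numOverlappingTokens} for a stream of position increments.
    static int[] countTokens(int[] positionIncrements) {
        int numTokens = 0;
        int numOverlappingTokens = 0;
        for (int increment : positionIncrements) {
            numTokens++;
            if (increment == 0) {
                numOverlappingTokens++; // token stacked on the previous position
            }
        }
        return new int[] {numTokens, numOverlappingTokens};
    }

    // Assumption: the default 1/sqrt(numTokens) formula.
    static float lengthNorm(String fieldName, int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // Backward-compatible default: the overlap count is ignored.
    static float lengthNorm(String fieldName, int numTokens,
                            int numOverlappingTokens) {
        return lengthNorm(fieldName, numTokens);
    }

    // What an override might do: count only tokens that advance the position.
    static float lengthNormIgnoringOverlaps(String fieldName, int numTokens,
                                            int numOverlappingTokens) {
        return lengthNorm(fieldName, numTokens - numOverlappingTokens);
    }

    public static void main(String[] args) {
        // "tree houses" -> tree(+1), houses(+1), house(+0)
        int[] counts = countTokens(new int[] {1, 1, 0});
        System.out.println("default:  "
                + lengthNorm("body", counts[0], counts[1]));
        System.out.println("override: "
                + lengthNormIgnoringOverlaps("body", counts[0], counts[1]));
    }
}
```

With the override, "tree houses" is normalized as a two-token field, the same as "tree house", while the default keeps today's three-token result.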

What do you think?

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


