Similarity.lengthNorm and positionIncrement=0

Andrzej Bialecki Tue, 07 Oct 2008 13:15:01 -0700

Hi all,

I'm using analyzers that insert several tokens at the same position (positionIncrement=0), and I noticed that the calculation of lengthNorm takes into account all tokens, regardless of their position.


Example:
        - input string: "tree houses"
        - analyzed:     tree, houses|house
        - lengthNorm(field, 3)

        - input string: "tree house"
        - analyzed:     tree, house
        - lengthNorm(field, 2)

This, however, leads to some counter-intuitive results: for a query "tree" the second document will have a higher score, i.e. the first document will be penalized for the additional terms at the same positions.
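
To make the arithmetic concrete, here is a minimal sketch, assuming the default formula 1/sqrt(numTokens) that DefaultSimilarity uses for lengthNorm. The class below is a hypothetical stand-in, not Lucene code, and it ignores the byte encoding of norms and all other scoring factors:

```java
// Hypothetical stand-in for the default length normalization; not Lucene code.
class LengthNormDemo {

    // Assumes the default formula: 1 / sqrt(number of tokens in the field).
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        float treeHouses = lengthNorm(3); // tree, houses|house
        float treeHouse = lengthNorm(2);  // tree, house
        System.out.println("\"tree houses\": " + treeHouses);
        System.out.println("\"tree house\":  " + treeHouse);
        System.out.println("first document penalized: " + (treeHouses < treeHouse));
    }
}
```

With everything else equal for the term "tree", the field counted as 3 tokens gets a norm of roughly 0.58 against roughly 0.71 for the field counted as 2 tokens, which is where the lower score comes from.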

Arguably this should not happen, i.e. additional terms inserted at the same positions should be treated as an artificial construct equivalent in length to a single token, and not intended to increase the length of the field, but rather to increase the probability of a successful match.

[Side-note: The actual use case is more complicated, because it involves using accent-stripping filters that insert additional pure-ASCII tokens, and using different analyzers at index and query time. Users are allowed to make queries using either accented or ASCII input, and they should get comparable scores from documents with pure ASCII fields (no additional tokens) and from accented fields (many additional tokens with ascii|accented|stemmed variants).]

On the other hand, if someone were to submit a query 'house OR houses', using an analyzer that doesn't perform stemming, the first document should have a higher score than the second (and this is already ensured by the fact that two terms match instead of one), but this score should be mitigated by the increased length to reflect the fact that there are more terms in total in this field ...

Current behavior can be changed by changing DocInverterPerField so that it increments fieldState.length only for tokens with positionIncrement > 0. This could be controlled by an option - IMHO conceptually this option belongs to Similarity, and should be specific to a field, so perhaps a new method in Similarity like this would do:


        // Default keeps the current behavior: the overlap count is ignored.
        public float lengthNorm(String fieldName,
                        int numTokens, int numOverlappingTokens) {
                return lengthNorm(fieldName, numTokens);
        }
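
A rough sketch of how the two pieces might fit together, under stated assumptions: the counting loop stands in for the change in DocInverterPerField, the array of position increments stands in for a real token stream, and the three-argument method simply subtracts the overlapping tokens before applying 1/sqrt(n). All class and method names here are hypothetical and none of this is existing Lucene API:

```java
// Hypothetical sketch; none of this is part of Lucene.
class OverlapAwareNormSketch {

    // Existing-style norm, assuming the default 1 / sqrt(numTokens).
    static float lengthNorm(String fieldName, int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // Proposed variant: tokens stacked on an existing position add no length.
    static float lengthNorm(String fieldName, int numTokens, int numOverlappingTokens) {
        return lengthNorm(fieldName, numTokens - numOverlappingTokens);
    }

    // Counts tokens whose positionIncrement is 0, i.e. tokens that share
    // a position with the preceding token.
    static int countOverlaps(int[] positionIncrements) {
        int overlaps = 0;
        for (int increment : positionIncrements) {
            if (increment == 0) {
                overlaps++;
            }
        }
        return overlaps;
    }

    public static void main(String[] args) {
        int[] treeHouses = {1, 1, 0}; // tree, houses, house (stacked on houses)
        int[] treeHouse = {1, 1};     // tree, house
        float first = lengthNorm("body", treeHouses.length, countOverlaps(treeHouses));
        float second = lengthNorm("body", treeHouse.length, countOverlaps(treeHouse));
        // Both norms are computed from an effective length of 2.
        System.out.println("first=" + first + " second=" + second);
    }
}
```

A Similarity that wants the current behavior would keep delegating to the two-argument method with the full token count, as in the proposed default above.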

What do you think?

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

