Hi all,
I'm using analyzers that insert several tokens at the same position
(positionIncrement=0), and I noticed that the lengthNorm calculation
takes all tokens into account, regardless of their position increment
(a minimal filter sketch follows the example below).
Example:
- input string: "tree houses"
- analyzed: tree, houses|house
- lengthNorm(field, 3)
- input string: "tree house"
- analyzed: tree, house
- lengthNorm(field, 2)
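For context, this is roughly the kind of filter that produces such
stacked tokens. It's only a minimal sketch against the newer
attribute-based TokenStream API; the class name and the hard-coded
houses -> house pair are invented for the example above - a real
stemming or synonym filter would of course be dictionary-driven:

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

  /** Injects "house" at the same position as "houses" (posIncr = 0). */
  public final class HouseVariantFilter extends TokenFilter {
    private final CharTermAttribute termAtt =
        addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt =
        addAttribute(PositionIncrementAttribute.class);
    private State savedState;         // captured state of the original token
    private boolean pendingVariant;   // emit the stacked variant next?

    public HouseVariantFilter(TokenStream input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (pendingVariant) {
        // Re-emit the captured token, overwrite its text with the variant,
        // and stack it at the same position as the original.
        restoreState(savedState);
        termAtt.setEmpty().append("house");
        posIncrAtt.setPositionIncrement(0);
        pendingVariant = false;
        return true;
      }
      if (!input.incrementToken()) {
        return false;
      }
      if ("houses".contentEquals(termAtt)) {
        savedState = captureState();
        pendingVariant = true;
      }
      return true;
    }

    @Override
    public void reset() throws IOException {
      super.reset();
      pendingVariant = false;
      savedState = null;
    }
  }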
This, however, leads to counter-intuitive results: for the query "tree"
the second document gets a higher score, i.e. the first document is
penalized for the additional terms stacked at the same positions.
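To put numbers on it, assuming DefaultSimilarity, whose
lengthNorm(field, numTokens) is 1/sqrt(numTokens) (and ignoring the
lossy norm byte encoding):

  float normStacked = (float) (1.0 / Math.sqrt(3)); // "tree houses" -> ~0.577
  float normPlain   = (float) (1.0 / Math.sqrt(2)); // "tree house"  -> ~0.707
  // For the query "tree" both documents get the same tf/idf contribution,
  // so "tree house" wins purely on its smaller field length, even though
  // the extra token in "tree houses" never advanced a position.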
Arguably this should not happen: additional terms inserted at the same
position are an artificial construct, equivalent in length to a single
token. They are not meant to increase the length of the field, but to
increase the probability of a successful match.
[Side-note: The actual use case is more complicated, because it involves
accent-stripping filters that insert additional pure-ASCII tokens, and
different analyzers at index and query time. Users are allowed to make
queries using either accented or ASCII input, and they should get
comparable scores from documents with pure-ASCII fields (no additional
tokens) and from accented fields (many additional tokens with
ASCII|accented|stemmed variants).]
On the other hand, if someone were to submit the query 'house OR houses'
using an analyzer that doesn't perform stemming, the first document
should score higher than the second (which is already ensured by the
fact that two terms match instead of one), but this score should be
mitigated by the increased length, to reflect the fact that the field
contains more terms in total ...
The current behavior can be changed by modifying DocInverterPerField so
that it increments fieldState.length only for tokens with
positionIncrement > 0. This could be controlled by an option; IMHO this
option conceptually belongs in Similarity and should be field-specific,
so perhaps a new method in Similarity like this would do:
  public float lengthNorm(String fieldName,
                          int numTokens, int numOverlappingTokens) {
    // the default implementation preserves the current behavior
    return lengthNorm(fieldName, numTokens);
  }
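With that in place, a field-aware Similarity could discount the stacked
tokens. A hypothetical subclass (assuming DefaultSimilarity as the base
and that DocInverterPerField passes the overlap count through) might
look like:

  public class OverlapAwareSimilarity extends DefaultSimilarity {
    @Override
    public float lengthNorm(String fieldName, int numTokens,
                            int numOverlappingTokens) {
      // Count only the tokens that actually advanced the position,
      // i.e. ignore the stacked ascii/accented/stemmed variants.
      return lengthNorm(fieldName, numTokens - numOverlappingTokens);
    }
  }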
What do you think?
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com