I am having trouble getting collection probabilities for a term to show up
in a CustomScoreQuery/CustomScoreProvider. Basically, I am trying to add a
per-document weight that amounts to the sum (for each term in the query) of
Math.log(collectionProbability). Can anyone help with this?
Or feel free to suggest a better way to do this. Here's a description...
-----
LMDirichletSimilarity is not consistent with the original equations, as
many have noted. Here is how it differs under two approaches:
1. *Swap in LMDirichletSimilarity* in place of some other similarity, but
modify the scoring function. Ignoring the boost, it is currently
implemented as:
    term_score_current = Math.log(1 + freq / (mu * collectionProbability))
                       + Math.log(mu / (docLen + mu))
If you do this, there are two problems. The first problem is that the
score is off by an additive term of Math.log(collectionProbability). Do the math
<http://en.wikipedia.org/wiki/List_of_logarithmic_identities>: if you add
that in, the mu * collectionProbability terms cancel and you get exactly
the original formulation (e.g., in Zhai and Lafferty 2001). For reference,
the original looks like:
    term_score_official = Math.log((freq + mu * collectionProbability) /
                                   (docLen + mu))
If you add that factor, though, the second problem arises. That
Math.log(collectionProbability) factor does not get added for terms that
don't MATCH with a document because .score() doesn't get called if there's
no MATCH. This is basically the problem that Ronan Cummins wrote about a
few weeks ago.
2. *Leave LMDirichletSimilarity as it is* but *add a factor* to every final
score that is returned. (Note: you'd also need to remove the
non-negative score restriction in LMDirichletSimilarity.) This factor would
be the sum of the log collection probabilities for each query term:
    query_score = sum(term_score_current)
                + sum(Math.log(collectionProbability))
As some have mentioned, this is basically an additive version of a
queryNorm. It seems like the right way to do this is to wrap each query in
a modified CustomScoreQuery accessing a CustomScoreProvider, which would
then add that "constant" factor across all documents. However, this
"constant" factor needs to be computed from statistics; how can this be
done? Those statistics are available in LMDirichletSimilarity, but it is
less clear how to find those statistics directly from a Query object.
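
To make the quantities above concrete, here is a small self-contained Java
sketch (plain Math.log, no Lucene classes). It checks numerically that
term_score_current + Math.log(collectionProbability) equals
term_score_official for a single matching term, and it computes the
document-independent sum from option 2. The class name DirichletCheck, the
method names, and the sample numbers are made up for illustration. The
collection probability is assumed to be the plain ratio
totalTermFreq / numFieldTokens, which may differ from the smoothing the
similarity actually applies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DirichletCheck {

    // Per-term score as currently implemented (boost ignored).
    static double termScoreCurrent(double freq, double docLen, double mu, double p) {
        return Math.log(1 + freq / (mu * p)) + Math.log(mu / (docLen + mu));
    }

    // Per-term score in the original formulation.
    static double termScoreOfficial(double freq, double docLen, double mu, double p) {
        return Math.log((freq + mu * p) / (docLen + mu));
    }

    // Document-independent term from option 2: the sum of log collection
    // probabilities over the query terms. Assumption: the probability is
    // totalTermFreq / numFieldTokens with no smoothing, so a term that never
    // occurs in the collection contributes negative infinity.
    static double queryConstant(Map<String, Long> totalTermFreq, long numFieldTokens) {
        double sum = 0.0;
        for (long ttf : totalTermFreq.values()) {
            sum += Math.log((double) ttf / numFieldTokens);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical numbers for one matching term.
        double freq = 3, docLen = 100, mu = 2000, p = 0.001;
        double current = termScoreCurrent(freq, docLen, mu, p);
        double official = termScoreOfficial(freq, docLen, mu, p);
        System.out.println("current + log(p) = " + (current + Math.log(p)));
        System.out.println("official         = " + official);

        // Hypothetical collection counts for a two-term query.
        Map<String, Long> ttf = new LinkedHashMap<>();
        ttf.put("lucene", 10L);
        ttf.put("search", 100L);
        System.out.println("query constant   = " + queryConstant(ttf, 1000L));
    }
}
```

The open question remains how to obtain the per-term counts that feed
queryConstant from inside a CustomScoreProvider.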
stephen