Stats in CustomScoreProvider + (in)correctness of LMDirichletSimilarity

Stephen Wu Sat, 02 May 2015 13:36:00 -0700

I am having trouble getting collection probabilities for a term to show up
in a CustomScoreQuery/CustomScoreProvider.  Basically, I am trying to add a
per-document weight that amounts to the sum (for each term in the query) of
Math.log(collectionProbability).  Can anyone help with this?


Or feel free to suggest a better way to do this.  Here's a description...

-----
LMDirichletSimilarity is not consistent with the original equations, as
many have noted.  Here's how it's different under two possible approaches:

1. *Swap in LMDirichletSimilarity* in place of some other similarity, but
modify the scoring function.  Ignoring the boost, it is currently
implemented as:
    term_score_current = Math.log(1 + freq /
        (mu * collectionProbability)) +
        Math.log(mu / (docLen + mu))

If you do this, there are two problems.  The first problem is that the
score is off by an additive term of Math.log(collectionProbability).  Do
the math <http://en.wikipedia.org/wiki/List_of_logarithmic_identities>: if
you add that in, you will get something equal to the original formulation
(e.g., in Zhai and Lafferty 2001).  For reference, that looks like:
    term_score_official = Math.log((freq + mu * collectionProbability) /
        (docLen + mu))
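
As a quick sanity check of that identity, here is a minimal sketch in plain
Java (no Lucene dependency).  The class name, method names, and the numeric
values are made up for illustration; only the two formulas come from the
text above.

```java
public class DirichletIdentityCheck {
    // Score in the form the post attributes to the current implementation
    // (boost ignored).
    public static double termScoreCurrent(double freq, double docLen,
                                          double mu, double collectionProbability) {
        return Math.log(1 + freq / (mu * collectionProbability))
             + Math.log(mu / (docLen + mu));
    }

    // Score in the form attributed to Zhai and Lafferty (2001).
    public static double termScoreOfficial(double freq, double docLen,
                                           double mu, double collectionProbability) {
        return Math.log((freq + mu * collectionProbability) / (docLen + mu));
    }

    public static void main(String[] args) {
        // Illustrative values only; they are not taken from any real index.
        double freq = 3, docLen = 120, mu = 2000, p = 0.0005;
        double corrected = termScoreCurrent(freq, docLen, mu, p) + Math.log(p);
        double official = termScoreOfficial(freq, docLen, mu, p);
        System.out.println("current + log(p) = " + corrected);
        System.out.println("official         = " + official);
    }
}
```

The two printed values should agree up to floating-point rounding, since
log(1 + f/(mu*p)) + log(mu/(d+mu)) + log(p) collapses to
log((f + mu*p)/(d+mu)).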

If you add that term, though, the second problem arises.  That
Math.log(collectionProbability) term does not get added for terms that
don't MATCH with a document because .score() doesn't get called if there's
no MATCH.  This is basically the problem that Ronan Cummins wrote about a
few weeks ago.

2. *Leave LMDirichletSimilarity as it is* but *add a term* to every final
score that is returned.  (Note: you'd also need to remove the
non-negative score restriction in LMDirichletSimilarity.)  This would be
the sum of the log collection probabilities for each term:
    query_score = sum(term_score_current) +
        sum(Math.log(collectionProbability))

As some have mentioned, this is basically an additive version of a
queryNorm.  It seems like the right way to do this is to wrap each query in
a modified CustomScoreQuery accessing a CustomScoreProvider, which would
then add that "constant" term across all documents.  However, this
"constant" term needs to be computed from statistics; how can this be
done?  Those statistics are available in LMDirichletSimilarity, but it is
less clear how to find those statistics directly from a Query object.
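
To make the arithmetic of that "constant" concrete, here is a rough sketch in
plain Java that takes the raw counts as inputs.  Everything named here
(CollectionLogPrior, queryConstant, the sample counts) is hypothetical, and
the add-one smoothing is an assumption made only to avoid log(0); it may not
match what LMDirichletSimilarity does internally.  In Lucene the two counts
could plausibly be read from IndexReader.totalTermFreq(Term) and
IndexReader.getSumTotalTermFreq(String), but how to wire that into a
CustomScoreProvider is exactly the open question.

```java
public class CollectionLogPrior {
    // Collection probability of one term from collection-wide counts.
    // The +1 smoothing is an assumption, not a confirmed Lucene detail.
    public static double collectionProbability(long totalTermFreq,
                                               long sumTotalTermFreq) {
        return (totalTermFreq + 1.0) / (sumTotalTermFreq + 1.0);
    }

    // Sum of log collection probabilities over all query terms.  This is
    // the per-query value that would be added to every document's score,
    // whether or not the document matches each term.
    public static double queryConstant(long[] totalTermFreqs,
                                       long sumTotalTermFreq) {
        double sum = 0.0;
        for (long ttf : totalTermFreqs) {
            sum += Math.log(collectionProbability(ttf, sumTotalTermFreq));
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical counts for a two-term query over a 999-token field.
        long[] termCounts = {9L, 99L};
        System.out.println(queryConstant(termCounts, 999L));
    }
}
```

Because the value depends only on the query terms and collection statistics,
it could be computed once per query rather than once per document.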

stephen
