On Tue, Feb 16, 2010 at 11:13 AM, Jason Rennie <[email protected]> wrote:

> Am I incorrect in thinking that the events used for LLR here are the
> occurrences of the individual terms in a bigram?  I'm looking here:
>
>
> http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java?view=markup
>

Here is my take on the matter:
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

The events are occurrences of word A (and complementarily, any non-A word)
in the first position and word B (and non-B words) in the second position.
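
As a rough sketch of the event definition above, the four cells of the 2x2 table can be counted directly from adjacent token pairs and scored with the log-likelihood ratio. The names bigram_table, llr, entropy and xlogx are hypothetical, chosen for this illustration. The entropy-based form is one common way to write the statistic and is not necessarily how any particular library lays it out.

```python
import math

def xlogx(x):
    # x * log(x), with the usual convention that 0 * log(0) = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy: N log N - sum(k log k), where N = sum(counts)
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off
    return max(0.0, 2.0 * (row + col - mat))

def bigram_table(tokens, a, b):
    # Events: A / not-A in the first position, B / not-B in the second
    pairs = list(zip(tokens, tokens[1:]))
    k11 = sum(1 for x, y in pairs if x == a and y == b)   # A then B
    k12 = sum(1 for x, y in pairs if x == a and y != b)   # A then not-B
    k21 = sum(1 for x, y in pairs if x != a and y == b)   # not-A then B
    k22 = sum(1 for x, y in pairs if x != a and y != b)   # not-A then not-B
    return k11, k12, k21, k22

if __name__ == "__main__":
    tokens = "new york is new and york is old".split()
    table = bigram_table(tokens, "new", "york")
    print(table, llr(*table))
```

A table whose rows and columns are independent scores approximately zero, and the score grows as A and B co-occur more often than their individual frequencies would predict.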


> I don't follow the argument that tf-idf is an approximation of LLR.  Are you
> referring to the Papineni paper?
>

No.  I was referring to my own napkin scribblings.  If you expand the LLR
score that uses the events word A/not A against in this document/in other
documents, you find count(A in this document) * log(count of A in other
documents) as one of the dominant terms in the expression.  This is nearly
identical to tf*log(idf) in terms of the sort order imposed on terms.
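
To make the document-level table concrete, here is a minimal sketch that scores a word by LLR using the events A/not A against this document/other documents. The function names and the toy counts in the usage lines are assumptions made up for illustration, not taken from any real corpus or library.

```python
import math

def xlogx(x):
    # x * log(x), with 0 * log(0) taken as 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy: N log N - sum(k log k)
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def llr(k11, k12, k21, k22):
    # Log-likelihood ratio (G^2) for a 2x2 contingency table
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def doc_llr(count_in_doc, doc_len, count_elsewhere, elsewhere_len):
    # Rows: word A / any other word.  Columns: this document / other documents.
    k11 = count_in_doc                     # A in this document
    k12 = count_elsewhere                  # A in other documents
    k21 = doc_len - count_in_doc           # not-A in this document
    k22 = elsewhere_len - count_elsewhere  # not-A in other documents
    return llr(k11, k12, k21, k22)

if __name__ == "__main__":
    # Hypothetical counts: a word concentrated in this document ...
    print(doc_llr(5, 100, 5, 10000))
    # ... versus a word that is equally common everywhere
    print(doc_llr(1, 100, 100, 10000))
```

A word that occurs at the same rate inside and outside the document scores near zero, while a word that is frequent here and rare elsewhere scores high, which is the same qualitative behavior tf-idf rewards.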


-- 
Ted Dunning, CTO
DeepDyve
