On Tue, Feb 16, 2010 at 11:13 AM, Jason Rennie <[email protected]> wrote:
> Am I incorrect in thinking that the events used for LLR here are the
> occurrences of the individual terms in a bigram? I'm looking here:
>
> http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java?view=markup
>

Here is my take on the matter:

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

The events are occurrences of word A (and, complementarily, any non-A word)
in the first position and word B (and non-B words) in the second position.

> I don't follow the argument that tf-idf is an approximation of LLR. Are you
> referring to the Papineni paper?
>

No. I was referring to my own napkin scribblings. If you expand the LLR
score that uses the events "word A / not A" against "in this document / in
other documents", you find

    count(A in this document) log (count of A in other documents)

as one of the dominant terms in the expression. This is nearly identical to
tf*log(idf) in terms of the sort order imposed on terms.

--
Ted Dunning, CTO
DeepDyve
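
To make the event table above concrete, here is a minimal Python sketch of the LLR (G^2) score for a 2x2 table of counts. It is not the Mahout code; it is just the standard G^2 statistic written out directly. The function name `llr_2x2`, the argument names, and the example counts are all made up for illustration.

```python
import math


def llr_2x2(k11, k12, k21, k22):
    """G^2 log-likelihood ratio score for a 2x2 table of counts.

    For bigrams (hypothetical naming):
      k11 = A in first position, B in second
      k12 = A in first position, non-B in second
      k21 = non-A in first position, B in second
      k22 = non-A in first position, non-B in second
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    )
    total = 0.0
    for k, r, c in cells:
        # A zero cell contributes nothing (the limit of k*log(k) as k -> 0 is 0).
        if k > 0:
            # Observed count times log(observed / expected under independence).
            total += k * math.log(k * n / (r * c))
    return 2.0 * total


# Made-up counts: this table is exactly independent, so the score is zero.
independent = llr_2x2(10, 20, 30, 60)

# Made-up counts: A and B co-occur far more often than chance would suggest.
associated = llr_2x2(100, 20, 30, 10000)
```

The same function covers the tf-idf comparison if the table is read as "word A / not A" against "in this document / in other documents": k11 is the count of A in this document, k12 the count of other words in this document, k21 the count of A in other documents, and k22 the count of other words in other documents.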
