On Fri, Jan 8, 2010 at 10:36 AM, Robin Anil <[email protected]> wrote:

> One interesting thing I found was that any ngram with LLR <1 is practically
> junk, anything over LLR>50 is pretty awesome. between 1-50, its always
> debatable. This holds approximately true for large and small datasets.
I don't think the absolute value of the LLR score is an indicator of a term's importance across all datasets. With a corpus of a million documents, if I calculate the LLR scores of terms in a set of, say, 50,000 documents, I get hundreds of terms with scores above 50, many of which are not "useful."

Ted, can you please comment on Robin's observation?

--shashi
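
To illustrate why a fixed cutoff such as 50 may not transfer between corpus sizes, here is a minimal sketch. It assumes the score under discussion is the log-likelihood ratio (G^2) computed on a 2x2 contingency table of counts. The function names and the counts are hypothetical, chosen only for illustration. With the same proportions, multiplying every count by a constant multiplies the score by that constant.

```python
import math


def xlogx(x):
    # x * log(x), with the usual convention that 0 * log(0) = 0.
    return 0.0 if x == 0 else x * math.log(x)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 table of counts.

    k11: term in the subset, k12: other terms in the subset,
    k21: term in the rest,   k22: other terms in the rest.
    """
    n = k11 + k12 + k21 + k22
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    return 2.0 * (xlogx(n) - rows - cols + cells)


# Hypothetical counts: identical proportions, 100x more data.
small = (10, 90, 100, 9800)
large = tuple(100 * k for k in small)

print("small corpus LLR:", llr(*small))
print("large corpus LLR:", llr(*large))
print("ratio:", llr(*large) / llr(*small))
```

Because the score grows with the amount of data, a term with a weak association can clear any fixed threshold once the corpus is large enough, which is consistent with seeing hundreds of terms above 50 in a large collection.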
