Re: Collocations in Mahout?

Shashikant Kore Fri, 08 Jan 2010 04:45:02 -0800

On Fri, Jan 8, 2010 at 10:36 AM, Robin Anil <[email protected]> wrote:
>
> One interesting thing I found was that any ngram with LLR < 1 is practically
> junk, and anything with LLR > 50 is pretty awesome. Between 1 and 50, it's
> always debatable. This holds approximately true for large and small datasets.
>


I don't think the absolute value of the LLR score is an indicator of
a term's importance across all datasets.

With a corpus of a million documents, if I calculate the LLR scores of
terms in a set of, say, 50,000 documents, I get hundreds of terms with
scores above 50, many of which are not "useful."
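
To illustrate why a fixed cutoff may not carry over between datasets, here
is a rough sketch of the standard log-likelihood ratio (G^2) on a 2x2
contingency table. It is not Mahout's code; the function name llr and all
counts are made up for illustration. The point it shows is that multiplying
every count by 10 (same proportions, bigger sample) multiplies the score
by 10 as well.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: documents in the subset that contain the term
    k12: documents in the subset that do not contain the term
    k21: documents outside the subset that contain the term
    k22: documents outside the subset that do not contain the term
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing (0 * log 0 taken as 0)
            # observed count times log(observed / expected under independence)
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


# Hypothetical counts: same proportions, second table is 10x larger.
small = llr(10, 90, 100, 9800)
large = llr(100, 900, 1000, 98000)
print(small, large, large / small)
```

Since the score grows linearly with the sample size for a fixed association
strength, a threshold such as 50 means something different on 10,000
documents than on a million.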

Ted, can you please comment on Robin's observation?

--shashi
