On Fri, Jan 8, 2010 at 7:03 AM, Grant Ingersoll <[email protected]> wrote:
> On Jan 7, 2010, at 7:57 PM, Ted Dunning wrote:
>
>> The pieces are lying around.
>>
>> I had a framework like this for recs and text analysis at Veoh, and Jake
>> has something at LinkedIn.
>>
>> But the amount of code is relatively small and could probably be rewritten
>> before Jake can get clearance to release anything.
>>
>> The first step is to just count n-grams. I think the input should be
>> relatively flexible, and if you assume parametrized use of Lucene
>> analyzers, then all that is necessary is a small step up from word
>> counting.
>
> The classification stuff has this already, in MR form, independent of
> Lucene.
>
>> This should count all n-grams from 0 up to a limit. It should also allow
>> suppression of output of any counts below a threshold. The total number
>> of n-grams of each size observed should be accumulated.
>
> I believe it does this, too. Robin?

Yeah, brute-force n-gram generation is done by the Bayes classifier. Beware: it is practically a combinatorial explosion of data, but enough machines can tame it well.

Take a look at the DictionaryVectorizer. If an LLR job could be added to the chain, I could use that information while creating vectors: https://issues.apache.org/jira/browse/MAHOUT-237

I like the formulation Drew made, using (n-1)-grams to generate n-grams. It is the same approach I used to generate n-grams at http://thinking.me/ (Himanshu and I built it when I was still in college), but that was just a PHP script iterating over a sample of Twitter data :).

One interesting thing I found is that any n-gram with LLR < 1 is practically junk, and anything with LLR > 50 is pretty awesome; between 1 and 50 it is always debatable. This held approximately true for both large and small datasets.

I would be really happy if Drew could work on the LLR-based bigram generation code and help me attach it to the rest of the DictionaryVectorizer. Also, I would prefer that the entire Mahout codebase agree on a single format for document input.
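For anyone following along, here is a minimal sketch of that (n-1)-gram formulation in Python (illustrative only, not the Mahout or PHP code): each n-gram is built by joining adjacent, overlapping (n-1)-grams, and the payoff is that low-count (n-1)-grams can be pruned before extension, which tames the combinatorial blow-up.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def extend(prev_grams):
    """Build n-grams from a list of (n-1)-grams in sequence order:
    each (n-1)-gram is joined with its right neighbour, which shares
    an (n-2)-token overlap and contributes its last token."""
    return [a + (b[-1],) for a, b in zip(prev_grams, prev_grams[1:])]

tokens = "counting n-grams is a small step up from word counting".split()
bigrams = ngrams(tokens, 2)
trigrams = extend(bigrams)  # same result as ngrams(tokens, 3)
```

In the pruned variant, you would filter `bigrams` by count or LLR before calling `extend`, so junk (n-1)-grams never spawn n-grams.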
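For reference, the LLR in those thresholds is Dunning's G² statistic over a 2x2 contingency table; a Python port of that formulation (my sketch, computing the same statistic as Mahout's LLR code, not taken from it) looks like:

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy term used in the G^2 statistic.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a bigram (A, B):
    k11 = count of A followed by B,
    k12 = count of A followed by something other than B,
    k21 = count of B preceded by something other than A,
    k22 = count of everything else."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# Independent words score ~0; strongly associated words score high.
print(llr(1, 1, 1, 1))    # ~0.0 -> junk by the rule of thumb above
print(llr(10, 0, 0, 10))  # ~27.7 -> in the debatable 1-50 band
```

The counts would come from the n-gram counting job, so this drops naturally into the chain after counting.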
I would suggest we stick to SequenceFiles, with the key as the docid and the value as the document content. That way, we leave the creation of the sequence files to the user.

Robin
