Ideally, yeah. I think it would be nice to be able to pass in a custom
analyzer, or at least to provide some options. I saw the LogLikelihood
class Grant was referring to in math.stats, but I don't see any M/R LLR
piece, at least not one that's nicely abstracted and extracted out.
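
To make the "pass in a custom analyzer" idea concrete, here is a minimal
sketch of loading an analyzer by class name through its no-arg
constructor. Everything in it is an assumption for illustration:
TextAnalyzer is a hypothetical stand-in for Lucene's Analyzer (so the
snippet compiles without Lucene), and LowercaseWhitespaceAnalyzer and
AnalyzerLoader are made-up names, not existing Mahout classes. In a real
job the class name would presumably come from the job configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for Lucene's Analyzer so the sketch is self-contained.
interface TextAnalyzer {
    List<String> tokenize(String text);
}

// Example analyzer; it relies on the implicit no-arg constructor.
class LowercaseWhitespaceAnalyzer implements TextAnalyzer {
    @Override
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}

// Hypothetical helper: instantiate an analyzer from a class name.
class AnalyzerLoader {
    static TextAnalyzer load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(TextAnalyzer.class) // rejects non-analyzer classes
                    .getDeclaredConstructor()       // requires a no-arg constructor
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "Cannot instantiate analyzer " + className, e);
        }
    }
}
```

A mapper could then call AnalyzerLoader.load(name).tokenize(text) without
knowing which analyzer it was given. The obvious limitation is that
analyzers needing constructor arguments can't be created this way.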

@Ted, where is the partial framework you're referring to? And yes, this is
definitely something I would like to work on if pointed in the right
direction. I wasn't quite sure, though, because I remember a long-winded
discussion/debate a while back on the list about what Mahout's purpose
should be. N-gram LLR for collocations seems like a very NLP type of thing
to have (obviously it could be used in other applications as well, but
by itself it's NLP to me), and from my understanding the "consensus" is
that Mahout should focus on scalable machine learning.
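
For reference, the scoring step of N-gram LLR for collocations is small
once the counts exist. Below is a minimal sketch of the log-likelihood
ratio over a 2x2 contingency table, in the entropy formulation. It is not
the o.a.m.math.stats.LogLikelihood API; LlrSketch and its method names are
hypothetical, and bigramLlr assumes countA is the number of bigrams with A
in first position, countB the number with B in second position, and total
the number of bigrams overall.

```java
class LlrSketch {
    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts: N*ln(N) - sum(k*ln(k)).
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // LLR for the table [[k11, k12], [k21, k22]].
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    // Build the contingency table for bigram "A B" from raw counts.
    static double bigramLlr(long countAB, long countA, long countB, long total) {
        long k11 = countAB;                            // A followed by B
        long k12 = countA - countAB;                   // A followed by not-B
        long k21 = countB - countAB;                   // not-A followed by B
        long k22 = total - countA - countB + countAB;  // neither
        return llr(k11, k12, k21, k22);
    }
}
```

A table where the two words are independent scores near zero, and the
score grows as the pair co-occurs more often than chance would predict.
The M/R passes would only need to produce the four counts per bigram.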

On Wed, Jan 6, 2010 at 4:04 PM, Grant Ingersoll <[email protected]> wrote:

>
> On Jan 6, 2010, at 3:52 PM, Drew Farris wrote:
>
> > On Wed, Jan 6, 2010 at 3:35 PM, Grant Ingersoll <[email protected]> wrote:
> >>
> >> On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:
> >>
> >>> No.  We really don't.
> >>
> >> FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some basic LLR
> >> stuff that we use in utils.lucene.ClusterLabels.  Would be great to see this
> >> stuff expanded.
> >>
> >
> > So, doing something like this would involve some number of M/R passes
> > to do the ngram generation, counting and calculate LLR using
> > o.a.m.math.stats.LogLikelihood, but what to do about tokenization?
> >
> > I've seen the approach of using a list of filenames as input to the
> > first mapper, which slurps in each file, tokenizes its text and
> > generates ngrams from it, but is there something that works better?
> >
> > Would Lucene's StandardAnalyzer be sufficient for generating tokens?
>
> Why not be able to pass in the Analyzer?  I think the classifier stuff
> does, assuming it takes a no-arg constructor, which many do.  It's the one
> place, however, where I think we could benefit from something like Spring or
> Guice.




-- 
Zaki Rahaman
