On Jan 6, 2010, at 3:52 PM, Drew Farris wrote:

> On Wed, Jan 6, 2010 at 3:35 PM, Grant Ingersoll <[email protected]> wrote:
>>
>> On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:
>>
>>> No. We really don't.
>>
>> FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some basic LLR
>> stuff that we use in utils.lucene.ClusterLabels. Would be great to see this
>> stuff expanded.
>>
>
> So, doing something like this would involve some number of M/R passes
> to do the ngram generation and counting, and to calculate LLR using
> o.a.m.math.stats.LogLikelihood, but what to do about tokenization?
>
> I've seen the approach of using a list of filenames as input to the
> first mapper, which slurps in and tokenizes / generates ngrams for
> the text of each file, but is there something that works better?
>
> Would Lucene's StandardAnalyzer be sufficient for generating tokens?
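For the LLR part: once the counting passes have produced the four contingency counts for an ngram, the scoring itself is just a static call into LogLikelihood. A rough, untested sketch with made-up counts for a bigram "A B" (the variable names and numbers are purely illustrative):

// Rough sketch: scoring one ngram with o.a.m.math.stats.LogLikelihood.
// The counts are made-up contingency-table entries for a bigram "A B".
import org.apache.mahout.math.stats.LogLikelihood;

public class LlrSketch {
  public static void main(String[] args) {
    int k11 = 110;      // count of the bigram "A B" itself
    int k12 = 2442;     // "B" preceded by something other than "A"
    int k21 = 211;      // "A" followed by something other than "B"
    int k22 = 1000000;  // all other bigrams in the corpus
    System.out.println(LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22));
  }
}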
As for tokenization, why not make it possible to pass in the Analyzer? I think the classifier stuff does that, assuming the Analyzer has a no-arg constructor, which many do. It's the one place, however, where I think we could benefit from something like Spring or Guice.
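Something along these lines is all it takes; the config key here is just made up for illustration, and the real constraint is that whatever Analyzer class you name has a public no-arg constructor:

// Sketch of pulling a pluggable Analyzer out of the job configuration.
// "mahout.analyzer.class" is a hypothetical key, not something in trunk.
import org.apache.hadoop.conf.Configuration;
import org.apache.lucene.analysis.Analyzer;

public final class AnalyzerFactory {
  public static Analyzer create(Configuration conf) throws Exception {
    String className = conf.get("mahout.analyzer.class");
    return Class.forName(className).asSubclass(Analyzer.class).newInstance();
  }
}

Each mapper could then build its Analyzer once in setup() from the job's Configuration, rather than hard-coding StandardAnalyzer.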
