On Wed, Jan 6, 2010 at 3:35 PM, Grant Ingersoll <[email protected]> wrote:
>
> On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:
>
>> No.  We really don't.
>
> FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some basic LLR 
> stuff that we use in utils.lucene.ClusterLabels.  Would be great to see this 
> stuff expanded.
>

So, doing something like this would involve some number of M/R passes
to do the ngram generation and counting, and then to calculate LLR using
o.a.m.math.stats.LogLikelihood -- but what should we do about tokenization?
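
For the LLR step itself, once the ngram counts are in hand, I'm picturing
something like the snippet below. The contingency counts are made up, and
I'm assuming LogLikelihood exposes a static logLikelihoodRatio(k11, k12,
k21, k22) along the lines of what Grant checked in:

import org.apache.mahout.math.stats.LogLikelihood;

public final class BigramScore {
  public static void main(String[] args) {
    // Hypothetical contingency counts for the bigram "machine learning":
    //   k11 = occurrences of "machine learning"
    //   k12 = occurrences of "machine" followed by some other token
    //   k21 = occurrences of some other token followed by "learning"
    //   k22 = all remaining bigrams in the corpus
    int k11 = 110, k12 = 2442, k21 = 950, k22 = 255349;
    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    System.out.println("LLR = " + llr);
  }
}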

I've seen the approach of using a list of filenames as input to the
first mapper, which slurps in each file, tokenizes it, and generates
ngrams from its text, but is there something that works better?
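
Roughly what I have in mind for that first pass is sketched below. This is
just a sketch against the Hadoop mapreduce (new) API, and SimpleTokenizer
is a hypothetical helper standing in for whatever analyzer we settle on
(see the StandardAnalyzer sketch further down):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// First-pass mapper: input values are filenames (one per line); the mapper
// slurps each file, tokenizes it, and emits each bigram with a count of 1.
public class NGramCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Path path = new Path(value.toString());
    FileSystem fs = path.getFileSystem(context.getConfiguration());

    // Read the whole file into a string (fine for modest-sized documents).
    StringBuilder text = new StringBuilder();
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(path), "UTF-8"));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        text.append(line).append('\n');
      }
    } finally {
      reader.close();
    }

    // Placeholder for whatever tokenization we choose.
    List<String> tokens = SimpleTokenizer.tokenize(text.toString());
    for (int i = 0; i + 1 < tokens.size(); i++) {
      context.write(new Text(tokens.get(i) + ' ' + tokens.get(i + 1)), ONE);
    }
  }
}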

Would Lucene's StandardAnalyzer be sufficient for generating tokens?
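
If StandardAnalyzer turns out to be good enough, the placeholder tokenizer
above could look something like this -- written against the Lucene 2.9/3.0
attribute-based TokenStream API, so it would need adjusting for whatever
Lucene version Mahout ends up depending on:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public final class SimpleTokenizer {
  // Tokenize a chunk of text with StandardAnalyzer and return the tokens.
  public static List<String> tokenize(String text) throws IOException {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    TokenStream stream = analyzer.tokenStream("text", new StringReader(text));
    TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
    List<String> tokens = new ArrayList<String>();
    while (stream.incrementToken()) {
      tokens.add(termAtt.term());
    }
    stream.close();
    return tokens;
  }
}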
