On Jan 5, 2010, at 3:18 PM, Ted Dunning wrote:

> No.  We really don't.

FWIW, I checked in math/o.a.m.math.stats.LogLikelihood w/ some basic LLR stuff 
that we use in utils.lucene.ClusterLabels.  Would be great to see this stuff 
expanded.
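
For reference, here's a rough, self-contained sketch of the kind of 2x2
contingency-table LLR calculation involved; the class and method names below
are made up for illustration and aren't tied to the actual API:

// Self-contained sketch of Dunning's log-likelihood ratio test on a 2x2
// contingency table for a bigram (A B).  Names are hypothetical.
public final class LlrSketch {

  // k11 = count(A B), k12 = count(A without B), k21 = count(B without A),
  // k22 = count(neither), all drawn from the same corpus of bigrams.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + colEntropy - matrixEntropy);
  }

  // Unnormalized entropy: (sum x) * ln(sum x) - sum(x * ln x).
  private static double entropy(long... counts) {
    double sum = 0.0;
    double xLogX = 0.0;
    for (long x : counts) {
      sum += x;
      if (x > 0) {
        xLogX += x * Math.log(x);
      }
    }
    return sum > 0.0 ? sum * Math.log(sum) - xLogX : 0.0;
  }

  public static void main(String[] args) {
    // Purely illustrative counts for one bigram and its margins.
    System.out.println(logLikelihoodRatio(110, 2442, 111, 2300000));
  }
}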

> 
> The most straightforward implementation does a separate pass for computing
> the overall total, for counting the unigrams and then counting the bigrams.
> It is cooler, of course, to count all sizes of ngrams in one pass and output
> them to separate files.  Then a second pass can do a map-side join if the
> unigram table is small enough (it usually is) and compute the results.  All
> of this is very straightforward programming and is a great introduction to
> map-reduce programming.
> 
> On Tue, Jan 5, 2010 at 12:09 PM, Jake Mannix <[email protected]> wrote:
> 
>> Ted, we don't have a MR job to scan through a corpus and output [ngram :
>> LLR] key-value pairs, do we?  I've got one we use at LinkedIn that I could
>> try to pull out if we don't have one.
>> 
>> (I actually used to give this MR job as an interview question, because it's
>> a cute problem whose basics you can work out in not too long).
>> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
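
For the archives, a rough sketch of the second pass Ted describes above: the
mapper loads the (small) unigram count table into memory during setup, then
streams the bigram counts from the first pass and emits [bigram : LLR] pairs.
This isn't the LinkedIn job or anything in Mahout yet; the config keys, the
tab-separated token/count file layout, and the assumption that the first pass
wrote a sequence file of (ngram, count) pairs are all made up, and it leans on
the LlrSketch routine from earlier in this mail.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BigramLlrMapper extends Mapper<Text, LongWritable, Text, DoubleWritable> {

  private final Map<String, Long> unigramCounts = new HashMap<String, Long>();
  private long totalBigrams;

  @Override
  protected void setup(Context context) throws IOException {
    // Hypothetical config keys; in practice the unigram file would be shipped
    // to each mapper (e.g. via the distributed cache) after the counting pass.
    String unigramFile = context.getConfiguration().get("llr.unigram.counts.file");
    totalBigrams = context.getConfiguration().getLong("llr.total.bigrams", 1L);
    BufferedReader in = new BufferedReader(new FileReader(unigramFile));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t");
      unigramCounts.put(parts[0], Long.parseLong(parts[1]));
    }
    in.close();
  }

  @Override
  protected void map(Text bigram, LongWritable count, Context context)
      throws IOException, InterruptedException {
    // Treats unigram counts as the bigram margins, the usual approximation,
    // and assumes every token in a bigram also appears in the unigram table.
    String[] tokens = bigram.toString().split(" ");
    long k11 = count.get();                          // count(A B)
    long k12 = unigramCounts.get(tokens[0]) - k11;   // A followed by something else
    long k21 = unigramCounts.get(tokens[1]) - k11;   // something else followed by B
    long k22 = totalBigrams - k11 - k12 - k21;       // neither A nor B
    context.write(bigram, new DoubleWritable(
        LlrSketch.logLikelihoodRatio(k11, k12, k21, k22)));
  }
}

No reducer is needed for this part; sorting the output by score would be a
separate (and equally small) job.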

