No. We really don't. The most straightforward implementation does separate passes: one to compute the overall total, one to count the unigrams, and one to count the bigrams. It is cooler, of course, to count all sizes of ngrams in one pass and output them to separate files. A second pass can then do a map-side join against the unigram table, provided it is small enough to hold in memory (it usually is), and compute the results. All of this is very straightforward programming and is a great introduction to map-reduce programming.
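To make the two passes concrete, here is a minimal in-memory sketch in Python. It is not Mahout code and not the LinkedIn job; the function names (`count_ngrams`, `llr`, `score_bigrams`) and the toy corpus are made up for illustration. It assumes the "overall total" is the total token count and uses the standard G^2 log-likelihood ratio on a 2x2 contingency table. In a real map-reduce job the counters would be reducer outputs and the unigram dictionary would be the small side of the map-side join.

```python
import math
from collections import Counter


def count_ngrams(docs):
    """Pass 1: count unigrams and bigrams together in a single scan."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            k = cells[i][j]
            if k > 0:  # an empty cell contributes nothing
                total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def score_bigrams(unigrams, bigrams):
    """Pass 2: join each bigram against the in-memory unigram table."""
    n = sum(unigrams.values())  # assumption: overall total = token count
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = unigrams[a] - k11  # a followed by something other than b
        k21 = unigrams[b] - k11  # b preceded by something other than a
        # Clamp: for repeated-word bigrams in tiny corpora this
        # approximation can dip below zero.
        k22 = max(0, n - unigrams[a] - unigrams[b] + k11)
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


# Toy corpus, purely illustrative.
docs = [
    ["new", "york", "is", "big"],
    ["new", "york", "city"],
    ["the", "city", "is", "big"],
]
unigrams, bigrams = count_ngrams(docs)
scores = score_bigrams(unigrams, bigrams)
for bigram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(bigram, round(score, 3))
```

Using unigram counts as the row and column marginals is an approximation (a word at the end of a document never starts a bigram); counting first-position and second-position words separately would make the table exact at the cost of two more small tables.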
On Tue, Jan 5, 2010 at 12:09 PM, Jake Mannix <[email protected]> wrote:

> Ted, we don't have a MR job to scan through a corpus and output
> [ngram : LLR] key-value pairs, do we? I've got one we use at LinkedIn
> that I could try and pull out if we don't have one.
>
> (I actually used to give this MR job as an interview question, because
> it's a cute problem you can work out the basics of in not too long).

--
Ted Dunning, CTO
DeepDyve
