On Tue, Jan 5, 2010 at 12:18 PM, Ted Dunning <[email protected]> wrote:

> No.  We really don't.
>
> The most straightforward implementation does a separate pass for computing
> the overall total, for counting the unigrams and then counting the bigrams.
> It is cooler, of course, to count all sizes of ngrams in one pass and
> output
> them to separate files.  Then a second pass can do a map-side join if the
> unigram table is small enough (it usually is) and compute the results.  All
> of this is very straightforward programming and is a great introduction to
> map-reduce programming.
>

Oh you and your map-side join!  That's no fun - to pass my interview you
need
to do it in the case where the unigrams *don't* fit in memory (because lets
say you wanted to compute a variation of LLR for trigrams which used the
sub-bigram counts in the computation).  It's much more fun to do an n-way
self-join with composite keys and secondary sorts! :)

  -jake

Reply via email to