Sounds like you've got almost exactly the MR passes I do when I do this, although I'm not sure because I'm reading this on my phone... I'll look at it more closely when I get home.
I do remember I had at least one (maybe two) identity mappers in my sequence. Your approach doesn't need a custom comparator to do a secondary sort, which mine did... not sure how you're getting around needing that...

-jake

On Jan 7, 2010 6:47 PM, "Drew Farris" <[email protected]> wrote:

This conversation has been pretty inspiring, thanks everyone. I spent some time thinking about the steps involved in the M/R LLR job. I'm a bit of a greenhorn when it comes to this, so it would be great to see what you all think. Here's what I've been able to piece together so far:

To perform the LLR calculation, we need 4 values for the combinations of (n-1)grams in ngrams in the input data: A+B, A+!B, !A+B, !A+!B. For the ngram 'the best', with A=the, B=best, this would mean:

A+B   = the number of times the ngram 'the best' appears
A+!B  = the number of times 'the' appears in an ngram without 'best'
!A+B  = the number of times 'best' appears in an ngram without 'the'
!A+!B = the number of ngrams that contain neither A nor B (are not 'the best')

It is also necessary to have N, the total number of ngrams. (Ted's blog post referenced in the LLR class really helped me understand this concretely, thanks!)

Input into the job is arranged so that the first mapper gets a single entire text document for each map call. This gets run through the Analyzer + Lucene ShingleFilter combo. The output of that map task is something like k:(n-1)gram v:ngram. For the input 'the best of times', the mapper output is:

k:the   v:the best
k:best  v:the best
k:best  v:best of
k:of    v:best of
k:of    v:of times
k:times v:of times

(As an aside, the shingles were 'the best', 'best of', 'of times'.) We preserve the total number of ngrams/shingles (N); in this case it would be 3.

In the reducer, we count the number of ngrams each (n-1)gram appears in and the number of times each ngram appears. (I wind up counting each ngram 'n' times, however -- there's probably a way around that.)
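[Not from the thread -- just a language-neutral sketch of the first pass described above, simulated in plain Python. The helper names (shingles, pass1_map) are mine, not from any Mahout code, and shingles() only stands in for what the Lucene ShingleFilter does for n=2.]

```python
from collections import defaultdict

def shingles(tokens, n=2):
    """Word n-grams ('shingles'); a stand-in for the Lucene ShingleFilter."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pass1_map(document, n=2):
    """First-pass mapper: emit k:(n-1)gram v:ngram for each unigram in each bigram."""
    tokens = document.split()
    for ngram in shingles(tokens, n):
        for unigram in ngram.split():
            yield unigram, ngram

# Simulate the shuffle + first reducer: count how many ngrams each (n-1)gram
# appears in, and how often each ngram was emitted.
unigram_freq = defaultdict(int)
ngram_seen = defaultdict(int)
for unigram, ngram in pass1_map('the best of times'):
    unigram_freq[unigram] += 1
    ngram_seen[ngram] += 1  # each ngram arrives n times (the double-counting noted above)

ngram_freq = {g: c // 2 for g, c in ngram_seen.items()}  # divide out the n=2 duplication
N = sum(ngram_freq.values())  # total number of ngrams; 3 for this input
```

The `c // 2` step is one way around the "counting each ngram 'n' times" problem: since every occurrence is emitted exactly n times, the reducer can simply divide.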
Once we have this info, we can output the following from the reducer, k:ngram,ngram-frequency v:(n-1)gram,(n-1)gram-frequency, e.g.:

k:the best,1  v:best,2
k:best of,1   v:best,2
k:best of,1   v:of,2
k:of times,1  v:of,2
k:the best,1  v:the,1
k:of times,1  v:times,1

The next mapper could just be a no-op, because we have the data in the right shape to do the LLR in the next reduction pass:

k:the best,1  v:best,2; v:the,1
k:best of,1   v:best,2; v:of,2
k:of times,1  v:of,2; v:times,1

((n-1)grams sorted)

nf   = the ngram frequency
ln1f = left (n-1)gram frequency
rn1f = right (n-1)gram frequency
N    = total number of ngrams

A+B   = nf
A+!B  = ln1f - nf
!A+B  = rn1f - nf
!A+!B = N - (ln1f + rn1f - nf)

With these we calculate LLR using the class in o.a.m.stats, and the reducer output can be k:LLR v:ngram.

Does this work, or did I miss something critical? I've only thought this through for n=2, but I suspect it extends to other cases. I'm curious as to whether it has a problem when both 'best of' and 'of best' occur in the text, but I haven't thought that through yet.

I have the first map/reduce pass implemented using the hadoop 0.19 api, but the code is brutally inefficient in a couple of spots, especially because it keeps the *grams as strings. There should likely be a pass to convert them to integers or something more compact, right?

I'm also wondering about the best way to handle input. Line-by-line processing would miss ngrams spanning lines, but full-document processing with the StandardAnalyzer + ShingleFilter will form ngrams across sentence boundaries. I followed the WholeFileInputFormat example from the Hadoop in Action book to slurp in data, but I don't like the idea of reading the entire file into a buffer and then wrapping that up into something that can be accessed via the Reader consumed by Lucene's Analyzer. I would much rather pass a Reader/Stream into the mapper if possible.

I'm interested in whether there's a more efficient way to structure the M/R passes.
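[Also not from the thread -- a sketch of the final LLR step using the k11/k12/k21/k22 mapping above. This follows the entropy formulation from Ted's writeup; it is illustrative, not the actual o.a.m.stats implementation.]

```python
import math

def xlogx(x):
    """x * ln(x), with the 0 * ln(0) = 0 convention."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """G^2 = 2 * (H(row sums) + H(column sums) - H(cells))."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# 'the best' from the running example: nf=1, ln1f=1 ('the'), rn1f=2 ('best'), N=3
nf, ln1f, rn1f, N = 1, 1, 2, 3
score = llr(nf, ln1f - nf, rn1f - nf, N - (ln1f + rn1f - nf))  # positive
```

A useful sanity check: a perfectly independent table like (1, 1, 1, 1) scores 0, and any genuine co-occurrence scores above it.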
It feels a little funny to no-op a whole map cycle. It would almost be better if one could chain two reduces together. On Thu, Jan 7, 2010 at 7:57 PM, Ted Dunning <[email protected]> wrote: > The pieces are laying ...
