Sounds like you've got almost exactly the MR passes I do when I do this, although I'm not sure because I'm reading this on my phone... I'll look at it more closely when I get home.
I do remember I had at least one (maybe two) identity mappers in my sequence. Your approach doesn't need a custom comparator to do a secondary sort, which mine did... not sure how you're getting around needing that...

-jake

On Jan 7, 2010 6:47 PM, "Drew Farris" <[email protected]> wrote:

This conversation has been pretty inspiring, thanks everyone. I spent some time thinking about the steps involved in the M/R LLR job. I'm a bit of a greenhorn when it comes to this, so it would be great to see what you all think. Here's what I've been able to piece together so far:

To perform the LLR calculation, we need 4 values for the combinations of (n-1)grams in ngrams in the input data: A+B, A+!B, !A+B, !A+!B. For the ngram 'the best', with A=the, B=best, this would mean:

A+B   = the number of times the ngram 'the best' appears
A+!B  = the number of times 'the' appears in an ngram without 'best'
!A+B  = the number of times 'best' appears in an ngram without 'the'
!A+!B = the number of ngrams that contain neither A nor B (are not 'the best')

It is also necessary to have N, the total number of ngrams. (Ted's blog post referenced in the LLR class really helped me understand this concretely, thanks!)

Input into the job is arranged so that the first mapper gets a single entire text document for each map call. This gets run through the Analyzer + Lucene ShingleFilter combo. The output of that map task is something like k:(n-1)gram v:ngram. For the input 'the best of times', the mapper output is:

k:the   v:the best
k:best  v:the best
k:best  v:best of
k:of    v:best of
k:of    v:of times
k:times v:of times

(As an aside, the shingles were 'the best', 'best of', 'of times'.) We preserve the total number of ngrams/shingles (N); in this case it would be 3.

In the reducer, we count the number of ngrams each (n-1)gram appears in and the number of times each ngram appears. (I wind up counting each ngram 'n' times, however -- there's probably a way around that.)
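[Not from the thread -- just a language-neutral sketch of the first pass described above, simulated in plain Python. The helper names (shingles, pass1_map) are mine, not from any Mahout code, and shingles() only stands in for what the Lucene ShingleFilter does for n=2.]

```python
from collections import defaultdict

def shingles(tokens, n=2):
    """Word n-grams ('shingles'); a stand-in for the Lucene ShingleFilter."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pass1_map(document, n=2):
    """First-pass mapper: emit k:(n-1)gram v:ngram for each unigram in each bigram."""
    tokens = document.split()
    for ngram in shingles(tokens, n):
        for unigram in ngram.split():
            yield unigram, ngram

# Simulate the shuffle + first reducer: count how many ngrams each (n-1)gram
# appears in, and how often each ngram was emitted.
unigram_freq = defaultdict(int)
ngram_seen = defaultdict(int)
for unigram, ngram in pass1_map('the best of times'):
    unigram_freq[unigram] += 1
    ngram_seen[ngram] += 1  # each ngram arrives n times (the double-counting noted above)

ngram_freq = {g: c // 2 for g, c in ngram_seen.items()}  # divide out the n=2 duplication
N = sum(ngram_freq.values())  # total number of ngrams; 3 for this input
```

The `c // 2` step is one way around the "counting each ngram 'n' times" problem: since every occurrence is emitted exactly n times, the reducer can simply divide.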
Once we have this info, we can output the following from the reducer, k:ngram,ngram-frequency v:(n-1)gram,(n-1)gram-frequency, e.g.:

k:the best,1  v:best,2
k:best of,1   v:best,2
k:best of,1   v:of,2
k:of times,1  v:of,2
k:the best,1  v:the,1
k:of times,1  v:times,1

The next mapper could just be a no-op, because we have the data in the right shape to do the LLR in the next reduction pass:

k:the best,1  v:best,2; v:the,1
k:best of,1   v:best,2; v:of,2
k:of times,1  v:of,2; v:times,1

((n-1)grams sorted)

nf   = the ngram frequency
ln1f = left (n-1)gram frequency
rn1f = right (n-1)gram frequency
N    = total number of ngrams

A+B   = nf
A+!B  = ln1f - nf
!A+B  = rn1f - nf
!A+!B = N - (ln1f + rn1f - nf)

With these we calculate LLR using the class in o.a.m.stats, and the reducer output can be k:LLR v:ngram.

Does this work, or did I miss something critical? I've only thought this through for n=2, but I suspect it extends to other cases. I'm curious as to whether it has a problem when both 'best of' and 'of best' occur in the text, but I haven't thought that through yet.

I have the first map/reduce pass implemented using the hadoop 0.19 api, but the code is brutally inefficient in a couple of spots, especially because it keeps the *grams as strings. There should likely be a pass to convert them to integers or something more compact, right?

I'm also wondering about the best way to handle input. Line-by-line processing would miss ngrams spanning lines, but full-document processing with the StandardAnalyzer + ShingleFilter will form ngrams across sentence boundaries. I followed the WholeFileInputFormat example from the Hadoop in Action book to slurp in data, but I don't like the idea of reading the entire file into a buffer and then wrapping that up into something that can be accessed via the Reader consumed by Lucene's Analyzer. I would much rather pass a Reader/Stream into the mapper if possible.

I'm interested in whether there's a more efficient way to structure the M/R passes.
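[Also not from the thread -- a sketch of the final LLR step using the k11/k12/k21/k22 mapping above. This follows the entropy formulation from Ted's writeup; it is illustrative, not the actual o.a.m.stats implementation.]

```python
import math

def xlogx(x):
    """x * ln(x), with the 0 * ln(0) = 0 convention."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """G^2 = 2 * (H(row sums) + H(column sums) - H(cells))."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# 'the best' from the running example: nf=1, ln1f=1 ('the'), rn1f=2 ('best'), N=3
nf, ln1f, rn1f, N = 1, 1, 2, 3
score = llr(nf, ln1f - nf, rn1f - nf, N - (ln1f + rn1f - nf))  # positive
```

A useful sanity check: a perfectly independent table like (1, 1, 1, 1) scores 0, and any genuine co-occurrence scores above it.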
It feels a little funny to no-op a whole map cycle. It would almost be better if one could chain two reduces together. On Thu, Jan 7, 2010 at 7:57 PM, Ted Dunning <[email protected]> wrote: > The pieces are laying ...
