Ok, I lied - I think what you described here is way *faster* than what I was doing, because I wasn't starting with the original corpus: I had something like Google's ngram terabyte data (a massive HDFS file with just "ngram ngram-frequency" on each line), which meant I had to do a multi-way join (which is where I needed to do a secondary sort by value).
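By "secondary sort by value" I just mean the usual composite-key trick; a totally untested sketch (made-up names, new mapreduce API) looks roughly like this:

// Untested sketch of the composite-key machinery for a secondary sort by value.
// All names here are made up; equals/hashCode omitted for brevity.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class GramFreqPair implements WritableComparable<GramFreqPair> {
  Text gram = new Text();                  // the natural key: the ngram itself
  LongWritable freq = new LongWritable();  // the "value" we want sorted

  public void write(DataOutput out) throws IOException {
    gram.write(out);
    freq.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    gram.readFields(in);
    freq.readFields(in);
  }

  // full sort order: by gram, then by frequency (descending, say)
  public int compareTo(GramFreqPair o) {
    int cmp = gram.compareTo(o.gram);
    return cmp != 0 ? cmp : -freq.compareTo(o.freq);
  }

  // partition on the gram only, so every record for a gram hits the same reducer
  public static class GramPartitioner extends Partitioner<GramFreqPair, Text> {
    public int getPartition(GramFreqPair key, Text value, int numPartitions) {
      return (key.gram.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // group on the gram only, so one reduce() call sees all of a gram's records,
  // already ordered by frequency thanks to compareTo above
  public static class GroupByGram extends WritableComparator {
    protected GroupByGram() {
      super(GramFreqPair.class, true);
    }
    public int compare(WritableComparable a, WritableComparable b) {
      return ((GramFreqPair) a).gram.compareTo(((GramFreqPair) b).gram);
    }
  }
}

// ...and in the driver:
//   job.setMapOutputKeyClass(GramFreqPair.class);
//   job.setPartitionerClass(GramFreqPair.GramPartitioner.class);
//   job.setGroupingComparatorClass(GramFreqPair.GroupByGram.class);

With that wiring, each reduce() call sees all of one gram's records already ordered by frequency.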
Starting with the corpus itself (the case we're talking about), you have some nice tricks in here:

On Thu, Jan 7, 2010 at 6:46 PM, Drew Farris <[email protected]> wrote:
>
> The output of that map task is something like:
>
> k:(n-1)gram v:ngram
>

This is great right here - it helps you kill two birds with one stone: the join and the wordcount phases.

> k:ngram,ngram-frequency v:(n-1)gram,(n-1)gram freq
>
> e.g:
> k:the best,1, v:best,2
> k:best of,1, v:best,2
> k:best of,1, v:of,2
> k:of times,1 v:of,2
> k:the best,1, v:the,1
> k:of times,1 v:times,1
>

Yeah, once you're here, you're home free. This should really be a rather quick set of jobs, even on really big data, and even dealing with it as text.

> I'm also wondering about the best way to handle input. Line by line
> processing would miss ngrams spanning lines, but full document
> processing with the StandardAnalyzer+ShingleFilter will form ngrams
> across sentence boundaries.
>

These effects are just minor issues: you lose a little bit of signal at line endings, and you pick up some noise catching ngrams across sentence boundaries, but it's fractional compared to your whole set. Don't try to be too fancy and cram tons of lines together. If your data comes in different chunks than just one huge HDFS text file, you could certainly chunk it into bigger chunks (10, 100, 1000 lines, maybe) to reduce the newline error if necessary, but it's probably not needed. The sentence-boundary part gets washed out in the LLR step anyway (because those ngrams will almost always turn out to have a low LLR score).

What I've found I've had to do sometimes is something with stop words. If you don't use stop words at all, you end up getting a lot of relatively high-LLR-scoring ngrams like "up into" and "he would" - in general, pairings of a relatively rare unigram with a pronoun or preposition. Maybe there are other ways of avoiding that, but I've found that you do need to take some care with the stop words (though removing them altogether leads to some weird-looking ngrams if you want to display them somewhere).

> I'm interested in whether there's a more efficient way to structure
> the M/R passes. It feels a little funny to no-op a whole map cycle. It
> would almost be better if one could chain two reduces together.
>

Beware premature optimization - try this on a nice big monster set on a real cluster, and see how long it takes. I have a feeling you'll be pleasantly surprised. But even before that - show us a patch; maybe someone will have easy low-hanging-fruit optimization tricks.

-jake
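P.S. Just to make the "kill two birds with one stone" part concrete, here's roughly the shape of that first pass as I picture it - completely untested, bigrams only, with a dumb lowercase-and-split-on-whitespace tokenizer standing in for StandardAnalyzer+ShingleFilter, and with the head/tail bookkeeping simplified (the unigram count below only counts head-position occurrences, so it misses the last word on each line - the same line-ending noise we just agreed doesn't matter). All class names are made up:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BigramPassOne {

  // Map: for every bigram occurrence, emit it under each of its unigrams,
  // i.e. k:(n-1)gram v:ngram, so a single shuffle does the join and the count.
  public static class BigramMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] tokens = line.toString().toLowerCase().split("\\s+");
      for (int i = 0; i + 1 < tokens.length; i++) {
        if (tokens[i].isEmpty() || tokens[i + 1].isEmpty()) {
          continue;
        }
        String bigram = tokens[i] + ' ' + tokens[i + 1];
        outValue.set(bigram);
        outKey.set(tokens[i]);
        ctx.write(outKey, outValue);
        outKey.set(tokens[i + 1]);
        ctx.write(outKey, outValue);
      }
    }
  }

  // Reduce: count the unigram (head-position values only, see above), then
  // re-key every bigram with that count attached:
  //   k:ngram  v:(n-1)gram,(n-1)gram-freq
  public static class UnigramCountReducer extends Reducer<Text, Text, Text, Text> {
    private final Text outKey = new Text();

    protected void reduce(Text unigram, Iterable<Text> bigrams, Context ctx)
        throws IOException, InterruptedException {
      String head = unigram.toString() + ' ';
      // buffering is fine for a sketch; for stop-word-heavy keys like "the"
      // you'd want something smarter (or stop words stripped up front)
      List<String> buffered = new ArrayList<String>();
      long freq = 0;
      for (Text bigram : bigrams) {
        String s = bigram.toString();
        if (s.startsWith(head)) {
          freq++;
        }
        buffered.add(s);
      }
      Text outValue = new Text(unigram.toString() + ',' + freq);
      for (String bigram : buffered) {
        outKey.set(bigram);
        ctx.write(outKey, outValue);
      }
    }
  }
}

The second pass is then just an identity map plus a reduce keyed on the ngram: each bigram occurrence contributed one (head,freq) and one (tail,freq) value, so that reducer has the ngram count and both unigram counts sitting right there, ready for the LLR.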
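P.P.S. For completeness, the LLR itself is just the standard 2x2 G^2 statistic - something like the sketch below (also untested). The totalBigrams term is the total number of bigram occurrences in the corpus, which you'd carry along in a counter or a side file:

public final class Llr {

  // Dunning's log-likelihood ratio for the 2x2 contingency table of bigram "A B":
  //   k11 = count(A B)
  //   k12 = count(A _) - k11      (A followed by something other than B)
  //   k21 = count(_ B) - k11      (something other than A followed by B)
  //   k22 = totalBigrams - k11 - k12 - k21
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    long n = k11 + k12 + k21 + k22;
    double rowSums = xLogX(k11 + k12) + xLogX(k21 + k22);
    double colSums = xLogX(k11 + k21) + xLogX(k12 + k22);
    double cells   = xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22);
    // equivalent to 2 * sum_ij k_ij * ln( k_ij * n / (rowSum_i * colSum_j) )
    double llr = 2.0 * (xLogX(n) + cells - rowSums - colSums);
    return llr < 0.0 ? 0.0 : llr;   // guard against tiny negatives from rounding
  }

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }
}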
