Hello again, I think I found an answer to my question. If I write a new WritableComparable object that extends IntWritable and then overwrite the compareTo method, I can change the sorting order from ascending to descending. That will solve my problem for getting the top 100 most frequent words at each combiner/reducer.
Jim On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky <[email protected]> wrote: > Hi Aaron, > > Thanks for the advice. I actually thought of using multiple combiners and a > single reducer but I was worried about the key sorting phase to be a vaste > for my purpose. If the input is just a bunch of (word,count) pairs which is > in the order of TeraBytes, wouldn't sorting be an overkill? That's why I > thought a single serial program might perform better but I'm not sure how > long it would take to sort the keys in such a case so probably it is nothing > beyond speculation and I should go and give it a try to see how well it > performs. > > Secondly, I didn't quite understand how I can take advantage of the sorted > keys if I use an inverting mapper that transforms (k,v) --> (v,k) pairs. In > both cases, the combiners and the single reducer will still have to iterate > over all the (v,k) pairs to find the top 100 right? Or is there a way to say > something like "Give me the last 100 keys" at each reducer/combiner? > > Thanks in advance, > Jim > > > On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball <[email protected]> wrote: > >> (Addendum to my own post -- an identity mapper is probably not what you >> want. You'd actually want an inverting mapper that transforms (k, v) --> >> (v, >> k), to take advantage of the key-based sorting.) >> >> - Aaron >> >> On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball <[email protected]> >> wrote: >> >> > Hi Jim, >> > >> > The ability to perform locking of shared mutable state is a distinct >> > anti-goal of the MapReduce paradigm. One of the major benefits of >> writing >> > MapReduce programs is knowing that you don't have to worry about >> deadlock in >> > your code. If mappers could lock objects, then the failure and restart >> > semantics of individual tasks would be vastly more complicated. (What >> > happens if a map task crashes after it obtains a lock? Does it >> automatically >> > release the lock? Does some rollback mechanism undo everything that >> happened >> > after the lock was acquired? How would that work if--by definition--the >> > mapper node is no longer available?) >> > >> > A word frequency histogram function can certainly be written in >> MapReduce >> > without such state. You've got the right intuition, but a serial program >> is >> > not necessarily the best answer. Take the existing word count program. >> This >> > converts bags of words into (word, count) pairs. Then pass this through >> a >> > second pass, via an identity mapper to a set of combiners that each emit >> the >> > 100 most frequent words, to a single reducer that emits the 100 most >> > frequent words obtained by the combiners. >> > >> > Many other more complicated problems which seem to require shared state, >> in >> > truth, only require a second (or n+1'th) MapReduce pass. Adding multiple >> > passes is a very valid technique for building more complex dataflows. >> > >> > Cheers, >> > - Aaron >> > >> > >> > >> > On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky <[email protected] >> >wrote: >> > >> >> Hello, >> >> >> >> I was wondering if Hadoop provides thread safe shared variables that >> can >> >> be >> >> accessed from individual mappers/reducers along with a proper locking >> >> mechanism. To clarify things, let's say that in the word count example, >> I >> >> want to know the word that has the highest frequency and how many times >> it >> >> occured. I believe that the latter can be done using the counters that >> >> come >> >> with the Hadoop framework but I don't know how to get the word itself >> as a >> >> String. Of course, the problem can be more complicated like the top 100 >> >> words or so. >> >> >> >> I thought of writing a serial program which can go over the final >> output >> >> of >> >> the word count but this wouldn't be a good idea if the output file gets >> >> too >> >> large. However, if there is a way to define and use shared variables, >> this >> >> would be really easy to do on the fly during the word count's reduce >> >> phase. >> >> >> >> Thanks, >> >> Jim >> >> >> > >> > >> > >
