Hi Jim,

Locking shared mutable state is an explicit anti-goal of the MapReduce
paradigm. One of the major benefits of writing
MapReduce programs is knowing that you don't have to worry about deadlock in
your code. If mappers could lock objects, then the failure and restart
semantics of individual tasks would be vastly more complicated. (What
happens if a map task crashes after it obtains a lock? Does it automatically
release the lock? Does some rollback mechanism undo everything that happened
after the lock was acquired? How would that work if--by definition--the
mapper node is no longer available?)

A word frequency histogram function can certainly be written in MapReduce
without such state. You've got the right intuition, but a serial program is
not necessarily the best answer. Take the existing word count program, which
converts bags of words into (word, count) pairs. Then run a second MapReduce
pass over its output: an identity mapper feeds a set of combiners that each
emit their 100 most frequent words, and a single reducer then emits the 100
most frequent words among those the combiners produced.
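
As a rough sketch of that second pass, here is the same dataflow simulated in
plain Python rather than Hadoop. The function names (word_count, top_n,
second_pass) and the num_splits parameter are made up for illustration and are
not Hadoop APIs; the list slicing just stands in for how input splits would be
handed to separate map tasks.

```python
import heapq
from collections import Counter

N = 100  # how many top words to keep


def word_count(documents):
    """First pass: stands in for the existing word count job."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return list(counts.items())  # one (word, total_count) pair per word


def top_n(pairs, n=N):
    """Keep the n (word, count) pairs with the highest counts."""
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


def second_pass(pairs, num_splits=4, n=N):
    # Identity mapper: each split of (word, count) pairs passes through as-is.
    splits = [pairs[i::num_splits] for i in range(num_splits)]
    # Combiner step: each split independently keeps only its local top n.
    local_tops = [top_n(split, n) for split in splits]
    # Single reducer: merge the local winners and take the global top n.
    merged = [pair for local in local_tops for pair in local]
    return top_n(merged, n)
```

This works without shared state because the first pass has already summed each
word to a single total, so every word lives in exactly one split and a local
top-100 can never discard a word that belongs in the global top 100.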

Many other, more complicated problems that seem to require shared state in
truth require only a second (or (n+1)th) MapReduce pass. Adding more
passes is a valid technique for building more complex dataflows.

Cheers,
- Aaron


On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky <[email protected]> wrote:

> Hello,
>
> I was wondering if Hadoop provides thread safe shared variables that can be
> accessed from individual mappers/reducers along with a proper locking
> mechanism. To clarify things, let's say that in the word count example, I
> want to know the word that has the highest frequency and how many times it
> occurred. I believe that the latter can be done using the counters that come
> with the Hadoop framework but I don't know how to get the word itself as a
> String. Of course, the problem can be more complicated like the top 100
> words or so.
>
> I thought of writing a serial program which can go over the final output of
> the word count but this wouldn't be a good idea if the output file gets too
> large. However, if there is a way to define and use shared variables, this
> would be really easy to do on the fly during the word count's reduce phase.
>
> Thanks,
> Jim
>
