Another approach is to initialize each map task with an ID (using
JavaSpaces, something like ZooKeeper, or some aspect of the input data)
and then pack that with a map-local counter into a global ID. This
makes assumptions such as the number of map tasks being less than 2^10
and the number of records per mapper being less than 2^53. The packed
global IDs are consecutive per map task. If globally consecutive IDs
are needed, a second stage can create a histogram of map task ID ->
number of records and use it to transform the global IDs into globally
consecutive ones.
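A minimal sketch of this packing scheme, assuming a 10-bit task ID in the
high bits of a long and a 53-bit per-mapper counter in the low bits (the
class and method names are illustrative, not from any post in this thread):

```java
// Packs a map task ID and a map-local counter into one 63-bit long.
// Assumes fewer than 2^10 map tasks and fewer than 2^53 records per mapper.
public class PackedIdGenerator {
    private static final int TASK_ID_BITS = 10;
    private static final int COUNTER_BITS = 53;

    private final long taskId;
    private long counter = 0;

    public PackedIdGenerator(long taskId) {
        if (taskId < 0 || taskId >= (1L << TASK_ID_BITS)) {
            throw new IllegalArgumentException("task ID must fit in 10 bits");
        }
        this.taskId = taskId;
    }

    /** Returns the next packed global ID; consecutive within this map task. */
    public long nextId() {
        if (counter >= (1L << COUNTER_BITS)) {
            throw new IllegalStateException("per-mapper counter overflow");
        }
        return (taskId << COUNTER_BITS) | counter++;
    }
}
```

Because the task ID occupies the high bits, IDs from different mappers can
never collide, and sorting the packed IDs groups records by map task.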
Mark Kerzner wrote:
Michael,
environment variables are available in Java, but the environment itself is
not shared between instances. I read your code - you are solving exactly the
same problem I am interested in - but I did not see how it works in a
distributed environment.
By the way, it occurs to me that JavaSpaces, a different approach to
distributed computing that was trumped by Hadoop, could be used here! Just
run one instance with GigaSpaces at all times, and you get your
self-increment for any number of jobs. It is perfect for concurrent
processing and very fast.
Thank you,
Mark
On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[email protected]> wrote:
I posted an approach to this using streaming, but if the environment
variables are available in the standard Java interface, this may work for you.
http://www.mail-archive.com/[email protected]/msg09079.html
You'll have to be able to tolerate some small gaps in the ids.
Michael
Mark Kerzner wrote:
Aaron, although your notes are not a ready solution, but they are a great
help.
Thank you,
Mark
On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[email protected]> wrote:
There is no in-MapReduce mechanism for cross-task synchronization. You'll
need to use something like Zookeeper for this, or another external
database.
Note that this will greatly complicate your life.
If I were you, I'd try to either redesign my pipeline elsewhere to
eliminate this need, or maybe get really clever. For example, do your
numbers need to be sequential, or just unique?
If the latter, then take the byte offset into the reducer's current
output file and combine that with the reducer ID (e.g.,
<current-byte-offset><zero-padded-reducer-id>) to guarantee that they're
all building unique sequences. If the former... rethink your pipeline? :)
- Aaron
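The offset-plus-reducer-ID idea above can be sketched as a small helper;
the class name, method name, and five-digit padding width are assumptions
for illustration, not anything from the original post:

```java
// Builds a unique record ID in the form
// <current-byte-offset><zero-padded-reducer-id>. Because the reducer ID
// has a fixed width, IDs from different reducers cannot collide, and
// within one reducer each record has a distinct byte offset.
public class UniqueIdBuilder {
    /**
     * @param byteOffset byte offset of the record in this reducer's output file
     * @param reducerId  this reducer's task index (here assumed < 100,000)
     */
    public static String buildId(long byteOffset, int reducerId) {
        // 5 digits of zero-padding supports up to 100,000 reducers.
        return String.format("%d%05d", byteOffset, reducerId);
    }
}
```

The resulting IDs are unique but not sequential: parsing the last five
digits recovers the reducer, and the remaining prefix recovers the offset.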
On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[email protected]> wrote:
Hi,
I need to number all output records consecutively, like 1, 2, 3...
This is no problem with one reducer: make recordId an instance variable
in the Reducer class and set conf.setNumReduceTasks(1).
However, this is an architectural decision forced by processing needs,
and the single reducer becomes a bottleneck. Can I have a global
variable for all reducers, which would give each the next consecutive
recordId? In the database scenario, this would be the unique autokey.
How to do it in MapReduce?
Thank you