Oh, I see. Very smart.

In my case, I need consecutive numbers with no gaps, and I need them in the
order in which Hadoop sorted the maps. So I don't see how I could apply this
approach, but thank you - it has been a great discussion, which helped me
consider all the issues around this and which brought up the idea of
JavaSpaces - I am going to check that one out.

Mark

On Wed, Oct 28, 2009 at 12:57 PM, Michael Klatt <[email protected]> wrote:

>
> Hi Mark,
>
> Each mapper (or reducer) has an environment variable "mapred_map_tasks" (or
> "mapred_reduce_tasks") which describes how many tasks the map or reduce
> job was split into.  It also has a variable "mapred_task_id" which contains
> a unique identifier for the task.  Using these two together, each task can
> generate a sequence of numbers which won't collide with the sequences of
> other mappers (or reducers).
>
> For example, if a job has 20 (map or reduce) tasks, then task #1 will
> generate the sequence (assuming the first id is zero):
>
> 0,20,40,60,80,100,120...
>
> and task #2 will generate
>
> 1,21,41,61,81,101,121...
>
> and so on.  This approach works in both the mapper and the reducer... you
> just have to look at different variables.
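>
> A minimal Java sketch of that interleaving (my illustration, not Michael's
> code; the class and variable names are assumptions, and in the Java API the
> task index and task count would come from the job configuration rather than
> environment variables):
>
> // Each task emits taskIndex, taskIndex + numTasks, taskIndex + 2*numTasks,
> // ... so the sequences from different tasks never collide.
> public class InterleavedIds {
>     private final long taskIndex;  // zero-based index of this task
>     private final long numTasks;   // total map (or reduce) tasks
>     private long count = 0;
>
>     public InterleavedIds(long taskIndex, long numTasks) {
>         this.taskIndex = taskIndex;
>         this.numTasks = numTasks;
>     }
>
>     // With 20 tasks: task 0 yields 0, 20, 40, ... and task 1 yields
>     // 1, 21, 41, ... exactly as in the example above.
>     public long nextId() {
>         return taskIndex + (count++) * numTasks;
>     }
> }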
>
> Michael
>
>
> Mark Kerzner wrote:
>
>> Michael,
>>
>> environment variables are available in Java, but the environment itself is
>> not shared between instances. I read your code - you are solving exactly
>> the same problem I am interested in - but I did not see how it works in a
>> distributed environment.
>>
>> By the way, it occurs to me that JavaSpaces, which is a different approach
>> to distributed computing, trumped by Hadoop, could be used here! Just run
>> one instance with GigaSpaces at all times, and you get your auto-increment
>> for any number of jobs. It is perfect for concurrent processing and very
>> fast.
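>>
>> A minimal sketch of that counter idea (assuming the standard
>> Jini/JavaSpaces API; the Counter entry class and field names are
>> hypothetical):
>>
>> import net.jini.core.entry.Entry;
>> import net.jini.core.lease.Lease;
>> import net.jini.space.JavaSpace;
>>
>> // Hypothetical shared-counter entry: JavaSpaces entries need public
>> // fields and a public no-arg constructor.
>> public class Counter implements Entry {
>>     public Long value;
>>
>>     public Counter() { }
>>     public Counter(Long value) { this.value = value; }
>>
>>     // take() removes the matching entry from the space, so concurrent
>>     // callers block until the incremented counter is written back,
>>     // which makes the increment atomic across JVMs.
>>     public static long nextId(JavaSpace space) throws Exception {
>>         Counter c = (Counter) space.take(new Counter(), null, Long.MAX_VALUE);
>>         long id = c.value;
>>         c.value = id + 1;
>>         space.write(c, null, Lease.FOREVER);
>>         return id;
>>     }
>> }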
>>
>> Thank you,
>> Mark
>>
>> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[email protected]>
>> wrote:
>>
>>> I posted an approach to this using streaming, but if the environment
>>> variables are available in the standard Java interface, this may work
>>> for you.
>>>
>>> http://www.mail-archive.com/[email protected]/msg09079.html
>>>
>>> You'll have to be able to tolerate some small gaps in the ids.
>>>
>>> Michael
>>>
>>>
>>> Mark Kerzner wrote:
>>>
>>>
>>>> Aaron, although your notes are not a ready-made solution, they are a
>>>> great help.
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[email protected]>
>>>> wrote:
>>>>
>>>>> There is no in-MapReduce mechanism for cross-task synchronization.
>>>>> You'll need to use something like ZooKeeper for this, or another
>>>>> external database. Note that this will greatly complicate your life.
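>>>>>
>>>>> One sketch of such an external counter, using ZooKeeper (my
>>>>> illustration, not Aaron's; the znode path is an assumption):
>>>>>
>>>>> import org.apache.zookeeper.CreateMode;
>>>>> import org.apache.zookeeper.ZooDefs;
>>>>> import org.apache.zookeeper.ZooKeeper;
>>>>>
>>>>> public class ZkIds {
>>>>>     // Each create() of a PERSISTENT_SEQUENTIAL znode returns a path
>>>>>     // whose suffix is a monotonically increasing number assigned by
>>>>>     // ZooKeeper, giving a cluster-wide id source.
>>>>>     public static long nextId(ZooKeeper zk) throws Exception {
>>>>>         String path = zk.create("/ids/id-", new byte[0],
>>>>>                 ZooDefs.Ids.OPEN_ACL_UNSAFE,
>>>>>                 CreateMode.PERSISTENT_SEQUENTIAL);
>>>>>         return Long.parseLong(path.substring(path.lastIndexOf('-') + 1));
>>>>>     }
>>>>> }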
>>>>>
>>>>> If I were you, I'd try either to redesign my pipeline to eliminate
>>>>> this need, or to get really clever. For example, do your numbers need
>>>>> to be sequential, or just unique?
>>>>>
>>>>> If the latter, then take the byte offset into the reducer's current
>>>>> output file and combine that with the reducer id (e.g.,
>>>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that
>>>>> they're all building unique sequences. If the former... rethink your
>>>>> pipeline? :)
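>>>>>
>>>>> A rough sketch of that id construction (my illustration; the 4-digit
>>>>> padding width is an assumption and must cover the job's maximum
>>>>> reducer count):
>>>>>
>>>>> // Compose <current-byte-offset><zero-padded-reducer-id> into one id.
>>>>> public static String uniqueId(long byteOffset, int reducerId) {
>>>>>     return byteOffset + String.format("%04d", reducerId);
>>>>> }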
>>>>>
>>>>> - Aaron
>>>>>
>>>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I need to number all output records consecutively, like 1, 2, 3...
>>>>>>
>>>>>> This is no problem with one reducer: make recordId an instance
>>>>>> variable in the Reducer class and set conf.setNumReduceTasks(1).
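>>>>>>
>>>>>> (A bare-bones sketch of that single-reducer numbering, using the old
>>>>>> Hadoop API; the class name and key/value types are illustrative:)
>>>>>>
>>>>>> import java.io.IOException;
>>>>>> import java.util.Iterator;
>>>>>> import org.apache.hadoop.io.LongWritable;
>>>>>> import org.apache.hadoop.io.Text;
>>>>>> import org.apache.hadoop.mapred.*;
>>>>>>
>>>>>> // With conf.setNumReduceTasks(1) there is a single Reducer instance,
>>>>>> // so this one instance variable numbers all output records globally.
>>>>>> public class NumberingReducer extends MapReduceBase
>>>>>>         implements Reducer<Text, Text, LongWritable, Text> {
>>>>>>     private long recordId = 1;
>>>>>>
>>>>>>     public void reduce(Text key, Iterator<Text> values,
>>>>>>             OutputCollector<LongWritable, Text> output,
>>>>>>             Reporter reporter) throws IOException {
>>>>>>         while (values.hasNext()) {
>>>>>>             output.collect(new LongWritable(recordId++), values.next());
>>>>>>         }
>>>>>>     }
>>>>>> }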
>>>>>>
>>>>>> However, this is an architectural decision forced by processing
>>>>>> needs, where the single reducer becomes a bottleneck. Can I have a
>>>>>> global variable for all reducers, which would give each one the next
>>>>>> consecutive recordId? In the database scenario, this would be the
>>>>>> unique auto-increment key. How can I do it in MapReduce?
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>>
>>>>>
>>>>
>>
