Brien,
- I am on EC2; what would be the advantage of using Zookeeper over JavaSpaces? Either would have to be maintained by me, as neither is provided on EC2 directly.
- "pack that with a map-local counter into a global ID" - you mean, just take the global counter and make the local instance counter equal to it?
- 2^53 is quite sufficient for my purposes, but where is that number coming from?
- Looking at your last point, I saw what I had previously missed: I need the numbers consecutive within each reducer, and then consecutive between reducers. I assume that the reducers are ordered; for example, if my records are sorted 1, 2, ..., 6, then one reducer would get maps 1, 2, 3, and the other maps 4, 5, 6. If that's the case, I need to know how the reducers are ordered, and then I could simply run the second stage.

Thank you,
Mark

On Wed, Oct 28, 2009 at 1:07 PM, brien colwell <[email protected]> wrote:

> Another approach is to initialize each map task with an ID (using
> JavaSpaces, something like Zookeeper, or some aspect of the input data) and
> then pack that with a map-local counter into a global ID. This makes
> assumptions such as: the number of map tasks is less than 2^10, and the
> number of records per mapper is less than 2^53. The packed global IDs are
> consecutive per map task. If globally consecutive IDs are needed, a second
> stage can create a histogram of map task ID -> number of records and use it
> to transform the global IDs to globally consecutive ones.
>
> Mark Kerzner wrote:
>
>> Michael,
>>
>> Environment variables are available in Java, but the environment itself
>> is not shared between instances. I read your code - you are solving
>> exactly the same problem I am interested in - but I did not see how it
>> works in a distributed environment.
>>
>> By the way, it occurs to me that JavaSpaces, a different approach to
>> distributed computing, trumped by Hadoop, could be used here!
>> Just run one instance with GigaSpaces at all times, and you've got your
>> self-increment for any number of jobs. It is perfect for concurrent
>> processing and very fast.
>>
>> Thank you,
>> Mark
>>
>> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[email protected]> wrote:
>>
>>> I posted an approach to this using streaming, but if the environment
>>> variables are available in the standard Java interface, this may work
>>> for you.
>>>
>>> http://www.mail-archive.com/[email protected]/msg09079.html
>>>
>>> You'll have to be able to tolerate some small gaps in the IDs.
>>>
>>> Michael
>>>
>>> Mark Kerzner wrote:
>>>
>>>> Aaron, although your notes are not a ready solution, they are a great
>>>> help.
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[email protected]> wrote:
>>>>
>>>>> There is no in-MapReduce mechanism for cross-task synchronization.
>>>>> You'll need to use something like Zookeeper for this, or another
>>>>> external database. Note that this will greatly complicate your life.
>>>>>
>>>>> If I were you, I'd try to either redesign my pipeline to eliminate
>>>>> this need, or maybe get really clever. For example, do your numbers
>>>>> need to be sequential, or just unique?
>>>>>
>>>>> If the latter, then take the byte offset into the reducer's current
>>>>> output file and combine it with the reducer ID (e.g.,
>>>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that
>>>>> they're all building unique sequences. If the former... rethink your
>>>>> pipeline? :)
>>>>>
>>>>> - Aaron
>>>>>
>>>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I need to number all output records consecutively, like 1, 2, 3...
>>>>>> This is no problem with one reducer: make recordId an instance variable
>>>>>> in the Reducer class and set conf.setNumReduceTasks(1).
>>>>>>
>>>>>> However, that is an architectural decision forced on me by the
>>>>>> processing, and the single reducer becomes a bottleneck. Can I have a
>>>>>> global variable for all reducers, which would give each one the next
>>>>>> consecutive recordId? In the database scenario, this would be the
>>>>>> unique autokey. How to do it in MapReduce?
>>>>>>
>>>>>> Thank you
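For reference, Brien's packing scheme above can be sketched in plain Java. This is a sketch under the stated assumptions (fewer than 2^10 map tasks, fewer than 2^53 records per mapper, both packed into one 64-bit long); the class and method names are mine. As for where 2^53 comes from: it happens to be the largest integer a double represents exactly, which may be the reason, though the thread does not say.

```java
public class PackedIds {
    static final int COUNTER_BITS = 53; // assumes < 2^53 records per mapper

    // Pack a map-task ID (< 2^10) and a map-local counter (< 2^53) into a
    // single 64-bit global ID. The resulting IDs are consecutive per map task.
    static long pack(long taskId, long localCounter) {
        return (taskId << COUNTER_BITS) | localCounter;
    }

    static long taskOf(long globalId)    { return globalId >>> COUNTER_BITS; }
    static long counterOf(long globalId) { return globalId & ((1L << COUNTER_BITS) - 1); }

    public static void main(String[] args) {
        long id = pack(3, 7);
        System.out.println(taskOf(id) + " / " + counterOf(id)); // 3 / 7
    }
}
```

Each mapper would be initialized with its `taskId` (from Zookeeper, JavaSpaces, or the input split) and increment `localCounter` per record.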
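The second stage Brien mentions (turning per-task-consecutive IDs into globally consecutive ones) reduces to a prefix sum over the histogram of per-task record counts. A minimal sketch, assuming the histogram fits in memory; the names are mine, not from any Hadoop API:

```java
import java.util.Map;
import java.util.TreeMap;

public class ConsecutiveIds {
    // Given the histogram of map-task ID -> number of records, compute each
    // task's starting offset in the global numbering: a running prefix sum,
    // iterating tasks in ID order.
    static Map<Long, Long> startOffsets(TreeMap<Long, Long> countsByTask) {
        Map<Long, Long> starts = new TreeMap<>();
        long running = 0;
        for (Map.Entry<Long, Long> e : countsByTask.entrySet()) {
            starts.put(e.getKey(), running);
            running += e.getValue();
        }
        return starts;
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> counts = new TreeMap<>();
        counts.put(0L, 3L);
        counts.put(1L, 2L);
        counts.put(2L, 5L);
        System.out.println(startOffsets(counts)); // {0=0, 1=3, 2=5}
    }
}
```

A record with local counter c from task t then gets the globally consecutive ID `startOffsets.get(t) + c`.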
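Aaron's unique-but-not-sequential alternative can also be sketched: concatenate the reducer's current output byte offset with a zero-padded reducer ID. Offsets never repeat within one output file and reducer IDs never repeat across files, so the pair is globally unique. The helper below is hypothetical; in a real job the offset would come from the output stream and the reducer ID from the task ID.

```java
public class UniqueIds {
    // Build <current-byte-offset><zero-padded-reducer-id>. Padding the
    // reducer ID to a fixed width keeps the concatenation unambiguous.
    static String uniqueId(long byteOffset, int reducerId, int numReducers) {
        int width = String.valueOf(numReducers - 1).length();
        return byteOffset + String.format("%0" + width + "d", reducerId);
    }

    public static void main(String[] args) {
        System.out.println(uniqueId(1234, 7, 100)); // 123407
    }
}
```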
