On Wed, Oct 28, 2009 at 2:20 PM, Mark Kerzner <[email protected]> wrote:

> Brien,
>
>    - I am on EC2. What would be the advantage of using Zookeeper over
>    JavaSpaces? Either would have to be maintained by me, as neither is
>    provided on EC2 directly.
>    - "Pack that with a map-local counter into a global ID" - you mean, just
>    take the global counter and make the local instance counter equal to it?
>    - 2^53 is quite sufficient for my purposes, but where does that number
>    come from?
>    - Looking at your last point, I see what I had previously missed: I need
>    numbers consecutive within each reducer, and then consecutive between
>    reducers. I assume that reducers are sorted. For example, if my records
>    are sorted 1,2,...,6, then one reducer would get maps 1,2,3, and the
>    other maps 4,5,6. If that's the case, I need to know how the reducers
>    are sorted. Then I could simply run the second stage.
>
> Thank you,
> Mark
>
> On Wed, Oct 28, 2009 at 1:07 PM, brien colwell <[email protected]> wrote:
>
>> Another approach is to initialize each map task with an ID (using
>> JavaSpaces, something like Zookeeper, or some aspect of the input data)
>> and then pack that with a map-local counter into a global ID. This
>> assumes that the number of map tasks is less than 2^10 and that the
>> number of records per mapper is less than 2^53. The packed global IDs
>> are consecutive per map task. If globally consecutive IDs are needed, a
>> second stage can build a histogram of map task ID -> number of records
>> and use it to transform the global IDs into globally consecutive ones.
>>
>> Mark Kerzner wrote:
>>
>>> Michael,
>>>
>>> Environment variables are available in Java, but the environment itself
>>> is not shared between instances. I read your code - you are solving
>>> exactly the same problem I am interested in - but I did not see how it
>>> works in a distributed environment.
>>>
>>> By the way, it occurs to me that JavaSpaces, a different approach to
>>> distributed computing that was trumped by Hadoop, could be used here!
>>> Just run one instance with GigaSpaces at all times, and you get your
>>> auto-increment for any number of jobs. It is perfect for concurrent
>>> processing and very fast.
>>>
>>> Thank you,
>>> Mark
>>>
>>> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[email protected]> wrote:
>>>
>>>> I posted an approach to this using streaming, but if the environment
>>>> variables are available in the standard Java interface, this may work
>>>> for you.
>>>>
>>>> http://www.mail-archive.com/[email protected]/msg09079.html
>>>>
>>>> You'll have to be able to tolerate some small gaps in the IDs.
>>>>
>>>> Michael
>>>>
>>>> Mark Kerzner wrote:
>>>>
>>>>> Aaron, although your notes are not a ready solution, they are a
>>>>> great help.
>>>>>
>>>>> Thank you,
>>>>> Mark
>>>>>
>>>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[email protected]> wrote:
>>>>>
>>>>>> There is no in-MapReduce mechanism for cross-task synchronization.
>>>>>> You'll need to use something like Zookeeper for this, or another
>>>>>> external database. Note that this will greatly complicate your life.
>>>>>>
>>>>>> If I were you, I'd try to either redesign my pipeline to eliminate
>>>>>> this need, or maybe get really clever. For example, do your numbers
>>>>>> need to be sequential, or just unique?
>>>>>>
>>>>>> If the latter, then take the byte offset into the reducer's current
>>>>>> output file and combine it with the reducer ID (e.g.,
>>>>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that
>>>>>> they're all building unique sequences. If the former... rethink your
>>>>>> pipeline?
>>>>>> :)
>>>>>>
>>>>>> - Aaron
>>>>>>
>>>>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I need to number all output records consecutively: 1, 2, 3, ...
>>>>>>>
>>>>>>> This is no problem with one reducer: make recordId an instance
>>>>>>> variable in the Reducer class and set conf.setNumReduceTasks(1).
>>>>>>>
>>>>>>> However, that is an architectural decision forced by a processing
>>>>>>> need, and the single reducer becomes a bottleneck. Can I have a
>>>>>>> global variable for all reducers, which would give each one the next
>>>>>>> consecutive recordId? In the database scenario, this would be the
>>>>>>> unique autokey. How do I do it in MapReduce?
>>>>>>>
>>>>>>> Thank you
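For concreteness, the two ID schemes discussed in the thread can be sketched in plain Java (outside Hadoop): Brien's task-ID/local-counter packing with the histogram second stage, and Aaron's byte-offset-plus-reducer-ID key. The class and method names are mine, and the 10/53 bit split and the 4-digit reducer-ID width are illustrative assumptions, not fixed choices.

```java
public class IdSchemes {
    // Brien's packing: 10 bits of map task ID (< 1024 tasks) and 53 bits
    // of task-local counter (< 2^53 records per task) packed into one
    // 63-bit long. The resulting IDs are consecutive per map task.
    private static final int COUNTER_BITS = 53;
    private static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1;

    public static long pack(long taskId, long localCounter) {
        if (taskId < 0 || taskId >= (1L << 10)
                || localCounter < 0 || localCounter > COUNTER_MASK) {
            throw new IllegalArgumentException("taskId or counter out of range");
        }
        return (taskId << COUNTER_BITS) | localCounter;
    }

    public static long taskId(long packed) {
        return packed >>> COUNTER_BITS;
    }

    public static long localCounter(long packed) {
        return packed & COUNTER_MASK;
    }

    // Second stage: a prefix sum over the per-task record counts (the
    // "histogram") gives each task's starting offset; then
    // offsets[taskId] + localCounter is globally consecutive.
    public static long[] offsets(long[] countsByTaskId) {
        long[] off = new long[countsByTaskId.length];
        long running = 0;
        for (int i = 0; i < off.length; i++) {
            off[i] = running;
            running += countsByTaskId[i];
        }
        return off;
    }

    // Aaron's unique-but-not-consecutive key: the reducer's current output
    // byte offset with a zero-padded reducer ID appended. Four digits
    // (up to 10,000 reducers) is an arbitrary width chosen here.
    public static String uniqueKey(long byteOffset, int reducerId) {
        return String.format("%d%04d", byteOffset, reducerId);
    }
}
```

For example, with per-task record counts {3, 2, 4} the offsets are {0, 3, 5}, so task 1's local record 0 gets global ID 3; and uniqueKey(1024, 7) yields "10240007". Note the byte-offset keys are unique but neither consecutive nor dense.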
My two cents here. It seems like what you want is an atomic auto-ID, in order, with no gaps. The "in order" and "no gaps" requirements really constrain you, because they rule out using the map/reduce task IDs, hostnames, etc. to generate distinct keys. Given those two requirements, the best approach may just be to use a single mapper/reducer. Take a step back and look at the big picture: no matter how much software you add, you are asking for a single, serialized, locking ID factory. Building that ID factory as a multi-node Zookeeper-locked service is probably more work, and slower, than just doing it in a single-threaded mapper or reducer.
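To make that concrete, here is a minimal sketch (the class name is mine) of what the single-threaded ID factory amounts to: with setNumReduceTasks(1), a plain instance counter in the reducer hands out gap-free, in-order IDs with no coordination at all.

```java
// Minimal single-threaded ID factory: with one reducer, a plain
// instance counter trivially guarantees in-order, gap-free IDs.
public class IdFactory {
    private long next = 1; // first ID to hand out

    public long nextId() {
        return next++;
    }
}
```

The reducer would hold one IdFactory as an instance field and call nextId() once per output record; any multi-node variant has to serialize exactly this increment anyway, just with locking overhead on top.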
