I agree, Edward. One reducer is what I have now, and it works. It is, or may
become, a bottleneck. When it does, I can go back to using multiple
reducers and add a second MR job to renumber. That way I can stick with
the current setup until optimizing becomes necessary, without introducing
any changes or new software - for, as you say, it might become a pain. I
follow the rule (which I got from Scott Meyers' Effective C++): avoid
premature optimization.
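
To sketch the renumbering job I mean (plain Java, hypothetical names - the
real logic would live in a second MR job's single pass over the per-reducer
record counts):

```java
// Sketch only: given each reducer's record count (in reducer order),
// compute the starting offset each reducer adds to its local consecutive
// ids so that the combined result is globally consecutive.
class Renumber {
    static long[] startOffsets(long[] countsPerReducer) {
        long[] offsets = new long[countsPerReducer.length];
        long running = 0;
        for (int i = 0; i < countsPerReducer.length; i++) {
            offsets[i] = running;           // first global id for reducer i
            running += countsPerReducer[i]; // next reducer starts after ours
        }
        return offsets;
    }
}
```

So if reducer 0 emitted 3 records and reducer 1 emitted 3, reducer 1 would
add an offset of 3 to its local ids 1,2,3.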

mark

On Wed, Oct 28, 2009 at 2:22 PM, Edward Capriolo <[email protected]> wrote:

> On Wed, Oct 28, 2009 at 2:20 PM, Mark Kerzner <[email protected]>
> wrote:
> > Brien,
> >
> >
> >   - I am on EC2; what would be the advantage of using Zookeeper over
> >   JavaSpaces? Either would have to be maintained by me, as they are not
> >   provided on EC2 directly;
> >   - "pack that with a map-local counter into a global ID" - you mean,
> >   just take the global counter and make the local instance counter equal
> >   to it?
> >   - 2^53 is quite sufficient for my purposes, but where is that number
> >   coming from?
> >   - Looking at your last point, I saw what I had previously missed: I
> >   need numbers consecutive within each reducer, and then I need them
> >   consecutive between reducers. I assume that reducers are sorted. For
> >   example, if my records are sorted 1,2,...,6, then one reducer would get
> >   maps 1,2,3, and the other one maps 4,5,6. If that's the case, I need to
> >   know how the reducers are sorted. Then I could simply run the second
> >   stage.
> >
> > Thank you,
> > Mark
> >
> > On Wed, Oct 28, 2009 at 1:07 PM, brien colwell <[email protected]>
> > wrote:
> >
> >> Another approach is to initialize each map task with an ID (using
> >> JavaSpaces, something like Zookeeper, or some aspect of the input data)
> >> and then pack that with a map-local counter into a global ID. This makes
> >> assumptions such as the number of map tasks being less than 2^10 and the
> >> number of records per mapper being less than 2^53. The packed global IDs
> >> are consecutive per map task. If globally consecutive IDs are needed, a
> >> second stage can create a histogram of map task ID -> number of records
> >> and use it to transform the global IDs into globally consecutive ones.
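
A minimal sketch of the packing brien describes, under his stated
assumptions - at most 2^10 map tasks and fewer than 2^53 records per
mapper, so 10 + 53 = 63 bits and the packed id fits in a signed 64-bit
long (class and method names are hypothetical):

```java
// Sketch: pack <map-task-id, map-local-counter> into one long.
// High 10 bits = task id, low 53 bits = counter, so ids are
// consecutive within a task and distinct across tasks.
class PackedId {
    static final int COUNTER_BITS = 53;
    static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1;

    static long pack(long taskId, long localCounter) {
        return (taskId << COUNTER_BITS) | (localCounter & COUNTER_MASK);
    }
    static long taskId(long packed)  { return packed >>> COUNTER_BITS; }
    static long counter(long packed) { return packed & COUNTER_MASK; }
}
```

The second-stage histogram can then unpack the task id, look up that task's
starting offset, and rewrite the counter into a globally consecutive id.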
> >>
> >>
> >>
> >>
> >> Mark Kerzner wrote:
> >>
> >>> Michael,
> >>>
> >>> environment variables are available in Java, but the environment
> >>> itself is not shared between instances. I read your code - you are
> >>> solving exactly the same problem I am interested in - but I did not see
> >>> how it works in a distributed environment.
> >>>
> >>> By the way, it occurs to me that JavaSpaces, which is a different
> >>> approach to distributed computing, trumped by Hadoop, could be used
> >>> here! Just run one instance with GigaSpaces at all times, and you get
> >>> your self-increment for any number of jobs. It is perfect for
> >>> concurrent processing and very fast.
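
The idea being a single shared counter service; as a purely local stand-in
for what such a service would provide (the actual JavaSpaces/GigaSpaces API
is omitted here, and the class name is hypothetical):

```java
import java.util.concurrent.atomic.AtomicLong;

// Conceptual stand-in for a shared self-increment service: one process
// owns one counter, and every job asks it for the next id. A real
// JavaSpaces/GigaSpaces deployment would expose this over the network.
class IdFactory {
    private final AtomicLong next = new AtomicLong(1);

    long nextId() {
        return next.getAndIncrement(); // atomic, so ids have no gaps
    }
}
```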
> >>>
> >>> Thank you,
> >>> Mark
> >>>
> >>> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[email protected]>
> >>> wrote:
> >>>
> >>>
> >>>
> >>>> I posted an approach to this using streaming, but if the environment
> >>>> variables are available in the standard Java interface, this may work
> >>>> for you.
> >>>>
> >>>> http://www.mail-archive.com/[email protected]/msg09079.html
> >>>>
> >>>> You'll have to be able to tolerate some small gaps in the ids.
> >>>>
> >>>> Michael
> >>>>
> >>>>
> >>>> Mark Kerzner wrote:
> >>>>
> >>>>
> >>>>
> >>>>> Aaron, although your notes are not a ready solution, they are a
> >>>>> great help.
> >>>>>
> >>>>> Thank you,
> >>>>> Mark
> >>>>>
> >>>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> There is no in-MapReduce mechanism for cross-task synchronization.
> >>>>>> You'll need to use something like Zookeeper for this, or another
> >>>>>> external database. Note that this will greatly complicate your life.
> >>>>>>
> >>>>>> If I were you, I'd try to either redesign my pipeline elsewhere to
> >>>>>> eliminate this need, or maybe get really clever. For example, do
> >>>>>> your numbers need to be sequential, or just unique?
> >>>>>>
> >>>>>> If the latter, then take the byte offset into the reducer's current
> >>>>>> output file and combine that with the reducer id (e.g.,
> >>>>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that
> >>>>>> they're all building unique sequences. If the former... rethink your
> >>>>>> pipeline? :)
> >>>>>>
> >>>>>> - Aaron
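
A sketch of the unique-but-not-consecutive id Aaron suggests - the byte
offset concatenated with a zero-padded reducer id; the padding width of 4
digits is an assumption, and the names are hypothetical:

```java
// Sketch: <current-byte-offset><zero-padded-reducer-id>.
// Unique across reducers (each output file has distinct offsets, and the
// reducer id suffix distinguishes files), but not consecutive.
// Assumes fewer than 10000 reducers for the 4-digit padding.
class OffsetId {
    static String uniqueId(long byteOffset, int reducerId) {
        return byteOffset + String.format("%04d", reducerId);
    }
}
```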
> >>>>>>
> >>>>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I need to number all output records consecutively, like 1,2,3...
> >>>>>>>
> >>>>>>> This is no problem with one reducer: make recordId an instance
> >>>>>>> variable in the Reducer class, and set conf.setNumReduceTasks(1).
> >>>>>>>
> >>>>>>> However, that is an architectural decision forced by a processing
> >>>>>>> need, where the reducer becomes a bottleneck. Can I have a global
> >>>>>>> variable for all reducers, which would give each the next
> >>>>>>> consecutive recordId? In the database scenario, this would be the
> >>>>>>> unique autokey. How to do it in MapReduce?
> >>>>>>>
> >>>>>>> Thank you
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
>
> My two cents here.
>
> It seems like what you want is an atomic auto-id, in order, with no
> gaps. In particular, the "no gaps" and "in order" requirements really
> constrain you, because now you can't use the map/reduce ids, hostname,
> etc. to generate distinct keys.
>
> If you have both requirements, "in order" and "no gaps", the best
> approach may just be to use one mapper/reducer. Take a step back and
> look at the big picture: no matter how much software you add, you are
> asking for a single/serialized/locking "IDFACTORY". Building a
> multinode zookeeper locking "IDFACTORY" is probably going to be more
> work and slower than just doing it in a single-threaded mapper or
> reducer.
>
