I agree, Edward. One reducer is what I have now, and it works. It is, or may become, a bottleneck. When it does, I can go back to using multiple reducers and add a second MR job to renumber. That way, I can stick with the current setup until it becomes necessary to optimize, and not introduce any changes or new software, which, as you say, might become a pain. I'll follow the rule (which I got from Scott Meyers' Effective C++): avoid premature optimization.
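A minimal sketch of the renumbering arithmetic that second MR job would do. The class and method names are mine, and in a real job the per-reducer record counts would come from the first job's counters or output metadata rather than an in-memory array; this only shows the offset computation:

```java
import java.util.Arrays;

public class RenumberSketch {
    // Given the number of records each reducer emitted in the first job,
    // compute each reducer's starting offset so that
    // offset[reducerId] + localId is globally consecutive.
    static long[] startingOffsets(long[] recordsPerReducer) {
        long[] offsets = new long[recordsPerReducer.length];
        long running = 0;
        for (int i = 0; i < recordsPerReducer.length; i++) {
            offsets[i] = running;
            running += recordsPerReducer[i];
        }
        return offsets;
    }

    // The second job rewrites (reducerId, localId) into a global id.
    static long globalId(long[] offsets, int reducerId, long localId) {
        return offsets[reducerId] + localId;
    }

    public static void main(String[] args) {
        long[] counts = {3, 3};  // reducer 0 wrote 3 records, reducer 1 wrote 3
        long[] offsets = startingOffsets(counts);
        System.out.println(Arrays.toString(offsets)); // [0, 3]
        System.out.println(globalId(offsets, 1, 2));  // record 2 of reducer 1 -> 5
    }
}
```

This assumes reducer output order matches reducer id, which is how sorted keys are partitioned by default with a total-order partitioner.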
mark

On Wed, Oct 28, 2009 at 2:22 PM, Edward Capriolo <[email protected]> wrote:

> On Wed, Oct 28, 2009 at 2:20 PM, Mark Kerzner <[email protected]> wrote:
> > Brien,
> >
> > - I am on EC2, what would be the advantage of using Zookeeper over
> > JavaSpaces? Either would have to be maintained by me, as they are not
> > provided on EC2 directly;
> > - "pack that with a map-local counter into a global ID" - you mean, just
> > take the global counter and make the local instance counter equal to it?
> > - 2^53 is quite sufficient for my purposes, but where is the number
> > coming from?
> > - Looking at your last point, I saw what I had previously missed: I need
> > numbers consecutive within each reducer, and then I need them consecutive
> > between reducers. I assume that reducers are sorted. For example, if my
> > records are sorted 1,2,...,6, then one reducer would get maps 1,2,3, and
> > the other one maps 4,5,6. If that's the case, I need to know how the
> > reducers are sorted. Then I could simply run the second stage.
> >
> > Thank you,
> > Mark
> >
> > On Wed, Oct 28, 2009 at 1:07 PM, brien colwell <[email protected]> wrote:
> >
> >> Another approach is to initialize each map task with an ID (using
> >> JavaSpaces, something like Zookeeper, or some aspect of the input data)
> >> and then pack that with a map-local counter into a global ID. This makes
> >> assumptions like the number of map tasks being less than 2^10 and the
> >> number of records per mapper being less than 2^53. The packed global IDs
> >> are consecutive per map task. If globally consecutive IDs are needed, a
> >> second stage can create a histogram of map task ID -> number of records
> >> and use it to transform the global IDs to globally consecutive ones.
> >>
> >> Mark Kerzner wrote:
> >>
> >>> Michael,
> >>>
> >>> environment variables are available in Java, but the environment itself
> >>> is not shared between instances.
> >>> I read your code - you are solving exactly the same problem I am
> >>> interested in - but I did not see how it works in a distributed
> >>> environment.
> >>>
> >>> By the way, it occurs to me that JavaSpaces, which is a different
> >>> approach to distributed computing, trumped by Hadoop, could be used
> >>> here! Just run one instance with GigaSpaces at all times, and you've
> >>> got your self-increment for any number of jobs. It is perfect for
> >>> concurrent processing and very fast.
> >>>
> >>> Thank you,
> >>> Mark
> >>>
> >>> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[email protected]> wrote:
> >>>
> >>>> I posted an approach to this using streaming, but if the environment
> >>>> variables are available in the standard Java interface, this may work
> >>>> for you.
> >>>>
> >>>> http://www.mail-archive.com/[email protected]/msg09079.html
> >>>>
> >>>> You'll have to be able to tolerate some small gaps in the ids.
> >>>>
> >>>> Michael
> >>>>
> >>>> Mark Kerzner wrote:
> >>>>
> >>>>> Aaron, although your notes are not a ready solution, they are a
> >>>>> great help.
> >>>>>
> >>>>> Thank you,
> >>>>> Mark
> >>>>>
> >>>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[email protected]> wrote:
> >>>>>
> >>>>>> There is no in-MapReduce mechanism for cross-task synchronization.
> >>>>>> You'll need to use something like Zookeeper for this, or another
> >>>>>> external database. Note that this will greatly complicate your life.
> >>>>>>
> >>>>>> If I were you, I'd try to either redesign my pipeline to eliminate
> >>>>>> this need, or maybe get really clever. For example, do your numbers
> >>>>>> need to be sequential, or just unique?
> >>>>>> If the latter, then take the byte offset into the reducer's current
> >>>>>> output file and combine that with the reducer id (e.g.,
> >>>>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that
> >>>>>> they're all building unique sequences. If the former... rethink your
> >>>>>> pipeline? :)
> >>>>>>
> >>>>>> - Aaron
> >>>>>>
> >>>>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I need to number all output records consecutively, like 1,2,3...
> >>>>>>>
> >>>>>>> This is no problem with one reducer: make recordId an instance
> >>>>>>> variable in the Reducer class and set conf.setNumReduceTasks(1).
> >>>>>>>
> >>>>>>> However, an architectural decision forced by processing needs means
> >>>>>>> the reducer becomes a bottleneck. Can I have a global variable for
> >>>>>>> all reducers, which would give each one the next consecutive
> >>>>>>> recordId? In the database scenario, this would be the unique
> >>>>>>> autokey. How to do it in MapReduce?
> >>>>>>>
> >>>>>>> Thank you
>
> My two cents here.
>
> It seems like what you want is an atomic auto-id, in order, with no
> gaps. In particular, the "no gaps" and "in order" requirements really
> constrain you, because now you can't use the map/reduce IDs, hostname,
> etc. to generate distinct keys.
>
> If you have the "in order" and "no gaps" requirements, the best
> approach may just be to use one mapper/reducer. Take a step back and
> look at the big picture. No matter how much software you add, you are
> asking for a single/serialized/locking "IDFACTORY".
> Doing this multinode Zookeeper locking "IDFACTORY" is probably going
> to be more work and slower than just doing it in a single-threaded
> mapper or reducer.
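For reference, Brien's packing scheme quoted above (task ID below 2^10, per-task counter below 2^53, which together fit in the 63 value bits of a signed long) could be sketched like this; the class name and bit layout are illustrative, and 2^53 is plausibly chosen because it is also the largest integer a double can represent exactly:

```java
public class PackedId {
    static final int TASK_BITS = 10;                 // up to 2^10 map tasks
    static final long COUNTER_MASK = (1L << 53) - 1; // up to 2^53 records per task

    // Pack a task id and a task-local counter into one 63-bit id.
    // IDs are unique across tasks and consecutive within a task.
    static long pack(long taskId, long localCounter) {
        if (taskId >= (1L << TASK_BITS) || localCounter > COUNTER_MASK)
            throw new IllegalArgumentException("id out of range");
        return (taskId << 53) | localCounter;
    }

    static long taskOf(long id)    { return id >>> 53; }
    static long counterOf(long id) { return id & COUNTER_MASK; }

    public static void main(String[] args) {
        long id = pack(7, 1000000L);
        System.out.println(taskOf(id));    // 7
        System.out.println(counterOf(id)); // 1000000
    }
}
```

The second-stage histogram job would then subtract each task's base (`taskId << 53`) and add the running total of records from lower-numbered tasks to make the IDs globally consecutive.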

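Aaron's unique-but-not-consecutive alternative (<current-byte-offset><zero-padded-reducer-id>) might look like the sketch below. The method name is mine, and the 4-digit padding width is an assumption; it just has to be wide enough to cover the job's reducer count so the suffix is fixed-width and the keys decode unambiguously:

```java
public class UniqueKey {
    // Combine the reducer's current output byte offset with a
    // zero-padded reducer id. Offsets grow within each reducer, so each
    // reducer emits an increasing sequence, and the fixed-width suffix
    // keeps sequences from different reducers disjoint.
    static String uniqueId(long byteOffset, int reducerId) {
        return String.format("%d%04d", byteOffset, reducerId);
    }

    public static void main(String[] args) {
        System.out.println(uniqueId(12345L, 7)); // 123450007
    }
}
```

In a real reducer, the byte offset would come from the output stream's position; here it is passed in to keep the sketch self-contained.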