On Wed, Oct 28, 2009 at 2:20 PM, Mark Kerzner <[email protected]> wrote:

> Brien,
>
>    - I am on EC2. What would be the advantage of using Zookeeper over
>    JavaSpaces? Either would have to be maintained by me, as neither is
>    provided on EC2 directly.
>    - "Pack that with a map-local counter into a global ID" - you mean, just
>    take the global counter and make the local instance counter equal to it?
>    - 2^53 is quite sufficient for my purposes, but where does that number
>    come from?
>    - Looking at your last point, I see what I had previously missed: I need
>    numbers consecutive within each reducer, and then consecutive between
>    reducers. I assume that reducers are sorted. For example, if my records
>    are sorted 1,2,...,6, then one reducer would get maps 1,2,3, and the
>    other maps 4,5,6. If that's the case, I need to know how the reducers
>    are sorted. Then I could simply run the second stage.
>
> Thank you,
> Mark
>
> On Wed, Oct 28, 2009 at 1:07 PM, brien colwell <[email protected]> wrote:
>
>> Another approach is to initialize each map task with an ID (using
>> JavaSpaces, something like Zookeeper, or some aspect of the input data)
>> and then pack that with a map-local counter into a global ID. This
>> assumes that the number of map tasks is less than 2^10 and that the
>> number of records per mapper is less than 2^53. The packed global IDs
>> are consecutive per map task. If globally consecutive IDs are needed, a
>> second stage can build a histogram of map task ID -> number of records
>> and use it to transform the global IDs into globally consecutive ones.
>>
>> Mark Kerzner wrote:
>>
>>> Michael,
>>>
>>> Environment variables are available in Java, but the environment itself
>>> is not shared between instances. I read your code - you are solving
>>> exactly the same problem I am interested in - but I did not see how it
>>> works in a distributed environment.
>>>
>>> By the way, it occurs to me that JavaSpaces, a different approach to
>>> distributed computing that was trumped by Hadoop, could be used here!
>>> Just run one instance with GigaSpaces at all times, and you get your
>>> auto-increment for any number of jobs. It is perfect for concurrent
>>> processing and very fast.
>>>
>>> Thank you,
>>> Mark
>>>
>>> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[email protected]> wrote:
>>>
>>>> I posted an approach to this using streaming, but if the environment
>>>> variables are available in the standard Java interface, this may work
>>>> for you.
>>>>
>>>> http://www.mail-archive.com/[email protected]/msg09079.html
>>>>
>>>> You'll have to be able to tolerate some small gaps in the IDs.
>>>>
>>>> Michael
>>>>
>>>> Mark Kerzner wrote:
>>>>
>>>>> Aaron, although your notes are not a ready solution, they are a
>>>>> great help.
>>>>>
>>>>> Thank you,
>>>>> Mark
>>>>>
>>>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[email protected]> wrote:
>>>>>
>>>>>> There is no in-MapReduce mechanism for cross-task synchronization.
>>>>>> You'll need to use something like Zookeeper for this, or another
>>>>>> external database. Note that this will greatly complicate your life.
>>>>>>
>>>>>> If I were you, I'd try to either redesign my pipeline to eliminate
>>>>>> this need, or maybe get really clever. For example, do your numbers
>>>>>> need to be sequential, or just unique?
>>>>>>
>>>>>> If the latter, then take the byte offset into the reducer's current
>>>>>> output file and combine it with the reducer ID (e.g.,
>>>>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that
>>>>>> they're all building unique sequences. If the former... rethink your
>>>>>> pipeline?
>>>>>> :)
>>>>>>
>>>>>> - Aaron
>>>>>>
>>>>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I need to number all output records consecutively: 1, 2, 3, ...
>>>>>>>
>>>>>>> This is no problem with one reducer: make recordId an instance
>>>>>>> variable in the Reducer class and set conf.setNumReduceTasks(1).
>>>>>>>
>>>>>>> However, that is an architectural decision forced by a processing
>>>>>>> need, and the single reducer becomes a bottleneck. Can I have a
>>>>>>> global variable for all reducers, which would give each one the next
>>>>>>> consecutive recordId? In the database scenario, this would be the
>>>>>>> unique autokey. How do I do it in MapReduce?
>>>>>>>
>>>>>>> Thank you
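For concreteness, the two ID schemes discussed in the thread can be sketched in plain Java (outside Hadoop): Brien's task-ID/local-counter packing with the histogram second stage, and Aaron's byte-offset-plus-reducer-ID key. The class and method names are mine, and the 10/53 bit split and the 4-digit reducer-ID width are illustrative assumptions, not fixed choices.

```java
public class IdSchemes {
    // Brien's packing: 10 bits of map task ID (< 1024 tasks) and 53 bits
    // of task-local counter (< 2^53 records per task) packed into one
    // 63-bit long. The resulting IDs are consecutive per map task.
    private static final int COUNTER_BITS = 53;
    private static final long COUNTER_MASK = (1L << COUNTER_BITS) - 1;

    public static long pack(long taskId, long localCounter) {
        if (taskId < 0 || taskId >= (1L << 10)
                || localCounter < 0 || localCounter > COUNTER_MASK) {
            throw new IllegalArgumentException("taskId or counter out of range");
        }
        return (taskId << COUNTER_BITS) | localCounter;
    }

    public static long taskId(long packed) {
        return packed >>> COUNTER_BITS;
    }

    public static long localCounter(long packed) {
        return packed & COUNTER_MASK;
    }

    // Second stage: a prefix sum over the per-task record counts (the
    // "histogram") gives each task's starting offset; then
    // offsets[taskId] + localCounter is globally consecutive.
    public static long[] offsets(long[] countsByTaskId) {
        long[] off = new long[countsByTaskId.length];
        long running = 0;
        for (int i = 0; i < off.length; i++) {
            off[i] = running;
            running += countsByTaskId[i];
        }
        return off;
    }

    // Aaron's unique-but-not-consecutive key: the reducer's current output
    // byte offset with a zero-padded reducer ID appended. Four digits
    // (up to 10,000 reducers) is an arbitrary width chosen here.
    public static String uniqueKey(long byteOffset, int reducerId) {
        return String.format("%d%04d", byteOffset, reducerId);
    }
}
```

For example, with per-task record counts {3, 2, 4} the offsets are {0, 3, 5}, so task 1's local record 0 gets global ID 3; and uniqueKey(1024, 7) yields "10240007". Note the byte-offset keys are unique but neither consecutive nor dense.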
My two cents here. It seems like what you want is an atomic auto-ID, in order, with no gaps. The "in order" and "no gaps" requirements really constrain you, because they rule out using the map/reduce task IDs, hostnames, etc. to generate distinct keys. Given those two requirements, the best approach may just be to use a single mapper/reducer. Take a step back and look at the big picture: no matter how much software you add, you are asking for a single, serialized, locking ID factory. Building that ID factory as a multi-node Zookeeper-locked service is probably more work, and slower, than just doing it in a single-threaded mapper or reducer.
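To make that concrete, here is a minimal sketch (the class name is mine) of what the single-threaded ID factory amounts to: with setNumReduceTasks(1), a plain instance counter in the reducer hands out gap-free, in-order IDs with no coordination at all.

```java
// Minimal single-threaded ID factory: with one reducer, a plain
// instance counter trivially guarantees in-order, gap-free IDs.
public class IdFactory {
    private long next = 1; // first ID to hand out

    public long nextId() {
        return next++;
    }
}
```

The reducer would hold one IdFactory as an instance field and call nextId() once per output record; any multi-node variant has to serialize exactly this increment anyway, just with locking overhead on top.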
