Brien,
- I am on EC2; what would be the advantage of using Zookeeper over JavaSpaces? Either would have to be maintained by me, as neither is provided on EC2 directly.
- "pack that with a map-local counter into a global ID" - you mean, just take the global counter and make the local instance counter equal to it?
- 2^53 is quite sufficient for my purposes, but where is that number coming from?
- Looking at your last point, I saw what I had previously missed: I need the numbers consecutive within each reducer, and then consecutive between reducers. I assume that the reducers are ordered; for example, if my records are sorted 1, 2, ..., 6, then one reducer would get maps 1, 2, 3, and the other maps 4, 5, 6. If that's the case, I need to know how the reducers are ordered, and then I could simply run the second stage.

Thank you,
Mark

On Wed, Oct 28, 2009 at 1:07 PM, brien colwell <[email protected]> wrote:

> Another approach is to initialize each map task with an ID (using
> JavaSpaces, something like Zookeeper, or some aspect of the input data) and
> then pack that with a map-local counter into a global ID. This makes
> assumptions such as: the number of map tasks is less than 2^10, and the
> number of records per mapper is less than 2^53. The packed global IDs are
> consecutive per map task. If globally consecutive IDs are needed, a second
> stage can create a histogram of map task ID -> number of records and use it
> to transform the global IDs to globally consecutive ones.
>
> Mark Kerzner wrote:
>
>> Michael,
>>
>> Environment variables are available in Java, but the environment itself
>> is not shared between instances. I read your code - you are solving
>> exactly the same problem I am interested in - but I did not see how it
>> works in a distributed environment.
>>
>> By the way, it occurs to me that JavaSpaces, a different approach to
>> distributed computing, trumped by Hadoop, could be used here!
>> Just run one instance with GigaSpaces at all times, and you've got your
>> self-increment for any number of jobs. It is perfect for concurrent
>> processing and very fast.
>>
>> Thank you,
>> Mark
>>
>> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <[email protected]> wrote:
>>
>>> I posted an approach to this using streaming, but if the environment
>>> variables are available in the standard Java interface, this may work
>>> for you.
>>>
>>> http://www.mail-archive.com/[email protected]/msg09079.html
>>>
>>> You'll have to be able to tolerate some small gaps in the IDs.
>>>
>>> Michael
>>>
>>> Mark Kerzner wrote:
>>>
>>>> Aaron, although your notes are not a ready solution, they are a great
>>>> help.
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[email protected]> wrote:
>>>>
>>>>> There is no in-MapReduce mechanism for cross-task synchronization.
>>>>> You'll need to use something like Zookeeper for this, or another
>>>>> external database. Note that this will greatly complicate your life.
>>>>>
>>>>> If I were you, I'd try to either redesign my pipeline to eliminate
>>>>> this need, or maybe get really clever. For example, do your numbers
>>>>> need to be sequential, or just unique?
>>>>>
>>>>> If the latter, then take the byte offset into the reducer's current
>>>>> output file and combine it with the reducer ID (e.g.,
>>>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that
>>>>> they're all building unique sequences. If the former... rethink your
>>>>> pipeline? :)
>>>>>
>>>>> - Aaron
>>>>>
>>>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I need to number all output records consecutively, like 1, 2, 3...
>>>>>> This is no problem with one reducer: make recordId an instance variable
>>>>>> in the Reducer class and set conf.setNumReduceTasks(1).
>>>>>>
>>>>>> However, that is an architectural decision forced on me by the
>>>>>> processing, and the single reducer becomes a bottleneck. Can I have a
>>>>>> global variable for all reducers, which would give each one the next
>>>>>> consecutive recordId? In the database scenario, this would be the
>>>>>> unique autokey. How to do it in MapReduce?
>>>>>>
>>>>>> Thank you
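For reference, Brien's packing scheme above can be sketched in plain Java. This is a sketch under the stated assumptions (fewer than 2^10 map tasks, fewer than 2^53 records per mapper, both packed into one 64-bit long); the class and method names are mine. As for where 2^53 comes from: it happens to be the largest integer a double represents exactly, which may be the reason, though the thread does not say.

```java
public class PackedIds {
    static final int COUNTER_BITS = 53; // assumes < 2^53 records per mapper

    // Pack a map-task ID (< 2^10) and a map-local counter (< 2^53) into a
    // single 64-bit global ID. The resulting IDs are consecutive per map task.
    static long pack(long taskId, long localCounter) {
        return (taskId << COUNTER_BITS) | localCounter;
    }

    static long taskOf(long globalId)    { return globalId >>> COUNTER_BITS; }
    static long counterOf(long globalId) { return globalId & ((1L << COUNTER_BITS) - 1); }

    public static void main(String[] args) {
        long id = pack(3, 7);
        System.out.println(taskOf(id) + " / " + counterOf(id)); // 3 / 7
    }
}
```

Each mapper would be initialized with its `taskId` (from Zookeeper, JavaSpaces, or the input split) and increment `localCounter` per record.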
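The second stage Brien mentions (turning per-task-consecutive IDs into globally consecutive ones) reduces to a prefix sum over the histogram of per-task record counts. A minimal sketch, assuming the histogram fits in memory; the names are mine, not from any Hadoop API:

```java
import java.util.Map;
import java.util.TreeMap;

public class ConsecutiveIds {
    // Given the histogram of map-task ID -> number of records, compute each
    // task's starting offset in the global numbering: a running prefix sum,
    // iterating tasks in ID order.
    static Map<Long, Long> startOffsets(TreeMap<Long, Long> countsByTask) {
        Map<Long, Long> starts = new TreeMap<>();
        long running = 0;
        for (Map.Entry<Long, Long> e : countsByTask.entrySet()) {
            starts.put(e.getKey(), running);
            running += e.getValue();
        }
        return starts;
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> counts = new TreeMap<>();
        counts.put(0L, 3L);
        counts.put(1L, 2L);
        counts.put(2L, 5L);
        System.out.println(startOffsets(counts)); // {0=0, 1=3, 2=5}
    }
}
```

A record with local counter c from task t then gets the globally consecutive ID `startOffsets.get(t) + c`.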
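Aaron's unique-but-not-sequential alternative can also be sketched: concatenate the reducer's current output byte offset with a zero-padded reducer ID. Offsets never repeat within one output file and reducer IDs never repeat across files, so the pair is globally unique. The helper below is hypothetical; in a real job the offset would come from the output stream and the reducer ID from the task ID.

```java
public class UniqueIds {
    // Build <current-byte-offset><zero-padded-reducer-id>. Padding the
    // reducer ID to a fixed width keeps the concatenation unambiguous.
    static String uniqueId(long byteOffset, int reducerId, int numReducers) {
        int width = String.valueOf(numReducers - 1).length();
        return byteOffset + String.format("%0" + width + "d", reducerId);
    }

    public static void main(String[] args) {
        System.out.println(uniqueId(1234, 7, 100)); // 123407
    }
}
```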
