Oh, I see. Very smart. In my case, I need consecutive numbers with no gaps, and I need them in the order in which Hadoop sorted the maps, so I don't see how I could apply this approach. But thank you - it was a great discussion, which helped me consider all the issues around this, and which brought up the idea of JavaSpaces - I am going to check that one out.
Mark

On Wed, Oct 28, 2009 at 12:57 PM, Michael Klatt <[email protected]> wrote:

> Hi Mark,
>
> Each mapper (or reducer) has an environment variable "mapred_map_tasks"
> (or "mapred_reduce_tasks") which describes how many tasks the map or
> reduce job was split into. It also has a variable "mapred_task_id" which
> contains a unique identifier for the task. Using these two together, each
> task is able to generate a sequence of numbers which won't collide with
> other mappers (or reducers).
>
> For example, if a job has 20 (map or reduce) tasks, then task #1 will
> generate the sequence (assuming the first id is zero):
>
> 0, 20, 40, 60, 80, 100, 120...
>
> and task #2 will generate
>
> 1, 21, 41, 61, 81, 101, 121...
>
> and so on. This approach works in both the mapper OR the reducer... you
> just have to look at different variables.
>
> Michael
>
>
> Mark Kerzner wrote:
>
>> Michael,
>>
>> Environment variables are available in Java, but the environment itself
>> is not shared between instances. I read your code - you are solving
>> exactly the same problem I am interested in - but I did not see how it
>> works in a distributed environment.
>>
>> By the way, it occurs to me that JavaSpaces, a different approach to
>> distributed computing that was trumped by Hadoop, could be used here!
>> Just run one instance with GigaSpaces at all times, and you've got your
>> self-increment for any number of jobs. It is perfect for concurrent
>> processing and very fast.
>>
>> Thank you,
>> Mark
>>
>> On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt
>> <[email protected]> wrote:
>>
>>> I posted an approach to this using streaming, but if the environment
>>> variables are available in the standard Java interface, this may work
>>> for you.
>>>
>>> http://www.mail-archive.com/[email protected]/msg09079.html
>>>
>>> You'll have to be able to tolerate some small gaps in the ids.
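The non-colliding sequence Michael describes can be sketched in plain Java. The class name and constructor arguments are illustrative only; in a real job the task index and task count would be parsed from the variables he names (e.g. "mapred_task_id" and "mapred_map_tasks"):

```java
// Each of n tasks owns a private arithmetic progression:
// task k emits k, k + n, k + 2n, ... so the streams never collide.
public class InterleavedIdGenerator {
    private final int taskId;    // this task's index, 0-based
    private final int numTasks;  // total map (or reduce) tasks in the job
    private long count = 0;

    public InterleavedIdGenerator(int taskId, int numTasks) {
        this.taskId = taskId;
        this.numTasks = numTasks;
    }

    /** Returns the next id in this task's progression. */
    public long nextId() {
        return taskId + (count++) * numTasks;
    }
}
```

With 20 tasks, task 0 yields 0, 20, 40, ... and task 1 yields 1, 21, 41, ..., matching the sequences quoted above.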
>>>
>>> Michael
>>>
>>>
>>> Mark Kerzner wrote:
>>>
>>>> Aaron, although your notes are not a ready solution, they are a great
>>>> help.
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <[email protected]>
>>>> wrote:
>>>>
>>>>> There is no in-MapReduce mechanism for cross-task synchronization.
>>>>> You'll need to use something like ZooKeeper for this, or another
>>>>> external database. Note that this will greatly complicate your life.
>>>>>
>>>>> If I were you, I'd try to either redesign my pipeline to eliminate
>>>>> this need, or maybe get really clever. For example, do your numbers
>>>>> need to be sequential, or just unique?
>>>>>
>>>>> If the latter, then take the byte offset into the reducer's current
>>>>> output file and combine that with the reducer id (e.g.,
>>>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that
>>>>> they're all building unique sequences. If the former... rethink your
>>>>> pipeline? :)
>>>>>
>>>>> - Aaron
>>>>>
>>>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I need to number all output records consecutively, like 1, 2, 3...
>>>>>>
>>>>>> This is no problem with one reducer: make recordId an instance
>>>>>> variable in the Reducer class and set conf.setNumReduceTasks(1).
>>>>>>
>>>>>> However, this is an architectural decision forced by processing
>>>>>> needs, and the single reducer becomes a bottleneck. Can I have a
>>>>>> global variable for all reducers, which would give each the next
>>>>>> consecutive recordId? In the database scenario, this would be the
>>>>>> unique autokey. How to do it in MapReduce?
>>>>>>
>>>>>> Thank you
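Aaron's uniqueness trick from the thread above (byte offset into the reducer's output file concatenated with a zero-padded reducer id) can be sketched like this; the class name and pad width are illustrative, and the byte offset would come from the reducer's own output position:

```java
// Byte offsets are unique within one reducer's output file, and the
// zero-padded reducer-id suffix separates reducers, so the concatenated
// value is globally unique - though not consecutive.
public class OffsetIdBuilder {
    private final int reducerId;
    private final int padWidth;  // wide enough for the largest reducer id

    public OffsetIdBuilder(int reducerId, int padWidth) {
        this.reducerId = reducerId;
        this.padWidth = padWidth;
    }

    /** Builds <current-byte-offset><zero-padded-reducer-id>. */
    public String buildId(long byteOffset) {
        return byteOffset + String.format("%0" + padWidth + "d", reducerId);
    }
}
```

As Aaron notes, this only answers the "just unique" case; it does not produce the gap-free consecutive sequence the original question asks for.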
