Assuming N is not too large, in the sense that your reducers can keep a tree map of N elements in memory, you can have each reducer maintain the top N elements in a TreeMap (or a priority queue, or a heap, whatever), with counts as keys. As the reducer progresses, it throws away the items with smaller counts and always keeps the top N seen so far in the tree map. At the end of the reduce phase, each reducer outputs the elements in its tree map, which are the top N for that reducer.

If your job has only one reducer, then that output is your final answer. Otherwise, you can use a second job (or any other tool) to merge/sort the R output files (each holding the top N elements for the corresponding reducer). If N * R << input size, the second step should take much less time than the first.
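The per-reducer bookkeeping described above can be sketched in plain Java. This is a minimal illustration, not Hadoop code: the class name, method name, and sample data are invented, and a min-heap of (item, count) pairs is used instead of a bare TreeMap so that equal counts do not collide as map keys.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopN {
    // Keep the N entries with the largest counts. The heap is a min-heap on
    // count, so its root is always the smallest of the current top N and can
    // be evicted whenever the heap grows past N elements.
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
            if (heap.size() > n) {
                heap.poll(); // throw away the item with the smallest count
            }
        }
        // Emit in descending order of count, as a reducer would at cleanup.
        List<Map.Entry<String, Long>> top = new ArrayList<>(heap);
        top.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return top;
    }

    public static void main(String[] args) {
        System.out.println(topN(Map.of("a", 5L, "b", 9L, "c", 1L, "d", 7L), 2));
        // → [b=9, d=7]
    }
}
```

In an actual reducer this loop would run across the reduce() calls, with the heap flushed in cleanup(); each of the R reducers then writes its own top N, which the second step merges.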
On Fri, Sep 10, 2010 at 5:37 PM, Neil Ghosh <[email protected]> wrote:
> Hi Alex,
>
> Thanks so much for the reply. As of now I don't have any issue with two
> jobs. I was just making sure that I am not missing any obvious way of
> writing the program in one job. I will get back if I need to optimize
> performance based on specific patterns of input.
>
> Thank you all so much for helping me with this issue.
> Neil
>
> On Sat, Sep 11, 2010 at 5:56 AM, Alex Kozlov <[email protected]> wrote:
>
> > Hi Neil,
> >
> > Uniques and top N, as well as percentiles, are inherently difficult to
> > distribute/parallelize, since you have to have a global view of the
> > dataset. You can optimize the computation given some assumptions about
> > the input (the number of unique values, the prevalence of the most
> > frequent value above a certain threshold, etc.). There is no way to
> > avoid two jobs in the general case.
> >
> > Can you specify your problem more precisely, along with your
> > assumptions, if any?
> >
> > --
> > Alex Kozlov
> > Solutions Architect
> > Cloudera, Inc
> > twitter: alexvk2009
> >
> > Hadoop World 2010, October 12, New York City - Register now:
> > http://www.cloudera.com/company/press-center/hadoop-world-nyc/
> >
> > On Fri, Sep 10, 2010 at 5:14 PM, Neil Ghosh <[email protected]> wrote:
> >
> >> Thanks Aaron. I employed two jobs and solved the problem.
> >>
> >> I was just wondering whether there is any way it can be done in a
> >> single job, so that disk/network I/O is lower and no temporary
> >> storage is required between the first and second jobs.
> >>
> >> Neil
> >>
> >> On Sat, Sep 11, 2010 at 4:37 AM, Aaron Baff <[email protected]> wrote:
> >>
> >> > I'm still fairly new at MapReduce, but here are my thoughts on the
> >> > solution.
> >> >
> >> > Use the Item as the key and the Count as the value; in the reducer,
> >> > sum up all of the counts and output Item, sum(Count). To make it
> >> > more efficient, use the same reducer as the combiner.
> >> >
> >> > Then do a second job where you map the Count as the key and the
> >> > Item as the value, use one reducer, and identity-reduce it (i.e.
> >> > don't do any reducing, just output Count, Item).
> >> >
> >> > Aaron Baff | Developer | Telescope, Inc.
> >> >
> >> > email: [email protected] | office: 424 270 2913 | www.telescope.tv
> >> >
> >> > -----Original Message-----
> >> > From: Neil Ghosh [mailto:[email protected]]
> >> > Sent: Friday, September 10, 2010 3:51 PM
> >> > To: James Seigel
> >> > Cc: [email protected]
> >> > Subject: Re: TOP N items
> >> >
> >> > Thanks James,
> >> >
> >> > This gives me only N results for sure, but not necessarily the top N.
> >> >
> >> > I have used the Item as key and Count as value as input to the
> >> > reducer, and my reducing logic is to sum the counts for a
> >> > particular item.
> >> >
> >> > Now my output comes out grouped, but not in order.
> >> >
> >> > Do I need to use a custom comparator?
> >> >
> >> > Thanks
> >> > Neil
> >> >
> >> > On Sat, Sep 11, 2010 at 2:41 AM, James Seigel <[email protected]> wrote:
> >> >
> >> > > Welcome to the land of the fuzzy elephant!
> >> > >
> >> > > Of course there are many ways to do it.
> >> > > Here is one; it might not be brilliant or the right way, but I
> >> > > am sure you will get more :)
> >> > >
> >> > > Use the identity mapper...
> >> > >
> >> > > job.setMapperClass(Mapper.class);
> >> > >
> >> > > then have one reducer...
> >> > >
> >> > > job.setNumReduceTasks(1);
> >> > >
> >> > > then have a reducer that has something like this around your
> >> > > reducing code...
> >> > >
> >> > > Counter counter = context.getCounter("ME", "total output records");
> >> > > if (counter.getValue() < LIMIT) {
> >> > >     // do your reducey stuff here
> >> > >     context.write(key, value);
> >> > >     counter.increment(1);
> >> > > }
> >> > >
> >> > > Cheers
> >> > > James.
> >> > >
> >> > > On 2010-09-10, at 3:04 PM, Neil Ghosh wrote:
> >> > >
> >> > > Hello,
> >> > >
> >> > > I am new to Hadoop. Can anybody suggest an example or procedure
> >> > > for outputting the top N items having the maximum total count,
> >> > > where the input file has an (Item, Count) pair on each line?
> >> > >
> >> > > Items can repeat.
> >> > >
> >> > > Thanks
> >> > > Neil
> >> > > http://neilghosh.com
> >> > >
> >> > > --
> >> > > Thanks and Regards
> >> > > Neil
> >> > > http://neilghosh.com
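The two-job approach discussed in this thread amounts to: job one sums counts per item; job two inverts each (Item, sum) pair so the framework's sort-by-key does the ordering. Since Hadoop sorts keys ascending by default, a decreasing key comparator (or a final reverse) is what actually puts the largest counts first. A hypothetical plain-Java simulation of that second job follows; the class name, method name, and sample data are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SecondJob {
    // Simulate the second job: swap (item, count) to (count, item), sort by
    // count descending (the role a decreasing key comparator plays in the
    // shuffle), and keep the first n records.
    static List<String> topNItems(Map<String, Long> sums, int n) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>(sums.entrySet());
        pairs.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : pairs.subList(0, Math.min(n, pairs.size()))) {
            out.add(e.getValue() + "\t" + e.getKey()); // Count <TAB> Item
        }
        return out;
    }

    public static void main(String[] args) {
        // Largest counts come first: y (8), then z (5).
        System.out.println(topNItems(Map.of("x", 3L, "y", 8L, "z", 5L), 2));
    }
}
```

With a single reducer, the reducer sees all keys in this order, so emitting only the first N records (as in the counter trick above) then yields the global top N.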
