Hi Neil,

Uniques and Top N, as well as percentiles, are inherently difficult to distribute/parallelize since you need a global view of the dataset. You can optimize the computation given some assumptions about the input (the # of unique values, the prevalence of the most frequent value exceeding a certain threshold, etc.), but there is no way to avoid two jobs in the general case.
Can you specify your problem and assumptions, if any, more precisely?

--
Alex Kozlov
Solutions Architect
Cloudera, Inc
twitter: alexvk2009

Hadoop World 2010, October 12, New York City - Register now:
http://www.cloudera.com/company/press-center/hadoop-world-nyc/

On Fri, Sep 10, 2010 at 5:14 PM, Neil Ghosh <[email protected]> wrote:
> Thanks Aaron. I employed two jobs and solved the problem.
>
> I was just wondering whether it can be done in a single job, so that
> disk/network I/O is lower and no temporary storage is required between the
> first and second job.
>
> Neil
>
> On Sat, Sep 11, 2010 at 4:37 AM, Aaron Baff <[email protected]> wrote:
>
> > I'm still fairly new at MapReduce, but here are my thoughts on the solution.
> >
> > Use the Item as the Key and the Count as the Value; in the Reducer, sum up
> > all of the Counts and output Item,sum(Count). To make it more efficient,
> > use the same Reducer as the Combiner.
> >
> > Then do a 2nd Job where you map the Count as the Key and the Item as the
> > Value, use 1 Reducer, and identity-reduce it (i.e. don't do any reducing,
> > just output Count,Item).
> >
> > Aaron Baff | Developer | Telescope, Inc.
> >
> > email: [email protected] | office: 424 270 2913 | www.telescope.tv
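[Editor's note: the two-job pipeline Aaron describes above can be sketched in plain Java. This is an in-memory stand-in for the two MapReduce phases, not Hadoop code from the thread; class and variable names are illustrative.]

```java
import java.util.*;

public class TopNSketch {
    // "Job 1": sum the counts per item (what the reducer/combiner does).
    static Map<String, Long> sumCounts(List<String[]> lines) {
        Map<String, Long> totals = new HashMap<>();
        for (String[] pair : lines) {
            totals.merge(pair[0], Long.parseLong(pair[1]), Long::sum);
        }
        return totals;
    }

    // "Job 2": bring all rows to one place (the lone reducer), sort by
    // count descending, and keep only the first N.
    static List<Map.Entry<String, Long>> topN(Map<String, Long> totals, int n) {
        List<Map.Entry<String, Long>> rows = new ArrayList<>(totals.entrySet());
        rows.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return rows.subList(0, Math.min(n, rows.size()));
    }

    public static void main(String[] args) {
        List<String[]> lines = Arrays.asList(
            new String[]{"apple", "3"}, new String[]{"pear", "5"},
            new String[]{"apple", "4"}, new String[]{"plum", "2"});
        for (Map.Entry<String, Long> e : topN(sumCounts(lines), 2)) {
            System.out.println(e.getKey() + "," + e.getValue());
        }
        // apple totals 7 (3+4), so the top 2 are apple,7 then pear,5
    }
}
```

In a real cluster the second phase is the expensive part, since a single reducer sees every distinct item, which is exactly why Alex notes two jobs are unavoidable in the general case.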
> > -----Original Message-----
> > From: Neil Ghosh [mailto:[email protected]]
> > Sent: Friday, September 10, 2010 3:51 PM
> > To: James Seigel
> > Cc: [email protected]
> > Subject: Re: TOP N items
> >
> > Thanks James,
> >
> > This gives me exactly N results, but not necessarily the top N.
> >
> > I have used the Item as Key and Count as Value as input to the reducer,
> > and my reducing logic sums the counts for a particular item.
> >
> > Now my output comes out grouped, but not in order.
> >
> > Do I need to use a custom comparator?
> >
> > Thanks
> > Neil
> >
> > On Sat, Sep 11, 2010 at 2:41 AM, James Seigel <[email protected]> wrote:
> >
> > > Welcome to the land of the fuzzy elephant!
> > >
> > > Of course there are many ways to do it. Here is one; it might not be
> > > brilliant or the right way, but I am sure you will get more :)
> > >
> > > Use the identity mapper...
> > >
> > >     job.setMapperClass(Mapper.class);
> > >
> > > then have one reducer...
> > >
> > >     job.setNumReduceTasks(1);
> > >
> > > then have a reducer that has something like this around your reducing
> > > code...
> > >
> > >     Counter counter = context.getCounter("ME", "total output records");
> > >     if (counter.getValue() < LIMIT) {
> > >         // <do your reducey stuff here>
> > >         context.write(key, value);
> > >         counter.increment(1);
> > >     }
> > >
> > > Cheers
> > > James.
> > >
> > > On 2010-09-10, at 3:04 PM, Neil Ghosh wrote:
> > >
> > > Hello,
> > >
> > > I am new to Hadoop. Can anybody suggest an example or procedure for
> > > outputting the TOP N items with the maximum total count, where the input
> > > file has an (Item, count) pair on each line?
> > >
> > > Items can repeat.
> > > Thanks
> > > Neil
> > > http://neilghosh.com
> > >
> > > --
> > > Thanks and Regards
> > > Neil
> > > http://neilghosh.com
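[Editor's note: on Neil's comparator question earlier in the thread — yes. In the second job the shuffle sorts keys ascending, so to take the *top* N you either read the last N rows or install a descending key comparator (in Hadoop's newer API, via `Job.setSortComparatorClass`). The effect can be sketched in plain Java, with a reverse-ordered `TreeMap` standing in for the shuffle's key sort; names are illustrative, and note a real job can carry duplicate counts, which this map cannot.]

```java
import java.util.*;

public class DescendingSortSketch {
    // With keys sorted descending, the largest count arrives at the single
    // reducer first, so James's counter trick then yields the true top N.
    static List<String> firstN(Map<Integer, String> countToItem, int n) {
        TreeMap<Integer, String> sorted =
            new TreeMap<>(Collections.reverseOrder());  // descending key sort
        sorted.putAll(countToItem);
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> e : sorted.entrySet()) {
            if (out.size() >= n) break;                 // the counter/LIMIT check
            out.add(e.getValue() + "," + e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> counts = new HashMap<>();
        counts.put(7, "apple");
        counts.put(5, "pear");
        counts.put(2, "plum");
        System.out.println(firstN(counts, 2));  // [apple,7, pear,5]
    }
}
```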
