Hi Alex,

Thanks so much for the reply. As of now I don't have any issue with the two jobs. I was just making sure that I am not missing an obvious way of writing the program as one job. I will get back if I need to optimize performance based on a specific pattern of input.
Thank you all so much for helping me with this issue.

Neil

On Sat, Sep 11, 2010 at 5:56 AM, Alex Kozlov <[email protected]> wrote:

> Hi Neil,
>
> Uniques and top N, as well as percentiles, are inherently difficult to
> distribute/parallelize, since you have to have a global view of the
> dataset. You can optimize the computation given some assumptions about
> the input (the number of unique values, the prevalence of the most
> frequent value being larger than a certain threshold, etc.). There is no
> way to avoid two jobs in the general case.
>
> Can you specify your problem, and your assumptions if any, more precisely?
>
> --
> Alex Kozlov
> Solutions Architect
> Cloudera, Inc
> twitter: alexvk2009
>
> Hadoop World 2010, October 12, New York City - Register now:
> http://www.cloudera.com/company/press-center/hadoop-world-nyc/
>
> On Fri, Sep 10, 2010 at 5:14 PM, Neil Ghosh <[email protected]> wrote:
>
>> Thanks Aaron. I employed two jobs and solved the problem.
>>
>> I was just wondering whether there is any way it can be done in a single
>> job, so that disk/network I/O is lower and no temporary storage is
>> required between the first and second job.
>>
>> Neil
>>
>> On Sat, Sep 11, 2010 at 4:37 AM, Aaron Baff <[email protected]>
>> wrote:
>>
>>> I'm still fairly new at MapReduce, but here are my thoughts on the
>>> solution.
>>>
>>> Use the Item as the Key and the Count as the Value; in the Reducer, sum
>>> up all of the Counts and output the Item, sum(Count). To make it more
>>> efficient, use the same Reducer as the Combiner.
>>>
>>> Then do a 2nd Job where you map the Count as the Key and the Item as
>>> the Value, use 1 Reducer, and Identity Reduce it (i.e. don't do any
>>> reducing, just output the Count, Item).
>>>
>>> Aaron Baff | Developer | Telescope, Inc.
>>>
>>> email: [email protected] | office: 424 270 2913 | www.telescope.tv
>>>
>>> -----Original Message-----
>>> From: Neil Ghosh [mailto:[email protected]]
>>> Sent: Friday, September 10, 2010 3:51 PM
>>> To: James Seigel
>>> Cc: [email protected]
>>> Subject: Re: TOP N items
>>>
>>> Thanks James,
>>>
>>> This gives me only N results for sure, but not necessarily the top N.
>>>
>>> I have used the Item as Key and Count as Value as input to the reducer,
>>> and my reducing logic is to sum the count for a particular item.
>>>
>>> Now my output comes out grouped but not in order. Do I need to use a
>>> custom comparator?
>>>
>>> Thanks
>>> Neil
>>>
>>> On Sat, Sep 11, 2010 at 2:41 AM, James Seigel <[email protected]> wrote:
>>>
>>>> Welcome to the land of the fuzzy elephant!
>>>>
>>>> Of course there are many ways to do it. Here is one; it might not be
>>>> brilliant or the right way, but I am sure you will get more :)
>>>>
>>>> Use the identity mapper...
>>>>
>>>> job.setMapperClass(Mapper.class);
>>>>
>>>> then have one reducer...
>>>>
>>>> job.setNumReduceTasks(1);
>>>>
>>>> then have a reducer that has something like this around your reducing
>>>> code...
>>>> Counter counter = context.getCounter("ME", "total output records");
>>>> if (counter.getValue() < LIMIT) {
>>>>
>>>>     <do your reducey stuff here>
>>>>
>>>>     context.write(key, value);
>>>>     counter.increment(1);
>>>> }
>>>>
>>>> Cheers
>>>> James.
>>>>
>>>> On 2010-09-10, at 3:04 PM, Neil Ghosh wrote:
>>>>
>>>> Hello,
>>>>
>>>> I am new to Hadoop. Can anybody suggest an example or procedure for
>>>> outputting the TOP N items having the maximum total count, where the
>>>> input file has an (Item, count) pair on each line?
>>>>
>>>> Items can repeat.
>>>>
>>>> Thanks
>>>> Neil
>>>> http://neilghosh.com

--
Thanks and Regards
Neil
http://neilghosh.com
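[Editor's note] Aaron's two-job approach can be sketched in plain Java for readers who want to see the logic end to end: `sumCounts` plays the role of Job 1's reducer/combiner (summing counts per item), and `topN` plays Job 2 (ordering by count and keeping N). The descending sort in `topN` is the part a custom comparator would handle for you in Hadoop, which answers Neil's comparator question. This is a local in-memory sketch, not Hadoop API; the class and method names are made up for illustration.

```java
import java.util.*;
import java.util.stream.*;

public class TopNSketch {
    // Job 1 analogue: sum counts per item, as the reducer/combiner would.
    static Map<String, Long> sumCounts(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Long::sum);
        }
        return totals;
    }

    // Job 2 analogue: order by total count, descending, and keep the first N.
    // In Hadoop this descending order is what a custom sort comparator provides.
    static List<String> topN(Map<String, Long> totals, int n) {
        return totals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Items can repeat in the input, as in Neil's problem statement.
        List<Map.Entry<String, Long>> input = List.of(
                Map.entry("apple", 3L), Map.entry("banana", 2L),
                Map.entry("apple", 4L), Map.entry("cherry", 1L));
        Map<String, Long> totals = sumCounts(input); // apple=7, banana=2, cherry=1
        System.out.println(topN(totals, 2)); // prints [apple, banana]
    }
}
```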

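[Editor's note] A footnote on why James's counter-limit pattern gives "only N results for sure but not necessarily the top N", as Neil observed: a single reducer receives keys in key order, not count order, so capping output at LIMIT keeps the first N keys by sort order. A small plain-Java analogue of that reducer loop (the names and LIMIT value are made up for the sketch):

```java
import java.util.*;

public class FirstNSketch {
    static final int LIMIT = 2;

    // Analogue of James's reducer: a counter caps output at LIMIT records.
    // The SortedMap stands in for the reducer's key-ordered input.
    static List<String> firstN(SortedMap<String, Long> grouped) {
        List<String> out = new ArrayList<>();
        long counter = 0;                       // context.getCounter(...) analogue
        for (Map.Entry<String, Long> e : grouped.entrySet()) {
            if (counter < LIMIT) {              // if (counter.getValue() < LIMIT)
                out.add(e.getKey() + "=" + e.getValue()); // context.write(key, value)
                counter++;                      // counter.increment(1)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<String, Long> grouped = new TreeMap<>(
                Map.of("apple", 1L, "banana", 9L, "cherry", 5L));
        // Keeps the first two keys in key order (apple=1, banana=9),
        // not the two largest counts (banana=9, cherry=5).
        System.out.println(firstN(grouped)); // prints [apple=1, banana=9]
    }
}
```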