Assuming N is not too large, in the sense that your reducers can keep a tree
map of N elements in memory, you can have each reducer maintain the top N
elements in a tree map (or a priority queue, or a heap, whatever), with
counts as keys in the tree map. As a reducer progresses, it throws away
the items with smaller counts and always keeps the top N seen so far in the
tree map. At the end of the reduce phase, each reducer outputs the
elements in its tree map, which are the top N in that reducer. If your
job has only one reducer, then the output is your final answer. Otherwise,
you can use a second job (or any other tool) to merge/sort the R output files
(each holding the top N elements for the corresponding reducer). If N * R <<
input size, the second step should take much less time than the first one.
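The keep-only-the-top-N idea above can be sketched in plain Java (outside MapReduce); here a min-heap stands in for the tree map, and the class name, item names, and counts are all made up for illustration:

```java
import java.util.*;

public class TopN {
    /** Return the n items with the largest counts, biggest first. */
    public static List<String> topN(Map<String, Long> counts, int n) {
        // Min-heap ordered by count: the root is always the smallest
        // of the current "top n", so it is the one to throw away.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(
                (a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // discard the smallest count seen so far
            }
        }
        // Drain the heap; reverse so the largest count comes first.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("apple", 5L);
        counts.put("banana", 9L);
        counts.put("cherry", 2L);
        counts.put("date", 7L);
        System.out.println(topN(counts, 2)); // prints [banana, date]
    }
}
```

A Hadoop reducer would do the same, accumulating into the heap across reduce() calls and emitting its contents in cleanup().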



On Fri, Sep 10, 2010 at 5:37 PM, Neil Ghosh <[email protected]> wrote:

> Hi Alex,
>
> Thanks so much for the reply. As of now I don't have any issue with two
> jobs. I was just making sure that I am not missing any obvious way of
> writing the program in one job. I will get back if I need to optimize
> performance based on specific patterns of input.
>
> Thank you all so much for helping me with this issue.
> Neil
>
> On Sat, Sep 11, 2010 at 5:56 AM, Alex Kozlov <[email protected]> wrote:
>
> > Hi Neil,
> >
> > Uniques and Top N, as well as percentiles, are inherently difficult to
> > distribute/parallelize, since you have to have a global view of the
> > dataset. You can optimize the computation given some assumptions about
> > the input (the # of unique values, prevalence of the most frequent value
> > above a certain threshold, etc.).  There is no way to avoid two jobs in
> > the general case.
> >
> > Can you describe your problem, and any assumptions, more precisely?
> >
> > --
> > Alex Kozlov
> > Solutions Architect
> > Cloudera, Inc
> > twitter: alexvk2009
> >
> > Hadoop World 2010, October 12, New York City - Register now:
> > http://www.cloudera.com/company/press-center/hadoop-world-nyc/
> >
> > On Fri, Sep 10, 2010 at 5:14 PM, Neil Ghosh <[email protected]>
> wrote:
> >
> >> Thanks, Aaron. I employed two jobs and solved the problem.
> >>
> >> I was just wondering whether there is any way it can be done in a single
> >> job, so that disk/network I/O is lower and no temporary storage is
> >> required between the first and second jobs.
> >>
> >> Neil
> >>
> >> On Sat, Sep 11, 2010 at 4:37 AM, Aaron Baff <[email protected]>
> >> wrote:
> >>
> >> > I'm still fairly new at MapReduce, but here are my thoughts on a
> >> > solution.
> >> >
> >> > Use the Item as the key and the Count as the value; in the reducer,
> >> > sum up all of the Counts and output Item,sum(Count). To make it more
> >> > efficient, use the same Reducer as the Combiner.
> >> >
> >> > Then run a 2nd job where you map the Count as the key and the Item as
> >> > the value, use 1 reducer, and identity-reduce it (i.e. don't do any
> >> > reducing, just output Count,Item).
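As a rough sketch of that two-job plan in plain Java (outside MapReduce, with made-up data): phase 1 sums counts per item, phase 2 sorts by count. Note that Hadoop's shuffle sorts keys in ascending order by default, so a real second job would also need a decreasing comparator on the count key (or you read the output from the end):

```java
import java.util.*;
import java.util.stream.*;

public class TwoPhase {
    // Phase 1 (first job's reducer): sum counts per item.
    public static Map<String, Long> sumCounts(List<String[]> pairs) {
        Map<String, Long> totals = new HashMap<>();
        for (String[] p : pairs) {
            totals.merge(p[0], Long.parseLong(p[1]), Long::sum);
        }
        return totals;
    }

    // Phase 2 (second job): emit "count<TAB>item" lines ordered by count,
    // largest first -- what the single reducer would produce if the
    // shuffle sorted counts in decreasing order.
    public static List<String> sortByCount(Map<String, Long> totals) {
        return totals.entrySet().stream()
                .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
                .map(e -> e.getValue() + "\t" + e.getKey())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[]{"a", "2"},
                new String[]{"b", "9"},
                new String[]{"a", "3"});
        System.out.println(sortByCount(sumCounts(pairs)));
    }
}
```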
> >> >
> >> > Aaron Baff | Developer | Telescope, Inc.
> >> >
> >> > email:  [email protected] | office:  424 270 2913 |
> >> www.telescope.tv
> >> >
> >> >
> >> >
> >> > -----Original Message-----
> >> > From: Neil Ghosh [mailto:[email protected]]
> >> > Sent: Friday, September 10, 2010 3:51 PM
> >> > To: James Seigel
> >> > Cc: [email protected]
> >> > Subject: Re: TOP N items
> >> >
> >> > Thanks, James.
> >> >
> >> > This gives me only N results for sure, but not necessarily the top N.
> >> >
> >> > I have used the Item as key and Count as value as input to the
> >> > reducer, and my reducing logic is to sum the counts for a particular
> >> > item.
> >> >
> >> > Now my output comes out grouped, but not in order.
> >> >
> >> > Do I need to use a custom comparator?
> >> >
> >> > Thanks
> >> > Neil
> >> >
> >> > On Sat, Sep 11, 2010 at 2:41 AM, James Seigel <[email protected]> wrote:
> >> >
> >> > > Welcome to the land of the fuzzy elephant!
> >> > >
> >> > > Of course there are many ways to do it.  Here is one; it might not
> >> > > be brilliant or the right way, but I am sure you will get more :)
> >> > >
> >> > > Use the identity mapper...
> >> > >
> >> > >         job.setMapperClass(Mapper.class);
> >> > >
> >> > > then have one reducer....
> >> > >
> >> > >         job.setNumReduceTasks(1);
> >> > >
> >> > > then have a reducer that has something like this around your
> >> > > reducing code...
> >> > >
> >> > >         Counter counter = context.getCounter("ME", "total output records");
> >> > >         if (counter.getValue() < LIMIT) {
> >> > >             // <do your reducey stuff here>
> >> > >             context.write(key, value);
> >> > >             counter.increment(1);
> >> > >         }
> >> > >
> >> > >
> >> > > Cheers
> >> > > James.
> >> > >
> >> > >
> >> > >
> >> > > On 2010-09-10, at 3:04 PM, Neil Ghosh wrote:
> >> > >
> >> > > Hello,
> >> > >
> >> > > I am new to Hadoop. Can anybody suggest an example or procedure for
> >> > > outputting the top N items having the maximum total count, where the
> >> > > input file has an (Item, count) pair on each line?
> >> > >
> >> > > Items can repeat.
> >> > >
> >> > > Thanks
> >> > > Neil
> >> > > http://neilghosh.com
> >> > >
> >> > > --
> >> > > Thanks and Regards
> >> > > Neil
> >> > > http://neilghosh.com
> >> > >
> >> > >
> >> > >
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >>
> >
> >
>
>
>
