Hi Neil,

Uniques and Top N, as well as percentiles, are inherently difficult to
distribute/parallelize, since you have to have a global view of the dataset.
You can optimize the computations given some assumptions about the input
(the # of unique values, the prevalence of the most frequent value being
larger than a certain threshold, etc.).  There is no way to avoid two jobs in
the general case.
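
As a concrete illustration of why a global view is needed, here is a plain-Java sketch (the item names and the two "partitions" are made up for the example, and this is ordinary in-memory code, not the Hadoop API): taking the top N within each partition of raw (item, count) lines can miss the true winner, so counts must be aggregated globally first.

```java
import java.util.*;
import java.util.stream.*;

public class GlobalViewDemo {
    // Sum counts per item across the whole dataset, then take the N largest totals.
    static List<String> topN(List<String[]> pairs, int n) {
        Map<String, Integer> totals = new HashMap<>();
        for (String[] p : pairs) {
            totals.merge(p[0], Integer.parseInt(p[1]), Integer::sum);
        }
        return totals.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two "partitions": "b" never leads either one, but wins globally (4 + 4 = 8).
        List<String[]> all = Arrays.asList(
                new String[]{"a", "5"}, new String[]{"b", "4"},   // partition 1: local top-1 is "a"
                new String[]{"c", "5"}, new String[]{"b", "4"});  // partition 2: local top-1 is "c"
        System.out.println(topN(all, 1)); // [b]
    }
}
```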

Can you specify your problem, and your assumptions if any, more precisely?

-- 
Alex Kozlov
Solutions Architect
Cloudera, Inc
twitter: alexvk2009

Hadoop World 2010, October 12, New York City - Register now:
http://www.cloudera.com/company/press-center/hadoop-world-nyc/

On Fri, Sep 10, 2010 at 5:14 PM, Neil Ghosh <[email protected]> wrote:

> Thanks Aaron. I employed two Jobs and solved the problem.
>
> I was just wondering whether it can be done in a single job, so that
> disk/network I/O is lower and no temporary storage is required between the
> first and second jobs.
>
> Neil
>
> On Sat, Sep 11, 2010 at 4:37 AM, Aaron Baff <[email protected]>
> wrote:
>
> > I'm still fairly new at MapReduce, but here are my thoughts on a solution.
> >
> > Use the Item as the Key and the Count as the Value; in the Reducer, sum up
> > all of the Counts and output the Item,sum(Count). To make it more
> > efficient, use the same Reducer as the Combiner.
> >
> > Then do a 2nd Job where you map the Count as the Key and the Item as the
> > Value, use 1 Reducer, and Identity Reduce it (i.e. don't do any reducing,
> > just output the Count,Item).
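
Aaron's second job leans on the shuffle's key ordering, and Hadoop sorts keys ascending by default, so the largest counts would come out last; a reversed comparator on the count key puts them first. A minimal plain-Java sketch of that ordering (illustrative in-memory pairs, not the Writable/comparator classes a real job would use):

```java
import java.util.*;

public class CountKeySort {
    public static void main(String[] args) {
        // Simulate the 2nd job's (count, item) pairs arriving at the single reducer.
        List<Map.Entry<Integer, String>> pairs = new ArrayList<>(Arrays.asList(
                new AbstractMap.SimpleEntry<>(3, "apple"),
                new AbstractMap.SimpleEntry<>(7, "banana"),
                new AbstractMap.SimpleEntry<>(5, "cherry")));
        // Default key order is ascending; reversing it yields largest count first.
        pairs.sort(Map.Entry.<Integer, String>comparingByKey().reversed());
        System.out.println(pairs); // [7=banana, 5=cherry, 3=apple]
    }
}
```

With the keys in descending order, emitting only the first N records from the single reducer yields the top N.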
> >
> > Aaron Baff | Developer | Telescope, Inc.
> >
> > email:  [email protected] | office:  424 270 2913 | www.telescope.tv
> >
> >
> >
> >
> > -----Original Message-----
> > From: Neil Ghosh [mailto:[email protected]]
> > Sent: Friday, September 10, 2010 3:51 PM
> > To: James Seigel
> > Cc: [email protected]
> > Subject: Re: TOP N items
> >
> > Thanks James,
> >
> > This gives me only N results, but not necessarily the top N.
> >
> > I have used the Item as the Key and the Count as the Value as input to the
> > reducer, and my reducing logic is to sum the counts for a particular item.
> >
> > Now my output comes out grouped but not in order.
> >
> > Do I need to use a custom comparator?
> >
> > Thanks
> > Neil
> >
> > On Sat, Sep 11, 2010 at 2:41 AM, James Seigel <[email protected]> wrote:
> >
> > > Welcome to the land of the fuzzy elephant!
> > >
> > > Of course there are many ways to do it.  Here is one; it might not be
> > > brilliant or the right way, but I am sure you will get more :)
> > >
> > > Use the identity mapper...
> > >
> > >         job.setMapperClass(Mapper.class);
> > >
> > > then have one reducer....
> > >
> > >         job.setNumReduceTasks(1);
> > >
> > > then have a reducer that has something like this around your reducing
> > > code...
> > >
> > >         Counter counter = context.getCounter("ME", "total output records");
> > >         if (counter.getValue() < LIMIT) {
> > >
> > >             <do your reducey stuff here>
> > >
> > >             context.write(key, value);
> > >             counter.increment(1);
> > >         }
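
The counter-gated write above can be sketched outside Hadoop as a plain-Java guard (LIMIT, the counter, and the output list here are stand-ins for the real job's Context, Counter, and output; the class name is made up for the example):

```java
import java.util.*;

public class TopNGate {
    static final int LIMIT = 3;            // top N to keep
    long counter = 0;                      // stands in for context.getCounter(...)
    List<String> out = new ArrayList<>();  // stands in for context.write(...)

    // Called once per key, like a reduce() that has already summed the counts.
    void reduce(String key, long summedCount) {
        if (counter < LIMIT) {
            out.add(key + "\t" + summedCount);
            counter++;
        }
        // records beyond LIMIT are silently dropped
    }
}
```

Note this only yields the true top N if the keys reach the single reducer already sorted by count in descending order, which is why the sort-order question matters.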
> > >
> > >
> > > Cheers
> > > James.
> > >
> > >
> > >
> > > On 2010-09-10, at 3:04 PM, Neil Ghosh wrote:
> > >
> > > Hello,
> > >
> > > I am new to Hadoop. Can anybody suggest an example or procedure for
> > > outputting the TOP N items with the maximum total count, where the input
> > > file has an (Item, count) pair on each line?
> > >
> > > Items can repeat.
> > >
> > > Thanks
> > > Neil
> > > http://neilghosh.com
> > >
> > > --
> > > Thanks and Regards
> > > Neil
> > > http://neilghosh.com
> > >
> > >
> > >
> >
> >
> > --
> > Thanks and Regards
> > Neil
> > http://neilghosh.com
> >
>
>
>
> --
> Thanks and Regards
> Neil
> http://neilghosh.com
>
