Hi Alex,

Thanks so much for the reply. As of now I don't have any issue with the two jobs. I was just making sure that I am not missing an obvious way of writing the program as one job. I will get back if I need to optimize performance based on a specific pattern of input.
Thank you all so much for helping me with this issue.

Neil

On Sat, Sep 11, 2010 at 5:56 AM, Alex Kozlov <[email protected]> wrote:

> Hi Neil,
>
> Uniques and top N, as well as percentiles, are inherently difficult to
> distribute/parallelize, since you have to have a global view of the
> dataset. You can optimize the computation given some assumptions about
> the input (the number of unique values, the prevalence of the most
> frequent value being larger than a certain threshold, etc.). There is no
> way to avoid two jobs in the general case.
>
> Can you specify your problem, and your assumptions if any, more precisely?
>
> --
> Alex Kozlov
> Solutions Architect
> Cloudera, Inc
> twitter: alexvk2009
>
> Hadoop World 2010, October 12, New York City - Register now:
> http://www.cloudera.com/company/press-center/hadoop-world-nyc/
>
> On Fri, Sep 10, 2010 at 5:14 PM, Neil Ghosh <[email protected]> wrote:
>
>> Thanks Aaron. I employed two jobs and solved the problem.
>>
>> I was just wondering whether there is any way it can be done in a single
>> job, so that disk/network I/O is lower and no temporary storage is
>> required between the first and second job.
>>
>> Neil
>>
>> On Sat, Sep 11, 2010 at 4:37 AM, Aaron Baff <[email protected]>
>> wrote:
>>
>>> I'm still fairly new at MapReduce, but here are my thoughts on the
>>> solution.
>>>
>>> Use the Item as the Key and the Count as the Value; in the Reducer, sum
>>> up all of the Counts and output the Item, sum(Count). To make it more
>>> efficient, use the same Reducer as the Combiner.
>>>
>>> Then do a 2nd Job where you map the Count as the Key and the Item as
>>> the Value, use 1 Reducer, and Identity Reduce it (i.e. don't do any
>>> reducing, just output the Count, Item).
>>>
>>> Aaron Baff | Developer | Telescope, Inc.
>>>
>>> email: [email protected] | office: 424 270 2913 | www.telescope.tv
>>>
>>> -----Original Message-----
>>> From: Neil Ghosh [mailto:[email protected]]
>>> Sent: Friday, September 10, 2010 3:51 PM
>>> To: James Seigel
>>> Cc: [email protected]
>>> Subject: Re: TOP N items
>>>
>>> Thanks James,
>>>
>>> This gives me only N results for sure, but not necessarily the top N.
>>>
>>> I have used the Item as Key and Count as Value as input to the reducer,
>>> and my reducing logic is to sum the count for a particular item.
>>>
>>> Now my output comes out grouped but not in order. Do I need to use a
>>> custom comparator?
>>>
>>> Thanks
>>> Neil
>>>
>>> On Sat, Sep 11, 2010 at 2:41 AM, James Seigel <[email protected]> wrote:
>>>
>>>> Welcome to the land of the fuzzy elephant!
>>>>
>>>> Of course there are many ways to do it. Here is one; it might not be
>>>> brilliant or the right way, but I am sure you will get more :)
>>>>
>>>> Use the identity mapper...
>>>>
>>>> job.setMapperClass(Mapper.class);
>>>>
>>>> then have one reducer...
>>>>
>>>> job.setNumReduceTasks(1);
>>>>
>>>> then have a reducer that has something like this around your reducing
>>>> code...
>>>> Counter counter = context.getCounter("ME", "total output records");
>>>> if (counter.getValue() < LIMIT) {
>>>>
>>>>     <do your reducey stuff here>
>>>>
>>>>     context.write(key, value);
>>>>     counter.increment(1);
>>>> }
>>>>
>>>> Cheers
>>>> James.
>>>>
>>>> On 2010-09-10, at 3:04 PM, Neil Ghosh wrote:
>>>>
>>>> Hello,
>>>>
>>>> I am new to Hadoop. Can anybody suggest an example or procedure for
>>>> outputting the TOP N items having the maximum total count, where the
>>>> input file has an (Item, count) pair on each line?
>>>>
>>>> Items can repeat.
>>>>
>>>> Thanks
>>>> Neil
>>>> http://neilghosh.com

--
Thanks and Regards
Neil
http://neilghosh.com
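[Editor's note] Aaron's two-job approach can be sketched in plain Java for readers who want to see the logic end to end: `sumCounts` plays the role of Job 1's reducer/combiner (summing counts per item), and `topN` plays Job 2 (ordering by count and keeping N). The descending sort in `topN` is the part a custom comparator would handle for you in Hadoop, which answers Neil's comparator question. This is a local in-memory sketch, not Hadoop API; the class and method names are made up for illustration.

```java
import java.util.*;
import java.util.stream.*;

public class TopNSketch {
    // Job 1 analogue: sum counts per item, as the reducer/combiner would.
    static Map<String, Long> sumCounts(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Long::sum);
        }
        return totals;
    }

    // Job 2 analogue: order by total count, descending, and keep the first N.
    // In Hadoop this descending order is what a custom sort comparator provides.
    static List<String> topN(Map<String, Long> totals, int n) {
        return totals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Items can repeat in the input, as in Neil's problem statement.
        List<Map.Entry<String, Long>> input = List.of(
                Map.entry("apple", 3L), Map.entry("banana", 2L),
                Map.entry("apple", 4L), Map.entry("cherry", 1L));
        Map<String, Long> totals = sumCounts(input); // apple=7, banana=2, cherry=1
        System.out.println(topN(totals, 2)); // prints [apple, banana]
    }
}
```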

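[Editor's note] A footnote on why James's counter-limit pattern gives "only N results for sure but not necessarily the top N", as Neil observed: a single reducer receives keys in key order, not count order, so capping output at LIMIT keeps the first N keys by sort order. A small plain-Java analogue of that reducer loop (the names and LIMIT value are made up for the sketch):

```java
import java.util.*;

public class FirstNSketch {
    static final int LIMIT = 2;

    // Analogue of James's reducer: a counter caps output at LIMIT records.
    // The SortedMap stands in for the reducer's key-ordered input.
    static List<String> firstN(SortedMap<String, Long> grouped) {
        List<String> out = new ArrayList<>();
        long counter = 0;                       // context.getCounter(...) analogue
        for (Map.Entry<String, Long> e : grouped.entrySet()) {
            if (counter < LIMIT) {              // if (counter.getValue() < LIMIT)
                out.add(e.getKey() + "=" + e.getValue()); // context.write(key, value)
                counter++;                      // counter.increment(1)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<String, Long> grouped = new TreeMap<>(
                Map.of("apple", 1L, "banana", 9L, "cherry", 5L));
        // Keeps the first two keys in key order (apple=1, banana=9),
        // not the two largest counts (banana=9, cherry=5).
        System.out.println(firstN(grouped)); // prints [apple=1, banana=9]
    }
}
```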