Maybe look at the pig source to see how it does it? Russell Jurney http://datasyndrome.com
On Feb 1, 2013, at 11:37 PM, praveenesh kumar <praveen...@gmail.com> wrote: > Thanks for that Russell. Unfortunately I can't use Pig. Need to write > my own MR job. I was wondering how its usually done in the best way > possible. > > Regards > Praveenesh > > On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney <russell.jur...@gmail.com> > wrote: >> Pig. Datafu. 7 lines of code. >> >> https://gist.github.com/4696443 >> https://github.com/linkedin/datafu >> >> >> On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar >> <praveen...@gmail.com>wrote: >> >>> Actually what I am trying to find to top n% of the whole data. >>> This n could be very large if my data is large. >>> >>> Assuming I have uniform rows of equal size and if the total data size >>> is 10 GB, using the above mentioned approach, if I have to take top >>> 10% of the whole data set, I need 10% of 10GB which could be rows >>> worth of 1 GB (roughly) in my mappers. >>> I think that would not be possible given my input splits are of >>> 64/128/512 MB (based on my block size) or am I making wrong >>> assumptions. I can increase the inputsplit size, but is there a better >>> way to find top n%. >>> >>> >>> My whole actual problem is to give ranks to some values and then find >>> out the top 10 ranks. >>> >>> I think this context can give more idea about the problem ? >>> >>> Regards >>> Praveenesh >>> >>> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ekirpic...@gmail.com> >>> wrote: >>>> Hi, >>>> >>>> Can you tell more about: >>>> * How big is N >>>> * How big is the input dataset >>>> * How many mappers you have >>>> * Do input splits correlate with the sorting criterion for top N? >>>> >>>> Depending on the answers, very different strategies will be optimal. >>>> >>>> >>>> >>>> On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <praveen...@gmail.com >>>> wrote: >>>> >>>>> I am looking for a better solution for this. >>>>> >>>>> 1 way to do this would be to find top N values from each mappers and >>>>> then find out the top N out of them in 1 reducer. I am afraid that >>>>> this won't work effectively if my N is larger than number of values in >>>>> my inputsplit (or mapper input). >>>>> >>>>> Otherway is to just sort all of them in 1 reducer and then do the cat of >>>>> top-N. >>>>> >>>>> Wondering if there is any better approach to do this ? >>>>> >>>>> Regards >>>>> Praveenesh >>>>> >>>> >>>> >>>> >>>> -- >>>> Eugene Kirpichov >>>> http://www.linkedin.com/in/eugenekirpichov >>>> http://jkff.info/software/timeplotters - my performance visualization >>> tools >>> >> >> >> >> -- >> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com