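One thing I want to clarify: you can use multiple reducers to sort the data globally, and then cat all the parts in order to get the top n records. Provided the job partitions keys by range (for example with Hadoop's TotalOrderPartitioner), the records across all the part files are globally ordered, so you do not need a single reducer. That should make the problem much easier.

On 2013-2-2 at 3:18 PM, "praveenesh kumar" <praveen...@gmail.com> wrote: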
> Actually, what I am trying to find is the top n% of the whole data.
> This n could be very large if my data is large.
>
> Assuming I have uniform rows of equal size and the total data size
> is 10 GB, using the above-mentioned approach, if I have to take the top
> 10% of the whole data set, I need 10% of 10 GB, which could be rows
> worth 1 GB (roughly) in my mappers.
> I think that would not be possible given that my input splits are
> 64/128/512 MB (based on my block size), or am I making wrong
> assumptions? I can increase the input split size, but is there a better
> way to find the top n%?
>
>
> My whole actual problem is to give ranks to some values and then find
> out the top 10 ranks.
>
> I think this context can give a better idea of the problem?
>
> Regards
> Praveenesh
>
> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ekirpic...@gmail.com>
> wrote:
> > Hi,
> >
> > Can you tell more about:
> > * How big is N
> > * How big is the input dataset
> > * How many mappers you have
> > * Do input splits correlate with the sorting criterion for top N?
> >
> > Depending on the answers, very different strategies will be optimal.
> >
> >
> >
> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <praveen...@gmail.com>
> > wrote:
> >
> >> I am looking for a better solution for this.
> >>
> >> One way to do this would be to find the top N values from each mapper
> >> and then find the top N of them in 1 reducer. I am afraid that
> >> this won't work effectively if my N is larger than the number of
> >> values in my input split (or mapper input).
> >>
> >> The other way is to just sort all of them in 1 reducer and then cat
> >> the top N.
> >>
> >> Wondering if there is any better approach to do this?
> >>
> >> Regards
> >> Praveenesh
> >>
> >
> >
> >
> > --
> > Eugene Kirpichov
> > http://www.linkedin.com/in/eugenekirpichov
> > http://jkff.info/software/timeplotters - my performance visualization tools
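
To make the multi-reducer suggestion at the top of this message concrete, here is a minimal sketch in plain Python. It is a simulation, not Hadoop code: the function names (pick_split_points, range_partition, top_percent), the sample size, and the reducer count are all hypothetical and chosen only for illustration. The idea is that split points taken from a small sample route each key to a reducer by range, each reducer sorts only its own partition, and reading the parts from the highest range down gives a globally descending order from which the top n% can be cut.

```python
import bisect
import math
import random


def pick_split_points(sample, num_reducers):
    """Choose num_reducers - 1 range boundaries from a sample of the keys."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def range_partition(values, split_points):
    """Route each value to the partition whose key range contains it."""
    parts = [[] for _ in range(len(split_points) + 1)]
    for v in values:
        parts[bisect.bisect_right(split_points, v)].append(v)
    return parts


def top_percent(values, percent, num_reducers=4, sample_size=100, seed=0):
    """Return the largest `percent` of values, in descending order."""
    if not values:
        return []
    # Sample the input to estimate the key distribution (assumed step).
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    split_points = pick_split_points(sample, num_reducers)
    parts = range_partition(values, split_points)
    # Each simulated reducer sorts only its own partition, largest first.
    sorted_parts = [sorted(p, reverse=True) for p in parts]
    # Reading parts from the highest key range down is globally descending.
    merged = [v for part in reversed(sorted_parts) for v in part]
    keep = math.ceil(len(values) * percent / 100)
    return merged[:keep]


if __name__ == "__main__":
    data = list(range(1000))
    random.Random(1).shuffle(data)
    top = top_percent(data, 10)
    print(len(top), top[0], top[-1])
```

No single partition has to hold the whole top n% here, which is the concern raised in the quoted message about per-mapper top-N lists. In a real job only the leading part files would need to be read until enough records are collected.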