Actually what I am trying to find to top n% of the whole data.
This n could be very large if my data is large.

Assuming I have uniform rows of equal size and if the total data size
is 10 GB, using the above mentioned approach, if I have to take top
10% of the whole data set, I need 10% of 10GB which could be rows
worth of 1 GB (roughly) in my mappers.
I think that would not be possible given my input splits are of
64/128/512 MB (based on my block size) or am I making wrong
assumptions. I can increase the inputsplit size, but is there a better
way to find top n%.


My whole actual problem is to give ranks to some values and then find
out the top 10 ranks.

I think this context can give more idea about the problem ?

Regards
Praveenesh

On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ekirpic...@gmail.com> wrote:
> Hi,
>
> Can you tell more about:
>  * How big is N
>  * How big is the input dataset
>  * How many mappers you have
>  * Do input splits correlate with the sorting criterion for top N?
>
> Depending on the answers, very different strategies will be optimal.
>
>
>
> On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <praveen...@gmail.com>wrote:
>
>> I am looking for a better solution for this.
>>
>> 1 way to do this would be to find top N values from each mappers and
>> then find out the top N out of them in 1 reducer.  I am afraid that
>> this won't work effectively if my N is larger than number of values in
>> my inputsplit (or mapper input).
>>
>> Otherway is to just sort all of them in 1 reducer and then do the cat of
>> top-N.
>>
>> Wondering if there is any better approach to do this ?
>>
>> Regards
>> Praveenesh
>>
>
>
>
> --
> Eugene Kirpichov
> http://www.linkedin.com/in/eugenekirpichov
> http://jkff.info/software/timeplotters - my performance visualization tools

Reply via email to