Building on this, you could do something like the following to make it more random:

if (numRecordsWritten < NUM_RECORDS_DESIRED) {
        // pick a random number between 1 and 100
        // (random is a java.util.Random field on the mapper)
        int n = random.nextInt(100) + 1;
        if (n == 100) {
                context.write(key, value);
                numRecordsWritten++; // count it so we stop at the cap
        }
}

The above would somewhat randomly output about 1 record in every 100, up to the specified maximum (NUM_RECORDS_DESIRED), and discard all the rest.
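For completeness, here's roughly how that might look as a full mapper. This is a minimal sketch only: the class name, the key/value types, and the value of NUM_RECORDS_DESIRED are placeholders I've assumed, using the org.apache.hadoop.mapreduce API:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits roughly 1 in 100 input records, capped at NUM_RECORDS_DESIRED
// per mapper task; everything else is discarded.
public class RandomSampleMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private static final int NUM_RECORDS_DESIRED = 1000; // placeholder cap
    private final Random random = new Random();
    private int numRecordsWritten = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (numRecordsWritten < NUM_RECORDS_DESIRED) {
            // nextInt(100) is uniform over 0..99, so ~1% of records pass
            if (random.nextInt(100) == 0) {
                context.write(key, value);
                numRecordsWritten++;
            }
        }
    }
}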

HTH,

DR

On 06/27/2011 03:28 PM, Niels Basjes wrote:
The only solution I can think of is to create a counter in Hadoop
that is incremented each time a mapper lets a record through.
As soon as the value reaches a preselected value, the mappers simply
discard the additional input they receive.

Note that this will not be random at all... yet it's the best I can
come up with right now.
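
In code, that might look roughly like this (a sketch under assumptions: the counter enum name and the preselected value are made up, and since a task only reads its own counter value, the cutoff is per mapper rather than truly global):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CappedMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    // Hypothetical counter; incremented each time a record is let through.
    public enum Records { PASSED }

    private static final long PRESELECTED_MAX = 1000; // assumed per-mapper cutoff

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The value read here is this task's own count, so the cap is per mapper.
        if (context.getCounter(Records.PASSED).getValue() < PRESELECTED_MAX) {
            context.write(key, value);
            context.getCounter(Records.PASSED).increment(1);
        }
    }
}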

HTH

On Mon, Jun 27, 2011 at 09:11, Jeff Zhang <zjf...@gmail.com> wrote:

Hi all,
I'd like to select N random records from a large amount of data using
Hadoop; I just wonder how I can achieve this. Currently my idea is to let
each mapper task select N / mapper_number records. Does anyone have such
experience?

--
Best Regards

Jeff Zhang
