The only solution I can think of is to create a counter in Hadoop that is incremented each time a mapper lets a record through. As soon as the counter reaches a preselected value, the mappers simply discard any additional input they receive.
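A minimal sketch of what I mean (class name, counter names and the config key are made up for illustration; also note that during execution a Hadoop counter's getValue() only reflects the current task's own count, not a live global total, so in practice you would set the limit to something like N / number_of_mappers per task):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Passes through at most 'limit' records per map task, counting them with a
// Hadoop counter; everything beyond the limit is silently discarded.
public class LimitedSampleMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

    // Hypothetical config key; set it to roughly N / number_of_mappers.
    public static final String LIMIT_KEY = "sample.records.per.mapper";

    private long limit;

    @Override
    protected void setup(Context context) {
        limit = context.getConfiguration().getLong(LIMIT_KEY, 1000);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // getValue() here only sees this task's own count, not other mappers'.
        long emitted = context.getCounter("sampling", "emitted").getValue();
        if (emitted < limit) {
            context.write(key, value);
            context.getCounter("sampling", "emitted").increment(1);
        }
        // Otherwise drop the record.
    }
}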
Note that this will not be random at all... yet it's the best I can come up with right now.

HTH

On Mon, Jun 27, 2011 at 09:11, Jeff Zhang <zjf...@gmail.com> wrote:
>
> Hi all,
> I'd like to select N random records from a large amount of data using
> hadoop, just wondering how I can achieve this? Currently my idea is to let
> each mapper task select N / mapper_number records. Does anyone have such
> experience?
>
> --
> Best Regards
>
> Jeff Zhang

--
Best regards / Met vriendelijke groeten,

Niels Basjes