On Mon, Jun 27, 2011 at 12:11 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
> Hi all,
> I'd like to select N random records from a large amount of data using
> Hadoop, and I wonder how I can achieve this. Currently my idea is to let
> each mapper task select N / mapper_number records. Does anyone have
> experience with this?

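I've done this before, and it works fine as long as all of your splits contain the same number of records. If they don't, a record in a small split is more likely to be picked than a record in a large one, so the sample is no longer uniform.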
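
For what it's worth, here is a rough sketch of the per-mapper idea in plain Python. It is not Hadoop API code, just a simulation: each "split" is an ordinary iterable, and the names `reservoir_sample` and `sample_per_split` are made up for illustration. I'm assuming reservoir sampling inside each mapper, since a mapper usually doesn't know how many records it will see; any other way of picking a fixed number of records per split would do as well.

```python
import random


def reservoir_sample(records, k, rng=None):
    """Pick k records uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = record
    return sample


def sample_per_split(splits, n, seed=0):
    """Simulate the scheme: every split contributes n // len(splits) records."""
    per_mapper = n // len(splits)  # any remainder is dropped in this sketch
    rng = random.Random(seed)
    picked = []
    for split in splits:
        picked.extend(reservoir_sample(split, per_mapper, rng))
    return picked


if __name__ == "__main__":
    # Four equally sized splits, so the combined sample is uniform.
    splits = [range(0, 100), range(100, 200), range(200, 300), range(300, 400)]
    print(sample_per_split(splits, 20))
```

With equal splits of size S, every record has the same chance (N / mapper_number) / S of being selected. If one split held 10 records and another 1000, the same per-mapper quota would make a record in the small split 100 times more likely to be chosen.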