To answer my own question: I used a random number generator that
never repeats a value in the mapper. In the mapper's setup stage
I generate a predefined count of distinct random numbers, then I
keep a counter that is incremented for each record the mapper
sees. When the counter's value is in the random number set, the
mapper processes that record and outputs data.

The remaining problem is how to choose the ceiling of the random
number range [1...ceiling]. The ceiling cannot be too small, or
the sample is not valid (only the first records of the split
could ever be picked), and it cannot exceed the total number of
data records in the split. The difficulty is that my data is not
divided by line: a complete record sometimes spans multiple
lines, so I am not sure how to estimate the ceiling. Of course,
if each line were a complete record, the ceiling would be easy
to obtain.
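
A minimal sketch of the counter-based sampling described above, in
plain Java with no Hadoop dependencies so it can be run on its own.
The class name CounterSampler, its methods, and the seed parameter
are hypothetical names made up for illustration. The constructor
stands in for the mapper's setup stage and accept() for a single
map call; the actual mapper code may well look different.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper (not a Hadoop class): mimics the mapper's sampling logic.
final class CounterSampler {
    // Distinct record positions, drawn from [1, ceiling], that will be emitted.
    private final Set<Integer> chosen = new HashSet<>();
    // Number of complete records seen so far (not lines).
    private int counter = 0;

    // Stands in for the mapper's setup stage: draw sampleSize distinct numbers.
    CounterSampler(int sampleSize, int ceiling, long seed) {
        if (sampleSize < 0 || sampleSize > ceiling) {
            throw new IllegalArgumentException("need 0 <= sampleSize <= ceiling");
        }
        Random rng = new Random(seed);
        while (chosen.size() < sampleSize) {
            // nextInt(ceiling) is in [0, ceiling); shift to [1, ceiling].
            // The set silently drops repeated values, so the loop keeps drawing.
            chosen.add(rng.nextInt(ceiling) + 1);
        }
    }

    // Stands in for one map call: true if the current record should be output.
    boolean accept() {
        counter++;
        return chosen.contains(counter);
    }

    // Convenience wrapper: run the sampler over an in-memory list of records.
    static List<String> sample(List<String> records, int sampleSize,
                               int ceiling, long seed) {
        CounterSampler sampler = new CounterSampler(sampleSize, ceiling, seed);
        List<String> out = new ArrayList<>();
        for (String record : records) {
            if (sampler.accept()) {
                out.add(record);
            }
        }
        return out;
    }
}
```

With 100 records, CounterSampler.sample(records, 10, 100, seed)
returns exactly 10 of them. If the ceiling is set to 1000 instead,
some chosen positions are never reached and the sample can come back
with fewer than 10 records, which is why the ceiling has to be
estimated in complete records rather than lines.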
