Sorry for the very slow response, but here it is, hopefully better late than 
never.

On Jul 25, 2012, at 4:28 PM, Prasanth J wrote:

> Thanks Alan.
> My requirement is to load N samples based on the input file size and perform 
> naive cube computation to determine the large groups that will not fit in a 
> reducer's memory. I need to know the exact number of samples in order to 
> calculate the partition factor for the large groups. 
> Currently I am using RandomSampleLoader to load 1000 tuples from each mapper. 
> Without knowing the number of mappers I cannot find the exact number of 
> samples loaded. Also, RandomSampleLoader doesn't attach a special marker 
> tuple (as PoissonSampleLoader does) that tells how many samples were loaded. 
> Is there any other way to know the exact number of samples loaded? 
Not that I know of.
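
One workaround, if writing a custom sampler is acceptable, is the marker-tuple 
idea mentioned above: have each sampling task append one extra tuple carrying 
its own sample count, then sum those counts on the consumer side. Below is a 
minimal sketch of that idea; it is not Pig's PoissonSampleLoader 
implementation, and the class name and marker value are made up for 
illustration.

    // Illustrative sketch of the marker-tuple idea; not Pig's PoissonSampleLoader.
    // Each sampling task appends one marker tuple carrying its sample count, and
    // the consumer sums those counts to recover the exact total.
    import java.util.List;

    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class SampleCountMarker {

        // Hypothetical marker value; anything that cannot collide with real data works.
        public static final String MARKER = "__sample_count__";

        private static final TupleFactory tf = TupleFactory.getInstance();

        // Called once per sampling task, after it has emitted its samples.
        public static Tuple makeMarker(long samplesEmitted) throws Exception {
            Tuple t = tf.newTuple(2);
            t.set(0, MARKER);
            t.set(1, samplesEmitted);
            return t;
        }

        // Consumer side: separate marker tuples from real samples and return the
        // exact total number of samples across all sampling tasks.
        public static long totalSamples(Iterable<Tuple> tuples, List<Tuple> samplesOut)
                throws Exception {
            long total = 0;
            for (Tuple t : tuples) {
                if (t.size() == 2 && MARKER.equals(t.get(0))) {
                    total += ((Number) t.get(1)).longValue();
                } else {
                    samplesOut.add(t);
                }
            }
            return total;
        }
    }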

> 
> By analyzing the MR plans of order-by and skewed-join, it seems that the 
> entire dataset is copied to a temp file and the SampleLoaders then use that 
> temp file to load samples. Is there any specific reason for this redundant 
> copy? Is it because SampleLoaders can only read Pig's internal I/O format? 
Partly, but also because it allows any operators that need to run before the 
sample (like project or filter) to be placed in the pipeline.
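
For illustration, here is a rough sketch of a script whose ORDER BY triggers 
that plan, written against the PigServer Java API; the input path, schema, and 
output path are made up. Under the plan described above, the first job runs the 
filter and projection and writes the temp copy, the sample job then reads that 
temp copy, and a final job does the ordered sort.

    // Rough illustration; input path, schema, and output path are hypothetical.
    import org.apache.pig.PigServer;

    public class OrderByPipelineSketch {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer("mapreduce");

            pig.registerQuery("A = LOAD 'students.txt' AS (name:chararray, age:int, gpa:double);");
            pig.registerQuery("B = FILTER A BY age > 20;");          // placed before the sample
            pig.registerQuery("C = FOREACH B GENERATE name, gpa;");  // so is the projection

            // ORDER BY compiles (roughly) into: one job that runs the filter and
            // projection and writes the temp file, a sampling job that reads that
            // temp file to build range partitions, and a final job that sorts.
            pig.registerQuery("D = ORDER C BY gpa;");
            pig.store("D", "sorted_students");
        }
    }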

Alan.
