Thejas M Nair commented on PIG-1062:

Skew join uses the total number of input tuples, in 
PartitionSkewedKeys.calculateReducers(..), to calculate the number of reducers.
In the version in trunk, PoissonSampleLoader adds the size on disk of each 
sampled tuple as the last column of the tuple. PartitionSkewedKeys uses this 
to calculate the average tuple size on disk. The total number of tuples is 
then estimated as input-file-size / avg-size-of-tuple-on-disk.
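
As a rough illustration of the estimate described above, here is a minimal 
sketch in Python. It is not the Pig code; the function name 
estimate_total_tuples and its arguments are hypothetical and only mirror the 
arithmetic (input size divided by the average sampled tuple size).

```python
def estimate_total_tuples(input_size_bytes, sampled_tuple_sizes):
    """Estimate the tuple count as input size / average sampled tuple size.

    Hypothetical helper for illustration only; not part of Pig.
    """
    if not sampled_tuple_sizes:
        raise ValueError("need at least one sampled tuple size")
    avg_size = sum(sampled_tuple_sizes) / len(sampled_tuple_sizes)
    if avg_size <= 0:
        raise ValueError("average tuple size must be positive")
    # Round to the nearest whole tuple; the result is only an estimate.
    return int(round(input_size_bytes / avg_size))


# Usage: a 1200-byte input whose sampled tuples average 10 bytes on disk.
estimated = estimate_total_tuples(1200, [8, 12, 10])
```

The estimate is only as good as the sample: if the sampled tuples are not 
representative of the sizes on disk, the tuple count will be off by the same 
factor.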

But with the new interface, the size on disk of a tuple cannot be estimated 
(there is no getPosition). Also, the size of the input on disk cannot be 
estimated if the input is not from a file, or if the load function is passed 
some metadata instead of a file name.

Ideally this information should be obtained through ResourceStatistics in the 
proposal. Since that is not available right now, here is another proposal:

PoissonSampleLoader currently reads almost all the rows because it tries to 
sample evenly spaced tuples from the split. It will now read till the last 
tuple, and add an additional tuple that holds the number of tuples in that 
split. This special tuple needs to be distinguished from the sampled tuples. 
I don't have a good way to do that except for using two columns: the first 
column holds a unique marker string, and the second column holds the number 
of rows. Does anybody have better suggestions?

PartitionSkewedKeys will look at all these special rows and add up their row 
counts to get the total number of rows.
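
To make the proposal concrete, here is a minimal sketch in Python rather than 
the actual Pig classes. The marker value ROW_COUNT_MARKER and the functions 
sample_split and split_samples_and_total are hypothetical names; the sampling 
rule (every n-th row) is a stand-in for whatever PoissonSampleLoader really 
does.

```python
# Hypothetical marker string; a real one must not collide with user data.
ROW_COUNT_MARKER = "__row_count_marker__"


def sample_split(rows, every_nth):
    """Yield every n-th row of a split, then one (marker, row count) tuple."""
    count = 0
    for row in rows:
        if count % every_nth == 0:
            yield tuple(row)
        count += 1
    # Special trailing tuple: marker in column 1, rows in this split in column 2.
    yield (ROW_COUNT_MARKER, count)


def split_samples_and_total(sampled):
    """Separate sampled tuples from marker tuples and sum the row counts."""
    samples = []
    total_rows = 0
    for t in sampled:
        if len(t) == 2 and t[0] == ROW_COUNT_MARKER:
            total_rows += t[1]
        else:
            samples.append(t)
    return samples, total_rows


# Usage: two splits with 10 and 7 rows, sampling every 3rd row.
split_a = [(i, "a") for i in range(10)]
split_b = [(i, "b") for i in range(7)]
combined = list(sample_split(split_a, 3)) + list(sample_split(split_b, 3))
samples, total_rows = split_samples_and_total(combined)
```

The sketch also shows the weakness raised above: a genuine two-column tuple 
whose first column happens to equal the marker string would be miscounted as 
a row-count tuple.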

> load-store-redesign branch: change SampleLoader and subclasses to work with 
> new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Task
>            Reporter: Thejas M Nair
> This is part of the effort to implement new load store interfaces as laid out 
> in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader) need 
> to be changed to work with the new LoadFunc interface.
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join. 

