[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789013#action_12789013
 ] 

Thejas M Nair commented on PIG-1143:
------------------------------------

The PoissonSampleLoader implementation in the load-store redesign does not 
check the file size. It takes a different approach, for the following reason 
(as mentioned in PIG-1062):

With the new interfaces in the load-store redesign, Pig can compute the file 
size by adding up the size of each split (from InputSplit.getLength()). But 
the documentation of that function does not make clear whether this is the 
size on disk, compressed or uncompressed, etc. It looks like it only needs to 
be some number proportional to the size of the file. Assuming it is the size 
on disk (uncompressed), using it to estimate the total memory required is 
tricky: one has to make assumptions about the compression ratio and the 
serialization method. Using Tuple.getMemorySize() while sampling will give 
more accurate numbers for the memory the reducer will consume.
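
To make the contrast concrete, here is a rough, self-contained sketch of the 
two ways of sizing. It is not the PoissonSampleLoader code: SamplerSizing and 
its methods are hypothetical names, plain long values stand in for what 
InputSplit.getLength() and Tuple.getMemorySize() would return, and the two 
multipliers in the first approach are exactly the guesses described above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of Pig.
// Plain longs stand in for the values that InputSplit.getLength() and
// Tuple.getMemorySize() would return in a real job.
public class SamplerSizing {

    // Approach 1: add up the length reported by each split.
    public static long totalSplitLength(List<Long> splitLengths) {
        long total = 0;
        for (long len : splitLengths) {
            total += len;
        }
        return total;
    }

    // Turning that total into a memory estimate needs two guessed factors:
    // how much the data expands when decompressed, and how much larger the
    // deserialized tuples are than their serialized form.
    public static long memoryFromSplitLength(long totalLength,
                                             double assumedCompressionRatio,
                                             double assumedInMemoryExpansion) {
        return (long) (totalLength * assumedCompressionRatio
                * assumedInMemoryExpansion);
    }

    // Approach 2: average the in-memory size measured on the sampled tuples
    // themselves, so no compression or serialization guess is needed.
    public static long averageTupleMemory(List<Long> sampledTupleSizes) {
        if (sampledTupleSizes.isEmpty()) {
            return 0;
        }
        long sum = 0;
        for (long size : sampledTupleSizes) {
            sum += size;
        }
        return sum / sampledTupleSizes.size();
    }

    public static void main(String[] args) {
        // Made-up numbers, chosen only to exercise the methods.
        long onDisk = totalSplitLength(Arrays.asList(1000L, 1000L, 500L));
        System.out.println("sum of split lengths: " + onDisk);
        System.out.println("guessed memory: "
                + memoryFromSplitLength(onDisk, 4.0, 2.0));
        System.out.println("measured bytes per tuple: "
                + averageTupleMemory(Arrays.asList(120L, 80L, 100L)));
    }
}
```

The first estimate is only as good as the two assumed factors, while the 
second is measured on the tuples that are actually sampled.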


> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
>
> The current poisson sampler forces each of the maps to compute the sample 
> number. This is redundant and causes issues when a large directory is 
> specified in the join. The sampler should be changed to calculate the sample 
> count only once and this information should be shared with the remaining 
> mappers.
