[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789063#action_12789063
 ] 

Sriranjan Manjunath commented on PIG-1143:
------------------------------------------

I am OK with using InputSplits.getLength() as long as these provide you a good 
estimate of the file size. Without the population size, Poisson samplers do not 
work well.
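
As a rough illustration of the idea, here is a minimal Java sketch that sums 
per-split lengths into a file-size estimate and derives a sample count from it 
in one place. It is not the Pig implementation: the class name, the average 
tuple size, and the sample rate are all hypothetical, and the long[] array 
only stands in for the values that getLength() would return for each split.

```java
import java.util.Arrays;

// Hypothetical sketch, not Pig's PoissonSampleLoader.
final class SampleCountSketch {

    // Estimate total input size by summing per-split lengths.
    // The array stands in for the getLength() value of each split.
    static long estimateTotalBytes(long[] splitLengths) {
        return Arrays.stream(splitLengths).sum();
    }

    // Assumption: population size ~ total bytes / average tuple size.
    // avgTupleBytes is an illustrative parameter, not a Pig setting.
    static long estimateTupleCount(long totalBytes, long avgTupleBytes) {
        if (avgTupleBytes <= 0) {
            throw new IllegalArgumentException("avgTupleBytes must be positive");
        }
        return totalBytes / avgTupleBytes;
    }

    // Assumption: sample count = ceil(population * sampleRate).
    // Computed once; the result could then be shared with every mapper.
    static long sampleCount(long tupleCount, double sampleRate) {
        if (sampleRate < 0.0 || sampleRate > 1.0) {
            throw new IllegalArgumentException("sampleRate must be in [0, 1]");
        }
        return (long) Math.ceil(tupleCount * sampleRate);
    }
}
```

The point of the sketch is only that the population estimate depends on the 
split lengths being a fair proxy for file size. If they are not, every number 
derived from them is off as well.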

Samplers expect the data to be in BinStorage. If it is not, the first job reads 
it and stores it in BinStorage. The only exception is when the join follows a 
load/store-only MR job.


> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
>
> The current poisson sampler forces each of the maps to compute the sample 
> number. This is redundant and causes issues when a large directory is 
> specified in the join. The sampler should be changed to calculate the sample 
> count only once and this information should be shared with the remaining 
> mappers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
