[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789046#action_12789046
 ] 

Thejas M Nair commented on PIG-1143:
------------------------------------

Pig input does not have to be a file; the LoadFunc could be reading from HBase 
or some other source. So the use of FileLocalizer.getSize(fname,pcProps) will 
not work in all cases.
InputSplit.getLength() can be used instead, but as per the documentation, the 
purpose of InputSplit.getLength() is "so that the input splits can be sorted 
by size". So implementations might just return a number that is proportional 
to the size if they don't have access to the actual size. 

Even if the actual file size on disk is available through 
InputSplit.getLength(), in the case of columnar storage the compression ratio 
can be very high (e.g. run-length encoding of a column that is the sort key 
and has only a few unique values), and we might end up sampling very little.
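
To illustrate the concern, here is a rough sketch with made-up numbers. The 
rule "one sample per bytesPerSample bytes of reported input" and the names 
SampleEstimate and samplesFor are assumptions for illustration only, not the 
formula or code the Poisson sample loader actually uses. The long[] stands in 
for the values that getLength() would report for each split.

```java
public class SampleEstimate {
    // Hypothetical rule: one sample per `bytesPerSample` bytes of reported input.
    // `reportedSplitLengths` stands in for the per-split getLength() values.
    static long samplesFor(long[] reportedSplitLengths, long bytesPerSample) {
        long total = 0;
        for (long len : reportedSplitLengths) {
            total += len;
        }
        // Integer division: small reported sizes round down to zero samples.
        return total / bytesPerSample;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Two splits holding the same rows, reported at raw size...
        long[] uncompressed = {1000 * mb, 1000 * mb};
        // ...and at on-disk size after an assumed 100x columnar compression.
        long[] compressed = {10 * mb, 10 * mb};

        System.out.println(samplesFor(uncompressed, 100 * mb)); // prints 20
        System.out.println(samplesFor(compressed, 100 * mb));   // prints 0
    }
}
```

The row count is identical in both cases, yet the estimate based on the 
on-disk size yields no samples at all.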

> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
>
> The current poisson sampler forces each of the maps to compute the sample 
> number. This is redundant and causes issues when a large directory is 
> specified in the join. The sampler should be changed to calculate the sample 
> count only once and this information should be shared with the remaining 
> mappers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
