[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789046#action_12789046 ]
Thejas M Nair commented on PIG-1143:
------------------------------------

Pig input does not have to be a file; the LoadFunc could be reading from HBase or some other source. So FileLocalizer.getSize(fname, pcProps) will not work in all cases.

InputSplit.getLength() can be used instead, but according to its documentation, the purpose of getLength() is "so that the input splits can be sorted by size". An implementation that does not have access to the actual size might therefore return only a number proportional to it.

Even when the actual on-disk file size is available through InputSplit.getLength(), columnar storage can compress very heavily (e.g. run-length encoding of a sort-key column with only a few unique values). The on-disk size would then greatly understate the amount of data, and we might end up sampling very little.

> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
>
> The current poisson sampler forces each of the maps to compute the sample
> number. This is redundant and causes issues when a large directory is
> specified in the join. The sampler should be changed to calculate the sample
> count only once and this information should be shared with the remaining
> mappers.
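
To make the compression concern in the comment concrete, here is a rough arithmetic sketch in Java. It is not Pig's sampler code: the class name, the helper methods, the fixed bytes-per-row figure, the one-sample-per-N-rows rule and the 50:1 compression ratio are all assumptions chosen for illustration. It only shows how a sample count derived from a reported byte length shrinks when that length is the compressed on-disk size.

```java
// Illustrative sketch only; none of these names come from Pig or Hadoop.
class SplitLengthSampling {

    // Hypothetical estimate: rows in a split, derived from the length the
    // split reports and an assumed average row size in bytes.
    static long estimateRows(long reportedBytes, long assumedBytesPerRow) {
        return reportedBytes / assumedBytesPerRow;
    }

    // Hypothetical rule: take one sample per rowsPerSample rows, at least one.
    static long samplesFor(long estimatedRows, long rowsPerSample) {
        return Math.max(1L, estimatedRows / rowsPerSample);
    }

    public static void main(String[] args) {
        long actualRows = 10_000_000L;   // true number of rows in the input
        long bytesPerRow = 100L;         // assumed uncompressed row size
        long rowsPerSample = 10_000L;    // assumed sampling interval

        long uncompressedBytes = actualRows * bytesPerRow;
        long compressedBytes = uncompressedBytes / 50L; // assumed 50:1 compression

        long samplesFromTrueSize =
            samplesFor(estimateRows(uncompressedBytes, bytesPerRow), rowsPerSample);
        long samplesFromDiskSize =
            samplesFor(estimateRows(compressedBytes, bytesPerRow), rowsPerSample);

        System.out.println("samples using uncompressed size: " + samplesFromTrueSize);
        System.out.println("samples using on-disk size:      " + samplesFromDiskSize);
    }
}
```

Under these assumed numbers the on-disk length yields 20 samples where the uncompressed size would yield 1000, which is the "sampling very little" effect described above. A length that is merely proportional to the real size would skew the count in the same way.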