[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788971#action_12788971 ]
Sriranjan Manjunath commented on PIG-1143:
------------------------------------------

To describe the problem in more detail: the current implementation does not handle a glob efficiently. When the sample loader encounters a directory (or a combination of directories), it fetches the element descriptors of all the files inside the directory to compute the file sizes. For example, A = load "{view, click}" results in computing the file sizes of all the files underneath both the "view" and "click" directories. With a large number of mappers, this results in a very large number of HDFS calls, clogging the name node.

I intend to modify Poisson Sample Loader as follows. The algorithm for computing the total number of samples remains the same. However, it will no longer be computed by every mapper. I will use the UDFContext object to share this information across mappers. Since mappers and reducers can only read information from UDFContext, the slicer will store it.

The slicer will compute the sample count for the first map. As before, PigSlice will call computeSamples() for the first map. It will then store this value as a property in the UDFContext object. The slicer will check UDFContext to see whether this value is set and, if it is, use it instead of computing it again. I intend to use "pig.input.0.sampleCount" as the key.

This solution will reduce the fileSize() invocations to a minimum and should reduce the burden on the name node.

> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
>
> The current poisson sampler forces each of the maps to compute the sample
> number. This is redundant and causes issues when a large directory is
> specified in the join.
> The sampler should be changed to calculate the sample
> count only once and this information should be shared with the remaining
> mappers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
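
The "compute once, then reuse" step proposed in the comment above can be sketched roughly as follows. This is only an illustration and not the Pig implementation: a plain java.util.Properties object stands in for the properties that would be shared through UDFContext, and SampleCountCache, getOrCompute, and the LongSupplier argument are hypothetical names introduced here. Only the key "pig.input.0.sampleCount" and the idea of a computeSamples() step come from the comment.

```java
import java.util.Properties;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Pig: shows the caching idea in isolation.
class SampleCountCache {
    // Key proposed in the comment.
    static final String SAMPLE_COUNT_KEY = "pig.input.0.sampleCount";

    // Returns the stored sample count if the key is already set; otherwise
    // runs the expensive computation once and stores the result for reuse.
    static long getOrCompute(Properties props, LongSupplier computeSamples) {
        String cached = props.getProperty(SAMPLE_COUNT_KEY);
        if (cached != null) {
            return Long.parseLong(cached);
        }
        long count = computeSamples.getAsLong();
        props.setProperty(SAMPLE_COUNT_KEY, Long.toString(count));
        return count;
    }

    public static void main(String[] args) {
        // Assumption: stands in for the shared UDFContext properties.
        Properties props = new Properties();
        int[] calls = {0};
        // Assumption: stands in for the file-size based sample computation.
        LongSupplier expensive = () -> { calls[0]++; return 1000L; };

        long first = getOrCompute(props, expensive);   // computes and stores
        long second = getOrCompute(props, expensive);  // reads the stored value
        System.out.println(first + " " + second + " computed " + calls[0] + " time(s)");
    }
}
```

Storing the count as a string property mirrors the fact that the shared object is a set of key/value properties, so later readers only need a lookup and a parse, with no further file-size calls.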