Sriranjan Manjunath commented on PIG-1143:

To describe the problem in more detail, the current implementation does not 
handle a glob efficiently. When the sample loader encounters a directory (or a 
combination of directories), it fetches the element descriptors of all the 
files inside it to compute their sizes.
For example, A = load "{view, click}" will result in computing the file sizes 
of all the files underneath both the "view" and "click" directories. If we have 
a large number of mappers, this will result in a large number of HDFS calls, 
clogging the name node.

I intend to modify Poisson Sample Loader as follows. The algorithm for 
computing the total number of samples remains the same. However, it will no 
longer be computed by every mapper. I will use the UDFContext object to share 
this information across mappers. Since mappers and reducers can only read 
information from UDFContext, the slicer will store it. The slicer will compute 
the sample count for the first map: as before, PigSlice will call 
computeSamples() for the first map, and the resulting value will then be stored 
as a property in the UDFContext object. The slicer will check UDFContext to see 
whether this value is set and, if it is, use it instead of computing it again. 
I intend to use "pig.input.0.sampleCount" as the key.
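
The compute-once lookup described above could be sketched roughly as follows. 
This is only an illustration under stated assumptions, not the actual patch: a 
plain java.util.Properties object stands in for the properties held by 
UDFContext, and the class name SampleCountCache, the method getOrCompute and 
the LongSupplier argument (standing in for computeSamples()) are hypothetical. 
Only the key "pig.input.0.sampleCount" comes from the proposal itself.

```java
import java.util.Properties;
import java.util.function.LongSupplier;

// Hypothetical helper: a Properties object stands in for the properties
// that would be shared through UDFContext in the real implementation.
final class SampleCountCache {
    // Key proposed in the comment above.
    static final String SAMPLE_COUNT_KEY = "pig.input.0.sampleCount";

    private SampleCountCache() {}

    // Returns the stored sample count if present; otherwise runs the
    // (expensive) computation once and stores its result under the key.
    static long getOrCompute(Properties props, LongSupplier computeSamples) {
        String cached = props.getProperty(SAMPLE_COUNT_KEY);
        if (cached != null) {
            return Long.parseLong(cached);
        }
        long count = computeSamples.getAsLong();
        props.setProperty(SAMPLE_COUNT_KEY, Long.toString(count));
        return count;
    }
}
```

With this shape, only the first caller pays for the file size lookups. Every 
later caller that sees the same properties reads the stored value and never 
runs the computation.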

This solution will reduce the fileSize() invocations to a minimum and should 
reduce the burden on the name node.

> Poisson Sample Loader should compute the number of samples required only once
> -----------------------------------------------------------------------------
>                 Key: PIG-1143
>                 URL: https://issues.apache.org/jira/browse/PIG-1143
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Sriranjan Manjunath
>            Assignee: Sriranjan Manjunath
> The current poisson sampler forces each of the maps to compute the sample 
> number. This is redundant and causes issues when a large directory is 
> specified in the join. The sampler should be changed to calculate the sample 
> count only once and this information should be shared with the remaining 
> mappers.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
