Jeff Lord created PIG-2690:
------------------------------

             Summary: Pig Documentation regarding Merge Join is confusing
                 Key: PIG-2690
                 URL: https://issues.apache.org/jira/browse/PIG-2690
             Project: Pig
          Issue Type: Improvement
          Components: documentation, site
    Affects Versions: 0.8.1, 0.7.0
            Reporter: Jeff Lord


The documentation on merge joins in Pig is a bit off.

http://pig.apache.org/docs/r0.7.0/piglatin_ref1.html#Merge+Joins

"For optimal performance, each part file of the left (sorted) input of the join 
should have a size of at least 1 hdfs block size (for example if the hdfs block 
size is 128 MB, each part file should be less than 128 MB). If the total input 
size (including all part files) is greater than blocksize, then the part files 
should be uniform in size (without large skews in sizes)."

This is confusing: the rule says "at least 1 hdfs block size", but the example 
says "less than 128 MB". It should read more like the wiki version:
http://wiki.apache.org/pig/PigMergeJoin

For optimal performance, each part file of the left (sorted) input of the join 
should have a size of at least 1 hdfs block size (for example if the hdfs block 
size is 128 MB, each part file should be > 128 MB). If the total input size 
(including all part files) is < a blocksize, then the part files should be 
uniform in size (without large skews in sizes). The main idea is to eliminate 
skew in the amount of input the final map job performing the merge-join will 
process.
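
To make the proposed wording concrete, here is a minimal Python sketch of the 
two checks it describes. It is only an illustration of the guideline, not 
anything Pig ships: the function name, the way part-file sizes are supplied, 
and the max_skew threshold (largest/smallest ratio) are all assumptions.

```python
MB = 1024 * 1024


def check_merge_join_parts(part_sizes, block_size=128 * MB, max_skew=2.0):
    """Return warnings about the part files of the left (sorted) join input.

    part_sizes: size in bytes of each part file (hypothetical input; collect
                the sizes however is convenient for your cluster).
    block_size: HDFS block size in bytes (128 MB in the example above).
    max_skew:   assumed limit on the largest/smallest size ratio; the Pig
                documentation does not define a number for "large skew".
    """
    if not part_sizes:
        return ["no part files given"]

    warnings = []
    total = sum(part_sizes)
    if total >= block_size:
        # Input spans at least one block: each part file should itself be
        # at least one block in size.
        small = [s for s in part_sizes if s < block_size]
        if small:
            warnings.append(f"{len(small)} part file(s) smaller than one block")
    else:
        # Total input is under one block: part files should be uniform.
        smallest, largest = min(part_sizes), max(part_sizes)
        if smallest == 0 or largest / smallest > max_skew:
            warnings.append("part files are skewed in size")
    return warnings


if __name__ == "__main__":
    # Example: one part file is well under the 128 MB block size.
    for message in check_merge_join_parts([200 * MB, 50 * MB]):
        print(message)
```

Both branches serve the goal stated above: keeping the amount of input seen 
by each map task of the final merge-join job roughly even.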




        
