[ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732766#action_12732766
 ] 

Ying He commented on PIG-792:
-----------------------------

For MPCompiler, the job parallelism is reset to handle the case where 
parallelism is not specified. In that case, the sampling process uses (0.9 * 
default reducers) as the total number of reducers when allocating reducers to 
skewed keys, so the next MR job should use that same value as its parallelism. 
If parallelism is specified, the rp returned from the sampling process is equal 
to the original value of op.
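
As a rough illustration of the rule above, here is a minimal sketch in Java. 
The class and method names are hypothetical and are not taken from the patch; 
treating a non-positive value as "not specified", truncating the 0.9 factor 
with integer arithmetic, and the floor of one reducer are all assumptions.

```java
public class SkewedJoinParallelism {
    /**
     * Hypothetical helper: returns the reducer count that both the sampling
     * step and the following join job would share.
     *
     * @param requested       parallelism set by the user; a value <= 0 is
     *                        assumed here to mean "not specified"
     * @param defaultReducers the default number of reducers
     */
    static int effectiveParallelism(int requested, int defaultReducers) {
        if (requested > 0) {
            // Parallelism was specified, so it is kept unchanged.
            return requested;
        }
        // Not specified: use 0.9 * default reducers. Integer arithmetic
        // truncates the result (an assumption), with a minimum of one reducer.
        return Math.max(1, defaultReducers * 9 / 10);
    }

    public static void main(String[] args) {
        System.out.println(effectiveParallelism(-1, 10)); // unspecified
        System.out.println(effectiveParallelism(5, 10));  // specified
    }
}
```

The point of resetting the value is that the sampler and the join job agree on 
one reducer count, since the reducer allocation for skewed keys was computed 
against that total.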

The format of the sampling output file is documented in SkewedPartitioner.

POSkewedJoinFileSetter is removed; its logic is added to SampleOptimizer.

MapReduceOper keeps the name of the sampling output file, so that 
MapReduceLauncher can set this file name in the jobconf of the join job.
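
The hand-off described above might look roughly like the following sketch. It 
is self-contained and does not use the real Pig or Hadoop classes: 
JoinJobOper stands in for MapReduceOper, buildJobConf stands in for the 
launcher step, java.util.Properties stands in for the jobconf, and the 
configuration key is made up for illustration.

```java
import java.util.Properties;

public class SampleFileWiring {
    // Hypothetical configuration key; the real key name is not given here.
    static final String SAMPLE_FILE_KEY = "example.skewedjoin.sample.file";

    // Stand-in for the operator that represents one MR job in the plan.
    static class JoinJobOper {
        private String sampleFile; // name of the sampling output file, if any

        void setSampleFile(String sampleFile) { this.sampleFile = sampleFile; }
        String getSampleFile() { return sampleFile; }
    }

    // Stand-in for the launcher step that builds the configuration of the
    // join job and copies the sampling file name into it.
    static Properties buildJobConf(JoinJobOper oper) {
        Properties conf = new Properties();
        if (oper.getSampleFile() != null) {
            conf.setProperty(SAMPLE_FILE_KEY, oper.getSampleFile());
        }
        return conf;
    }

    public static void main(String[] args) {
        JoinJobOper oper = new JoinJobOper();
        oper.setSampleFile("/tmp/sample-output"); // illustrative path only
        System.out.println(buildJobConf(oper).getProperty(SAMPLE_FILE_KEY));
    }
}
```

Keeping the file name on the operator means the launcher needs no separate 
physical operator just to carry it, which fits the removal of 
POSkewedJoinFileSetter.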

> PERFORMANCE: Support skewed join in pig
> ---------------------------------------
>
>                 Key: PIG-792
>                 URL: https://issues.apache.org/jira/browse/PIG-792
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Sriranjan Manjunath
>         Attachments: skewedjoin.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
