[ 
https://issues.apache.org/jira/browse/PIG-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753547#comment-15753547
 ] 

Xianda Ke commented on PIG-4858:
--------------------------------

Hi [~kellyzly], currently the function getSamplingJob() in PIG-5044's patch is 
not suitable for PIG-4858, because SkewedJoin in Spark, like the one in Tez, 
does not use a UDF for sampling now. It uses POPoissonSampleSpark instead. 

Yes, part of the sampling logic is the same, such as sorting the sampling 
result and setting parallelism.
From my point of view, we can first try to finish these two features 
independently. Then we can refactor the code later on, breaking 
getSamplingJob into small functions and extracting the common ones, so that 
SkewedJoin can re-use them.

> Implement Skewed join for spark engine
> --------------------------------------
>
>                 Key: PIG-4858
>                 URL: https://issues.apache.org/jira/browse/PIG-4858
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-4858.patch, PIG-4858_2.patch, PIG-4858_3.patch, 
> SkewedJoinInSparkMode.pdf
>
>
> Now we use regular join to replace skewed join. Need to implement it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
