[ 
https://issues.apache.org/jira/browse/PIG-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4148:
-------------------------------
    Attachment: PIG-4148-1.patch

The patch changes the number of samples passed to FindQuantiles to the sampling parallelism multiplied by the per-task sample size.
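
As a rough sketch of that arithmetic only (the class and method names below are hypothetical and are not taken from the patch or from any Pig API), the intended sample count for the example in the issue description could be computed like this:

```java
public class SampleCountSketch {

    /**
     * Hypothetical helper: total number of samples expected by the
     * quantile computation, assuming every sampling task contributes
     * the same number of sampled records.
     */
    static long totalSamples(int samplingParallelism, int perTaskSampleSize) {
        if (samplingParallelism <= 0 || perTaskSampleSize <= 0) {
            throw new IllegalArgumentException(
                "parallelism and per-task sample size must be positive");
        }
        // Widen to long before multiplying to avoid int overflow.
        return (long) samplingParallelism * perTaskSampleSize;
    }

    public static void main(String[] args) {
        // Numbers from the issue description: 300 tasks, 100 records each.
        System.out.println(totalSamples(300, 100)); // prints 30000
    }
}
```

With the values from the issue (parallelism 300, 100 records per task) this gives 30000 rather than the 300 seen in the plan.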

> Tez order-by is often skewed because FindQuantiles UDF is called with small number
> ----------------------------------------------------------------------------------
>
>                 Key: PIG-4148
>                 URL: https://issues.apache.org/jira/browse/PIG-4148
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.14.0
>
>         Attachments: PIG-4148-1.patch
>
>
> In Tez, the FindQuantiles UDF is called with fewer samples than in MR, 
> resulting in skewed range partitions.
> For example, I have a job that runs sampling with a parallelism of 300. Since 
> each task samples 100 records, the total sample should be 30K. But the 
> FindQuantiles UDF is called with only 300 records:
> {code}
> # Plan on vertex
> POValueOutputTez - scope-282    ->   [scope-283]
> |
> |---New For Each(false)[tuple] - scope-281
>     |   |
>     |   POUserFunc(org.apache.pig.backend.hadoop.executionengine.tez.FindQuantilesTez)[tuple] - scope-280
>     |   |
>     |   |---Project[tuple][*] - scope-279
>     |
>     |---New For Each(false,false)[tuple] - scope-278
>         |   |
>         |   Constant(300) - scope-277 <--- 300 should be 30K!
>         |   |
>         |   Project[bag][1] - scope-275
>         |
>         |---Package(Packager)[tuple]{bytearray} - scope-274
> {code}
> This is because we set the number of samples to the parallelism of the 
> sampling vertex.
> {code}
> // We temporarily set it to rp and will adjust it at runtime, because the
> // final degree of parallelism is unknown until we are ready to submit it.
> // See PIG-2779.
> rpce.setValue(rp);
> {code}



