[ https://issues.apache.org/jira/browse/PIG-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheolsoo Park updated PIG-4148:
-------------------------------
Description:
In Tez, the FindQuantiles UDF is called with fewer samples than in MR,
which skews the range partitions.
For example, I have a job that runs sampling with a parallelism of 300. Since
each task samples 100 records, the total sample size should be 30K (300 x 100).
But the FindQuantiles UDF is called with only 300 samples.
was:
In Tez, FindQuantiles UDF is called with a smaller number of samples than MR
resulting in skew in range partitions.
For example, I have a job that runs sampling with a parallelism of 300. Since
each task samples 100 records, the total sample should be 30K. But
FindQuantiles UDF is called with only 300 records-
{code}
# Plan on vertex
POValueOutputTez - scope-282 -> [scope-283]
|
|---New For Each(false)[tuple] - scope-281
| |
| POUserFunc(org.apache.pig.backend.hadoop.executionengine.tez.FindQuantilesTez)[tuple] - scope-280
| |
| |---Project[tuple][*] - scope-279
|
|---New For Each(false,false)[tuple] - scope-278
| |
| Constant(300) - scope-277 <--- 300 should be 30K!
| |
| Project[bag][1] - scope-275
|
|---Package(Packager)[tuple]{bytearray} - scope-274
{code}
This is because we set the number of samples to the parallelism of the sampling
vertex.
{code}
// We temporarily set it to rp and will adjust it at runtime, because the final degree of parallelism
// is unknown until we are ready to submit it. See PIG-2779.
rpce.setValue(rp);
{code}
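The gap between the expected and the actual sample count can be checked with a small standalone sketch. It is not Pig code: the class name `SampleCountSketch` and the helper `expectedTotalSamples` are hypothetical and only reproduce the arithmetic from the example above (300 sampling tasks, 100 records each), assuming every task contributes the same number of records.

```java
public class SampleCountSketch {

    // Hypothetical helper: total number of samples the quantile computation
    // should see when every sampling task contributes the same number of records.
    static long expectedTotalSamples(int samplingParallelism, int samplesPerTask) {
        return (long) samplingParallelism * samplesPerTask;
    }

    public static void main(String[] args) {
        int samplingParallelism = 300; // parallelism of the sampling vertex in the example
        int samplesPerTask = 100;      // records sampled by each task in the example

        long expected = expectedTotalSamples(samplingParallelism, samplesPerTask);
        // The value that ends up in the plan constant when it is set to rp.
        long passed = samplingParallelism;

        System.out.println("expected samples: " + expected);
        System.out.println("passed to UDF:    " + passed);
        System.out.println("shortfall factor: " + (expected / passed));
    }
}
```

With the numbers from this issue, the constant is too small by a factor equal to the per-task sample size (100).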
> Tez order-by is often skewed because FindQuantiles UDF is called with small number
> ----------------------------------------------------------------------------------
>
> Key: PIG-4148
> URL: https://issues.apache.org/jira/browse/PIG-4148
> Project: Pig
> Issue Type: Sub-task
> Components: tez
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
> Fix For: 0.14.0