[ 
https://issues.apache.org/jira/browse/PIG-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4148:
-------------------------------
    Attachment: metric_retention.explain

I am also attaching the explain output of my job. To summarize, it has 
{{group-by => order-by => store}}, and the {{order-by}} is consistently skewed. 
Because so few samples reach it, the FindQuantiles UDF builds a biased 
quantiles list, and two of the 300 tasks receive about 70% of the total records:
{code}
Total rows: 52548775
Partitions: 0~299
Partition 292: 11552505 (22%)
Partition 299: 25000602 (47%)
{code}
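
For reference, the shares can be recomputed from the counts above. This is only 
a small arithmetic sketch in Python; the variable names are made up for 
illustration and nothing here comes from Pig itself.

```python
# Row counts copied from the report above.
total_rows = 52548775
num_partitions = 300  # partitions 0~299
heavy_partitions = {292: 11552505, 299: 25000602}

# Fraction of all rows that landed in the two heavy partitions.
combined_share = sum(heavy_partitions.values()) / total_rows

# Fraction two partitions would hold if rows were spread evenly.
even_share_for_two = 2 / num_partitions

for partition, rows in heavy_partitions.items():
    print(f"partition {partition}: {rows / total_rows:.1%} of rows")
print(f"combined: {combined_share:.1%} "
      f"(an even split would give {even_share_for_two:.2%})")
```

With an even split, two partitions out of 300 would hold well under 1% of the 
rows, so the two tasks above carry roughly a hundred times their fair share.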

> Tez order-by is often skewed because FindQuantiles UDF is called with small 
> number
> ----------------------------------------------------------------------------------
>
>                 Key: PIG-4148
>                 URL: https://issues.apache.org/jira/browse/PIG-4148
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Cheolsoo Park
>             Fix For: 0.14.0
>
>         Attachments: generate_sample.py, metric_retention.explain, 
> popackage.log, samples_logs.tar.gz
>
>
> In Tez, FindQuantiles UDF is called with a smaller number of samples than MR 
> resulting in skew in range partitions.
> For example, I have a job that runs sampling with a parallelism of 300. Since 
> each task samples 100 records, the total sample should be 30K. But 
> FindQuantiles UDF is called with only 300 samples.
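
To illustrate why the sample size matters, here is a rough Python simulation. 
It is not the FindQuantiles implementation: the boundary-picking rule, the 
uniform random keys, and all names are assumptions chosen for illustration, 
and it does not try to reproduce the numbers from my job. It only compares how 
evenly range boundaries split the data when they are derived from 300 samples 
versus the expected 300 * 100 = 30K samples.

```python
import bisect
import random

PARALLELISM = 300        # sampling tasks and range partitions, as in the report
SAMPLES_PER_TASK = 100   # records sampled by each task, as in the report


def max_partition_share(num_samples, num_partitions, rows, rng):
    """Pick range boundaries from a sorted sample, then return the share of
    rows that falls into the largest resulting partition."""
    sample = sorted(rng.sample(rows, num_samples))
    step = num_samples / num_partitions
    # One boundary between each pair of adjacent partitions.
    boundaries = [sample[int(i * step)] for i in range(1, num_partitions)]
    counts = [0] * num_partitions
    for row in rows:
        counts[bisect.bisect_right(boundaries, row)] += 1
    return max(counts) / len(rows)


rng = random.Random(0)
rows = [rng.random() for _ in range(200_000)]  # hypothetical uniform sort keys

expected_samples = PARALLELISM * SAMPLES_PER_TASK
small = max_partition_share(PARALLELISM, PARALLELISM, rows, rng)
large = max_partition_share(expected_samples, PARALLELISM, rows, rng)

print(f"ideal share per partition: {1 / PARALLELISM:.4f}")
print(f"largest share with {PARALLELISM} samples: {small:.4f}")
print(f"largest share with {expected_samples} samples: {large:.4f}")
```

Even on uniform keys, boundaries taken from only 300 samples leave the largest 
partition several times bigger than the ideal, while 30K samples keep it close 
to even. Skewed or heavily duplicated keys would presumably make this worse.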



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
