[ https://issues.apache.org/jira/browse/PIG-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheolsoo Park updated PIG-4148:
-------------------------------
    Attachment: metric_retention.explain

I am also attaching the explain output of my job. To summarize, it has {{group-by => order-by => store}}, and the {{order-by}} is consistently skewed.

For the above reason, the FindQuantiles UDF builds a biased quantiles list, and two tasks are given about 70% of the total records:
{code}
Total rows: 52548775
Partitions: 0~299
Partition 292: 11552505 (22%)
Partition 299: 25000602 (47%)
{code}

> Tez order-by is often skewed because FindQuantiles UDF is called with small number
> ----------------------------------------------------------------------------------
>
>                 Key: PIG-4148
>                 URL: https://issues.apache.org/jira/browse/PIG-4148
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Cheolsoo Park
>             Fix For: 0.14.0
>
>         Attachments: generate_sample.py, metric_retention.explain, popackage.log, samples_logs.tar.gz
>
>
> In Tez, the FindQuantiles UDF is called with a smaller number of samples than in MR, resulting in skew in the range partitions.
> For example, I have a job that runs sampling with a parallelism of 300. Since each task samples 100 records, the total sample should be 30K. But the FindQuantiles UDF is called with only 300 samples.
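
To illustrate why the sample count matters for range partitioning, here is a minimal Python sketch. It is not Pig's FindQuantiles implementation: the helper names ({{quantile_boundaries}}, {{partition_sizes}}, {{largest_share}}) are hypothetical, the keys are assumed to be uniformly distributed, and the row count is scaled down to 200,000. It only compares how evenly 300 partitions fill up when the cut points come from 300 samples versus 30,000 samples.

```python
import bisect
import random


def quantile_boundaries(samples, num_partitions):
    """Pick num_partitions - 1 cut points at evenly spaced ranks of the sorted sample."""
    ordered = sorted(samples)
    n = len(ordered)
    return [ordered[(i * n) // num_partitions] for i in range(1, num_partitions)]


def partition_sizes(keys, boundaries):
    """Count how many keys fall into each range partition."""
    sizes = [0] * (len(boundaries) + 1)
    for key in keys:
        sizes[bisect.bisect_right(boundaries, key)] += 1
    return sizes


def largest_share(keys, sample_size, num_partitions, rng):
    """Fraction of all keys that lands in the fullest partition."""
    samples = rng.sample(keys, sample_size)
    sizes = partition_sizes(keys, quantile_boundaries(samples, num_partitions))
    return max(sizes) / len(keys)


# Assumption: uniform random keys stand in for the real sort keys.
rng = random.Random(4148)
keys = [rng.random() for _ in range(200_000)]

small = largest_share(keys, 300, 300, rng)     # 1 sample per partition
large = largest_share(keys, 30_000, 300, rng)  # 100 samples per partition
print(f"300 samples:   largest partition holds {small:.2%} of rows")
print(f"30000 samples: largest partition holds {large:.2%} of rows")
```

A perfectly even split would put 1/300 of the rows in each partition. With only one sample per partition, every cut point is a single random draw, so the fullest partition tends to end up well above that share. With 100 samples per partition the cut points average out and the split is much closer to even. The skew reported above is larger than this uniform-key sketch would produce, so it should be read only as an illustration of the sampling effect.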