[
https://issues.apache.org/jira/browse/PIG-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053217#comment-14053217
]
Rohini Palaniswamy commented on PIG-4049:
-----------------------------------------
The plan will have
Vertex 1 - Load and sample
Vertex 2 - sample aggregation
Vertex 3 - Range Partitioner
Vertex 4 - Order by reducer with limit on each task
Vertex 5 - Limit task with parallelism one
In the current implementation Vertex 4 (Orderby) writes key, values with key as
(taskindex, recordindex) and value as the orderby value. Vertex 5 does shuffle
merge of all task inputs from Vertex 4 to keep it in sorted order and then
applies the limit.
Two optimizations can be done based on adding some features to Tez itself.
1) Adding a limit feature to OnFileSortedOutput. If that was available, then
we can limit records in Vertex 3 also before writing to each part file after
WeightedRangePartitioner is applied.
2) For vertex 5, we need a input that fetches from source tasks in order and
reads them in order. Since Vertex 4 output is range partitioned, data produced
is ordered from task0,task1...taskn and so can be consumed without shuffle and
sort. If the limit is hit early it can skip fetching more task inputs. This
can avoid the sort and shuffle and fetching of all inputs in vertex 5 now.
> Improve performance of Limit following an Orderby on Tez
> --------------------------------------------------------
>
> Key: PIG-4049
> URL: https://issues.apache.org/jira/browse/PIG-4049
> Project: Pig
> Issue Type: Sub-task
> Components: tez
> Reporter: Rohini Palaniswamy
> Fix For: 0.14.0
>
>
> Better algorithms can be applied to improve performance for limit following
> an order by.
> For eg:
> {code}
> A = LOAD '/tmp/data' ...;
> B = ORDER A by $0 parallel 100;
> C = LIMIT B 100;
> {code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)