[ 
https://issues.apache.org/jira/browse/PIG-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053217#comment-14053217
 ] 

Rohini Palaniswamy commented on PIG-4049:
-----------------------------------------

The plan will have 

Vertex 1 - Load and sample
Vertex 2 - sample aggregation
Vertex 3 - Range Partitioner
Vertex 4 - Order by reducer with limit on each task
Vertex 5 - Limit task with parallelism one

In the current implementation Vertex 4 (Orderby) writes key, values with key as 
(taskindex, recordindex) and value as the orderby value. Vertex 5 does shuffle 
merge of all task inputs from Vertex 4 to keep it in sorted order and then 
applies the limit.

Two optimizations can be done based on adding some features to Tez itself.
  1) Adding a limit feature to OnFileSortedOutput. If that was available, then 
we can limit records in Vertex 3 also before writing to each part file after 
WeightedRangePartitioner is applied.
  2) For vertex 5, we need a input that fetches from source tasks in order and 
reads them in order. Since Vertex 4 output is range partitioned, data produced 
is ordered from task0,task1...taskn and so can be consumed without shuffle and 
sort. If the limit is hit early it can skip fetching more task inputs.  This 
can avoid the sort and shuffle and fetching of all inputs in vertex 5 now.

> Improve performance of Limit following an Orderby on Tez
> --------------------------------------------------------
>
>                 Key: PIG-4049
>                 URL: https://issues.apache.org/jira/browse/PIG-4049
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.14.0
>
>
> Better algorithms can be applied to improve performance for limit following 
> an order by.
> For eg:
> {code}
> A = LOAD '/tmp/data' ...;
> B = ORDER A by $0 parallel 100;
> C = LIMIT B 100;
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to