[
https://issues.apache.org/jira/browse/PIG-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054282#comment-14054282
]
Rohini Palaniswamy commented on PIG-4049:
-----------------------------------------
Actually, what I said in the earlier comment, that pulling just the data of
task0 is sufficient, is not right. I had forgotten the actual reason why, in
the initial comment, I talked about an input that fetches from tasks in order
instead of custom routing. We do perform the limit on each of the tasks from
task0 to taskn. If the limit is 10K and task0 produces 10K or more records,
that is fine. But if task0 produces fewer than 10K records, then we need to
pull the rest of the records from task1 and the tasks after it. I will go
ahead and create the Tez JIRA for that.
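
The in-order fetch described above can be sketched roughly as follows. This
is only an illustration in Python, not Pig or Tez code: the function name
limit_in_task_order and the representation of task outputs as a list of
iterables are hypothetical, and it assumes the order-by tasks are range
partitioned so that task0 holds the smallest keys, task1 the next range, and
so on.

```python
from itertools import islice


def limit_in_task_order(task_outputs, limit):
    """Collect up to `limit` records, reading task0 first, then task1, ...

    `task_outputs` is a hypothetical stand-in for the per-task sorted
    outputs: a sequence of iterables ordered task0..taskn.
    """
    result = []
    for records in task_outputs:
        remaining = limit - len(result)
        if remaining <= 0:
            # Earlier tasks already satisfied the limit; later tasks
            # are never read.
            break
        # Take only as many records as are still needed from this task.
        result.extend(islice(records, remaining))
    return result
```

With a limit of 5 and task0 producing only 3 records, the remaining 2 come
from task1, and task2 is not read at all.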
> Improve performance of Limit following an Orderby on Tez
> --------------------------------------------------------
>
> Key: PIG-4049
> URL: https://issues.apache.org/jira/browse/PIG-4049
> Project: Pig
> Issue Type: Sub-task
> Components: tez
> Reporter: Rohini Palaniswamy
> Fix For: 0.14.0
>
>
> Better algorithms can be applied to improve performance for limit following
> an order by.
> For example:
> {code}
> A = LOAD '/tmp/data' ...;
> B = ORDER A by $0 parallel 100;
> C = LIMIT B 100;
> {code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)