[
https://issues.apache.org/jira/browse/PIG-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531440#comment-14531440
]
Rohini Palaniswamy edited comment on PIG-4536 at 5/6/15 9:21 PM:
-----------------------------------------------------------------
When using the Combiner for LIMIT in a nested foreach, we should also consider
the case of ORDER BY followed by LIMIT.
{code}
group_result = FOREACH data_group {
    B = ORDER A BY f3 ASC;
    C = LIMIT B 1;
    GENERATE group, C.f3;
};
{code}
In this case the Combiner should sort before applying the limit. That can be
built upon PIG-4449, which will support pushing the limit into a sorted bag.
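
To illustrate why sort-then-limit is safe to run in a combiner, here is a
minimal sketch in plain Python. It is not Pig's implementation; the function
names and the sample f3 values are hypothetical. Each partial bag is trimmed to
its k smallest values, and the reduce step merges the trimmed bags and trims
once more.

```python
import heapq
from itertools import chain

def combine_sorted_limit(values, k):
    """Combiner step: keep only the k smallest values of one partial bag."""
    return heapq.nsmallest(k, values)

def reduce_sorted_limit(partials, k):
    """Reduce step: merge the already-trimmed partial bags and trim once more."""
    return heapq.nsmallest(k, chain.from_iterable(partials))

# Hypothetical f3 values for one group key, split across three map tasks.
splits = [[7, 3, 9], [4, 8], [5, 1, 6]]
k = 1

partials = [combine_sorted_limit(s, k) for s in splits]
result = reduce_sorted_limit(partials, k)

# Reference answer: sort the whole bag on the reducer, then limit.
expected = sorted(chain.from_iterable(splits))[:k]
```

Because the k smallest values overall must each be among the k smallest of
their own partial bag, the reducer only ever sees at most k values per
partial instead of the full bag.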
> LIMIT inside nested foreach should have combiner optimization
> -------------------------------------------------------------
>
> Key: PIG-4536
> URL: https://issues.apache.org/jira/browse/PIG-4536
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
> Labels: Performance
>
> {code}
> data_group = GROUP A BY (f1, f2) PARALLEL 100;
> group_result = FOREACH data_group {
> B = LIMIT A.f3 1;
> GENERATE group, SUM(A.f3), SUM(A.f4), SUM(A.f5), SUM(A.f6), FLATTEN(B);
> };
> {code}
> A script like this has combiner optimization turned off and so consumes a lot
> of memory and is slow. We should implement LIMIT using Combiner in cases like
> this.
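
As a rough sketch of what the requested optimization could look like for the
script above, the plain-Python example below carries a partial SUM together
with a bag capped at the limit through a combiner-style step. It is only an
illustration under assumed names (`combine`, `reduce_partials`) and made-up f3
values, not Pig code.

```python
def combine(rows, limit):
    """Combiner step: fold one partial bag into (partial sum, at most `limit` values)."""
    total, kept = 0, []
    for f3 in rows:
        total += f3
        if len(kept) < limit:
            kept.append(f3)
    return total, kept

def reduce_partials(partials, limit):
    """Reduce step: add the partial sums and keep at most `limit` values overall."""
    total, kept = 0, []
    for part_total, part_kept in partials:
        total += part_total
        kept.extend(part_kept[: limit - len(kept)])
    return total, kept

# Hypothetical f3 values for one (f1, f2) group, split across three map tasks.
splits = [[7, 3, 9], [4, 8], [5, 1, 6]]
limit = 1

partials = [combine(s, limit) for s in splits]
total, kept = reduce_partials(partials, limit)
```

An unordered LIMIT may return any row of the group, so which value ends up in
`kept` depends on the order the partials arrive in. The point of the sketch is
that each partial holds one number plus at most `limit` values, rather than
the whole bag.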