[
https://issues.apache.org/jira/browse/PIG-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185745#comment-14185745
]
Rohini Palaniswamy commented on PIG-4249:
-----------------------------------------
For a group by + nested foreach followed by an order by on the same key, I
recently suggested modifying the Pig script below
{code}
A = load 'data';
B = group A by id;
C = foreach B {
sorted = order A by $2, $4, $6, $7;
generate CustomUDF(sorted) as mybag;
}
D = foreach C generate flatten(mybag);
E = order D by id parallel 1;
{code}
where CustomUDF is neither an algebraic nor an accumulative UDF, to
{code}
A = load 'data';
B = group A by id;
C = foreach B {
sorted = order A by $2, $4, $6, $7;
generate group as id, CustomUDF(sorted) as mybag;
}
D = order C by id parallel 1;
E = foreach D generate flatten(mybag);
{code}
This would perform well on MR, but would perform really badly on Tez because
the tuple size estimation for auto parallelism is done in the sample
aggregator, which is a single task. Moving that to the sampler tasks greatly
reduces the amount of data transferred to the sample aggregator and also gains
parallelism, since getting the memory size of a tuple or bag is a costly
operation.
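
To make the proposal concrete, here is a rough toy model in Python. It is not
Pig or Tez code; the names SampleSummary, estimate_size, sampler_task and
aggregate are hypothetical and only illustrate the idea. Each sampler task
sizes its own sampled tuples and ships only the sort keys plus a byte total and
a count, so the single aggregator task merges small summaries instead of
receiving and measuring the full tuples.

```python
import sys
from dataclasses import dataclass


@dataclass
class SampleSummary:
    """What one sampler task sends to the aggregator (hypothetical shape)."""
    keys: list        # sampled sort keys only, not the full tuples
    total_bytes: int  # sum of the estimated sizes of the sampled tuples
    count: int        # number of sampled tuples


def estimate_size(tup):
    """Crude stand-in for a tuple memory-size estimate (an assumption)."""
    return sys.getsizeof(tup) + sum(sys.getsizeof(field) for field in tup)


def sampler_task(tuples, key_index=0):
    """Runs once per sampler task, so the costly sizing work is parallel."""
    return SampleSummary(
        keys=[t[key_index] for t in tuples],
        total_bytes=sum(estimate_size(t) for t in tuples),
        count=len(tuples),
    )


def aggregate(summaries):
    """Runs in the single aggregator task; it only merges small summaries."""
    total_bytes = sum(s.total_bytes for s in summaries)
    count = sum(s.count for s in summaries)
    keys = sorted(k for s in summaries for k in s.keys)
    avg_tuple_bytes = total_bytes / count if count else 0.0
    return keys, avg_tuple_bytes
```

With this split the aggregator still has the sampled keys and an average tuple
size, which is what an auto parallelism estimate would be derived from, but
large bags such as mybag never leave the sampler tasks.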
> Size estimation should be done in sampler instead of sample aggregator
> ----------------------------------------------------------------------
>
> Key: PIG-4249
> URL: https://issues.apache.org/jira/browse/PIG-4249
> Project: Pig
> Issue Type: Sub-task
> Components: tez
> Reporter: Rohini Palaniswamy
> Fix For: 0.15.0
>
>
> https://reviews.apache.org/r/21302/ comments on Revision 8 - size estimation
> is done in the sample aggregator for order by to keep it the same as skewed
> join, but this can have performance implications if the tuple sizes are big.