[jira] [Commented] (PIG-4249) Size estimation should be done in sampler instead of sample aggregator

Rohini Palaniswamy (JIRA) Mon, 27 Oct 2014 13:29:14 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185745#comment-14185745
 ]


Rohini Palaniswamy commented on PIG-4249:
-----------------------------------------

For a group by + nested foreach followed by an order by on the same key, I 
recently suggested changing the Pig script below

{code}
A = load 'data';
B = group A by id;
C = foreach B {
   sorted = order B by $2, $4, $6, $7;
   generate CustomUDF(sorted) as mybag;
}
D = foreach C generate flatten(mybag);
E = order D by id parallel 1;
{code}

where CustomUDF is not an algebraic or accumulative UDF, to
{code}
A = load 'data';
B = group A by id;
C = foreach B {
   sorted = order B by $2, $4, $6, $7;
   generate group as id, CustomUDF(sorted) as mybag;
}
D = order C by id parallel 1;
E = foreach D generate flatten(mybag);
{code}

This would perform well on MR but very badly on Tez, because the tuple size 
used for auto parallelism is computed in the sample aggregator, which is a 
single task. Moving that computation to the sampler tasks greatly reduces the 
amount of data transferred to the sample aggregator and also gains parallelism, 
since getting the memory size of a tuple or bag is a costly operation.
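
To illustrate the difference, here is a rough toy model in Python. It is not Pig 
or Tez code: the function names, the sampling rule and the size function are all 
made up for illustration. It only shows that when each sampler sizes its own 
samples, the single aggregator receives small (key, size) pairs instead of whole 
records and no longer runs the costly size estimate itself.

```python
# Toy model only: every name here is an illustrative assumption,
# not a Pig or Tez API.

def estimate_size(record):
    """Stand-in for an expensive in-memory size estimate of a tuple or bag."""
    key, bag = record
    return len(str(key)) + sum(len(item) for item in bag)

def sampler_ships_records(partition, every):
    """Sampler forwards whole sampled records; the aggregator sizes them."""
    return [rec for i, rec in enumerate(partition) if i % every == 0]

def sampler_ships_sizes(partition, every):
    """Sampler sizes each sampled record locally and forwards (key, size)."""
    return [(rec[0], estimate_size(rec))
            for i, rec in enumerate(partition) if i % every == 0]

def aggregate_from_records(samples_per_task):
    """Single aggregator task: all size estimation runs here, serially."""
    return sum(estimate_size(rec) for task in samples_per_task for rec in task)

def aggregate_from_sizes(samples_per_task):
    """Single aggregator task: only sums the precomputed sizes."""
    return sum(size for task in samples_per_task for _, size in task)

# Three sampler tasks; each record carries a bag of 1000 short strings.
partitions = [
    [(task * 100 + i, ["x" * 10] * 1000) for i in range(20)]
    for task in range(3)
]

records_shipped = [sampler_ships_records(p, every=5) for p in partitions]
sizes_shipped = [sampler_ships_sizes(p, every=5) for p in partitions]

total_from_records = aggregate_from_records(records_shipped)
total_from_sizes = aggregate_from_sizes(sizes_shipped)
```

Both variants arrive at the same total, but in the second one the aggregator 
receives a pair of numbers per sample rather than a bag of 1000 strings, and the 
calls to estimate_size are spread across the sampler tasks.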

> Size estimation should be done in sampler instead of sample aggregator
> ----------------------------------------------------------------------
>
>                 Key: PIG-4249
>                 URL: https://issues.apache.org/jira/browse/PIG-4249
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.15.0
>
>
> https://reviews.apache.org/r/21302/ comments on Revision 8 - size estimation 
> is done in the sample aggregator for order by to keep it the same as skewed 
> join, but this can have performance implications if the tuple sizes are big.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
