[ https://issues.apache.org/jira/browse/PIG-108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583334#action_12583334 ]

Alan Gates commented on PIG-108:
--------------------------------

Perhaps my test wasn't large enough to show the difference.  The query I ran 
was:

a = load '/user/pig/tests/data/singlefile/studenttab20m';
b = group a by $0;
c = foreach b generate group, COUNT($1);
dump c;

As the name suggests, there are 20m records in the file.  There are 676 
distinct groups in $0.  I ran it on an 8 machine cluster.  The average time 
without your changes was 4m50s, with your changes 4m48s.  

Your change is going to do better as the number of groups increases, so tests 
with larger numbers of distinct groups might show a larger performance 
differential.

Did you do some performance profiling that suggested that this was a bottleneck?

> PigCombine does not use configure method and therefore de-serialize and 
> instantiate objects with every reduce call
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-108
>                 URL: https://issues.apache.org/jira/browse/PIG-108
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.1.0
>            Reporter: Stefan Groschupf
>            Priority: Critical
>             Fix For: 0.1.0
>
>         Attachments: PIG-108-r639015-v1.patch
>
>
> There is significant room for improvement in PigCombine. 
> In each reduce call some objects are deserialized from the jobConf, and 
> the object graph is rebuilt again and again. 
> Hadoop guarantees to call the configure method before a run, so things 
> like inputCount can then be cached as fields. 
> During reduce calls the jobConf will not change, so re-deserializing and 
> re-instantiating all these objects 
> (pigContext, evalPipe, inputCount, oc, finalout, esp, and so on) 
> makes no sense from my point of view.
> Not sure how often PigCombine is used, but fixing this will significantly 
> improve performance.
> Was there any reason to do things like this, or is it just historical? 
> As soon as the test suite is running again, I would be happy to work on a patch 
> if there are no other opinions about that. 
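
For illustration only, here is a minimal Java sketch of the pattern the issue
description proposes. It is not the PigCombine code and it does not use the
Hadoop API. The class names, expensiveSetup, and the String standing in for
the jobConf are all hypothetical. It contrasts rebuilding state on every
reduce call with building it once in a configure-style method and caching it
in a field, and counts how often the setup runs.

```java
public class CombinerSketch {
    /** Counts how often the (hypothetical) expensive setup runs. */
    static int setupCalls = 0;

    /** Stand-in for deserializing pigContext, evalPipe, etc. from the job configuration. */
    static String expensiveSetup(String jobConf) {
        setupCalls++;
        return "pipeline(" + jobConf + ")";
    }

    static long sum(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    /** Rebuilds its state inside every reduce call, as the issue describes. */
    static class PerCallCombiner {
        private final String jobConf;

        PerCallCombiner(String jobConf) {
            this.jobConf = jobConf;
        }

        long reduce(String key, long[] counts) {
            String pipeline = expensiveSetup(jobConf); // repeated once per key
            if (pipeline.isEmpty()) {
                throw new IllegalStateException("setup failed");
            }
            return sum(counts);
        }
    }

    /** Builds its state once in configure and reuses it for every reduce call. */
    static class CachedCombiner {
        private String pipeline;

        void configure(String jobConf) {
            pipeline = expensiveSetup(jobConf); // done once, cached as a field
        }

        long reduce(String key, long[] counts) {
            if (pipeline == null) {
                throw new IllegalStateException("configure was not called");
            }
            return sum(counts);
        }
    }

    /** Returns the number of setup runs when reduce is called once per group. */
    static int runPerCall(int groups) {
        setupCalls = 0;
        PerCallCombiner combiner = new PerCallCombiner("conf");
        for (int g = 0; g < groups; g++) {
            combiner.reduce("group" + g, new long[] {1, 2, 3});
        }
        return setupCalls;
    }

    /** Returns the number of setup runs when configure is called once up front. */
    static int runCached(int groups) {
        setupCalls = 0;
        CachedCombiner combiner = new CachedCombiner();
        combiner.configure("conf");
        for (int g = 0; g < groups; g++) {
            combiner.reduce("group" + g, new long[] {1, 2, 3});
        }
        return setupCalls;
    }

    public static void main(String[] args) {
        System.out.println("per-call setups: " + runPerCall(676));
        System.out.println("cached setups:   " + runCached(676));
    }
}
```

In this sketch the per-call variant runs the setup once per reduce call, so
its overhead grows with the number of distinct keys, while the cached variant
runs it once regardless. That matches the expectation in the comment above
that the change matters more as the number of groups increases.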

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
