[ 
https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439904#comment-13439904
 ] 

Dmitriy V. Ryaboy commented on PIG-2888:
----------------------------------------

The current implementation makes two key assumptions that are frequently 
violated in real-life datasets and scripts:

1) The intermediate UDF is cheap to invoke
2) Records come in mostly-grouped order (records with the same key tend to 
follow each other).

When condition 2 is not satisfied, POPartialAgg winds up calling the 
intermediate UDF on all accumulated values so far for a given key, plus a new 
tuple, for every single tuple it sees. This causes a significant performance 
degradation.
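
To make the cost concrete, here is a toy model of that behaviour. It is only 
a sketch, not Pig code: the function names are hypothetical, `sum` stands in 
for an Algebraic intermediate function, and the state is simplified to one 
partial per key. The only thing being compared is how often the intermediate 
function gets invoked.

```python
from collections import defaultdict


def per_tuple_calls(records, intermediate):
    """Re-invoke `intermediate` on a key's accumulated state for every tuple."""
    state = {}
    calls = 0
    for key, value in records:
        # Accumulated partial(s) for this key, plus the new tuple.
        state[key] = [intermediate(state.get(key, []) + [value])]
        calls += 1
    return {k: v[0] for k, v in state.items()}, calls


def batched_calls(records, intermediate):
    """Buffer everything first, then invoke `intermediate` once per key."""
    buffered = defaultdict(list)
    for key, value in records:
        buffered[key].append(value)
    result = {k: intermediate(v) for k, v in buffered.items()}
    return result, len(buffered)


# Keys arrive interleaved (0, 1, 2, 0, 1, 2, ...), i.e. not grouped.
records = [(i % 3, 1) for i in range(9)]
print(per_tuple_calls(records, sum))  # 9 invocations for 9 tuples
print(batched_calls(records, sum))    # 3 invocations, one per key
```

Both variants produce the same per-key result; with an expensive 
intermediate UDF, the difference in invocation count is what dominates.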

Instead, we propose accumulating tuples across the board until a memory 
threshold is reached. Once this threshold is reached, all keys and tuples are 
fed into the intermediate UDF and the results are put into a second-level map 
(presumably having been significantly shrunk by the intermediate UDF). This 
repeats until the second-level map hits its threshold, at which point *it* is 
summarized and its values are replaced with the aggregated ones. If, after 
such a reduction, the memory occupied by the hashmap is still near the 
threshold, the results are returned to the regular MR pipeline.
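
A minimal sketch of the two-level scheme described above, under several 
assumptions that are not part of the proposal: thresholds are counted in 
entries rather than bytes, "near the threshold" is modelled by a hypothetical 
`near` fraction, and the class and method names are made up for illustration. 
The intermediate function must be safe to re-apply to its own output (as 
`sum` is).

```python
from collections import defaultdict


class TwoLevelPartialAgg:
    """Illustrative only; entry counts stand in for memory usage."""

    def __init__(self, intermediate, raw_limit, agg_limit, near=0.8):
        self.intermediate = intermediate
        self.raw_limit = raw_limit    # first-level threshold (raw tuples)
        self.agg_limit = agg_limit    # second-level threshold (partials)
        self.near = near              # hypothetical "still near" fraction
        self.raw = defaultdict(list)
        self.raw_count = 0
        self.agg = defaultdict(list)
        self.agg_count = 0
        self.emitted = []             # stands in for the regular MR pipeline

    def add(self, key, value):
        self.raw[key].append(value)
        self.raw_count += 1
        if self.raw_count >= self.raw_limit:
            self._summarize_raw()
        if self.agg_count >= self.agg_limit:
            self._summarize_agg()
            # If summarizing did not shrink the map enough, hand results on.
            if self.agg_count >= self.near * self.agg_limit:
                self._emit_all()

    def _summarize_raw(self):
        # Feed all raw tuples through the intermediate function, per key.
        for key, values in self.raw.items():
            self.agg[key].append(self.intermediate(values))
            self.agg_count += 1
        self.raw.clear()
        self.raw_count = 0

    def _summarize_agg(self):
        # Replace each key's partials with a single aggregated partial.
        for key, partials in list(self.agg.items()):
            self.agg[key] = [self.intermediate(partials)]
        self.agg_count = len(self.agg)

    def _emit_all(self):
        for key, partials in self.agg.items():
            self.emitted.extend((key, p) for p in partials)
        self.agg.clear()
        self.agg_count = 0

    def finish(self):
        # End of input: flush both levels downstream.
        if self.raw_count:
            self._summarize_raw()
        self._summarize_agg()
        self._emit_all()
        return self.emitted


agg = TwoLevelPartialAgg(sum, raw_limit=10, agg_limit=8)
for i in range(100):
    agg.add(i % 5, i)
print(agg.finish())
```

With few distinct keys, everything collapses in the second-level map and is 
emitted once at the end. With many distinct keys, summarizing barely shrinks 
the map, so partials are passed downstream early and the later stages finish 
the aggregation.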
                
> Improve performance of POPartialAgg
> -----------------------------------
>
>                 Key: PIG-2888
>                 URL: https://issues.apache.org/jira/browse/PIG-2888
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>
> During performance testing, we found that POPartialAgg can cause performance 
> degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't 
> well suited to the operator's assumptions. Changing the implementation to a 
> more flexible hash-based model can provide significant performance 
> improvements.

