[ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439904#comment-13439904 ]

Dmitriy V. Ryaboy commented on PIG-2888:
----------------------------------------

The current implementation makes two key assumptions that are frequently violated in real-life datasets and scripts:

1) The intermediate UDF is cheap to invoke.
2) Records come in mostly-grouped order (records with the same key tend to follow each other).

When condition 2 is not satisfied, POPartialAgg winds up calling the intermediate UDF on all values accumulated so far for a given key, plus the new tuple, for every single tuple it sees. This causes significant performance degradation.

Instead, we propose accumulating tuples across the board until a memory threshold is reached. Once this threshold is reached, all keys and tuples are fed into the intermediate UDF and the results are put into a second-level map (presumably having been significantly shrunk by the intermediate UDF). This repeats until the second-level map hits its threshold, at which point *it* is summarized and its values are replaced with the aggregated ones. If, after such a reduction, the memory occupied by the hashmap is still near the threshold, the results are returned to the regular MR pipeline.

> Improve performance of POPartialAgg
> -----------------------------------
>
>                 Key: PIG-2888
>                 URL: https://issues.apache.org/jira/browse/PIG-2888
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>
> During performance testing, we found that POPartialAgg can cause performance
> degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't
> well suited to the operator's assumptions. Changing the implementation to a
> more flexible hash-based model can provide significant performance
> improvements.
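
The two-level scheme proposed in the comment can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Pig's code: the class name `TwoLevelPartialAgg`, the parameters `raw_limit`, `agg_limit` and `spill_fraction`, and the use of value counts in place of real memory estimates are all hypothetical, and a single `combine` function stands in for the intermediate UDF at both levels.

```python
from collections import defaultdict


class TwoLevelPartialAgg:
    """Toy model of the proposed two-level hash aggregation (not Pig's code)."""

    def __init__(self, combine, raw_limit, agg_limit, spill_fraction=0.8):
        self.combine = combine                # stand-in for the intermediate UDF
        self.raw_limit = raw_limit            # assumed proxy for the first memory threshold
        self.agg_limit = agg_limit            # assumed proxy for the second-level threshold
        self.spill_fraction = spill_fraction  # hypothetical meaning of "still near the threshold"
        self.raw = defaultdict(list)          # level 1: key -> raw values
        self.agg = defaultdict(list)          # level 2: key -> partial aggregates
        self.raw_count = 0
        self.agg_count = 0
        self.emitted = []                     # stand-in for the regular MR pipeline

    def add(self, key, value):
        # Accumulate across all keys; no UDF call per tuple.
        self.raw[key].append(value)
        self.raw_count += 1
        if self.raw_count >= self.raw_limit:
            self._flush_raw()

    def _flush_raw(self):
        # One UDF call per key over everything buffered for that key.
        for key, values in self.raw.items():
            self.agg[key].append(self.combine(values))
            self.agg_count += 1
        self.raw.clear()
        self.raw_count = 0
        if self.agg_count >= self.agg_limit:
            self._compact_agg()

    def _compact_agg(self):
        # Summarize the second-level map, replacing partials with one value per key.
        self.agg = defaultdict(list, {
            key: [self.combine(parts)] if len(parts) > 1 else parts
            for key, parts in self.agg.items()
        })
        self.agg_count = len(self.agg)
        if self.agg_count >= self.spill_fraction * self.agg_limit:
            self._spill()  # the reduction did not free enough room

    def _spill(self):
        # Hand the partial results back to the downstream pipeline.
        self.emitted.extend((k, p) for k, parts in self.agg.items() for p in parts)
        self.agg.clear()
        self.agg_count = 0

    def finish(self):
        # End of input: aggregate whatever is left and emit it.
        if self.raw_count:
            self._flush_raw()
        self._compact_agg()
        self._spill()
        return self.emitted


# Interleaved keys, i.e. the case where assumption 2 does not hold.
agg = TwoLevelPartialAgg(combine=sum, raw_limit=6, agg_limit=4)
for i, key in enumerate(["a", "b", "c"] * 4, start=1):
    agg.add(key, i)
result = agg.finish()
```

In this model the UDF stand-in runs once per key per flush rather than once per incoming tuple, which is the behavior the proposal is after when records do not arrive grouped by key. Values emitted by a spill are still partial aggregates, so downstream stages are assumed to combine them further.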