[ 
https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442613#comment-13442613
 ] 

Dmitriy V. Ryaboy commented on PIG-2888:
----------------------------------------

None of the PigMix queries hit the particular bad behavior this patch is meant 
to address. I've verified that, for those "good" use cases, the speed is on par 
with the previous implementation.

Here is a script that finishes in 57 seconds with this patch and takes 
13 minutes 48 seconds without it:

{code}
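-- remove output from any previous run
rmf tmp/delme
l = load 'data.txt';
-- tag each row with a random integer in [0, 10000)
x = foreach l generate $0 as l, (int) (RANDOM() * 10000) as num;
-- 100 groups, each with a nested distinct followed by an algebraic SUM
g = foreach (group x by num % 100) { d = distinct x.num; generate SUM(d); }
store g into 'tmp/delme';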
{code}

The data file contains about 7 million rows of one letter each. 
This is an intentionally skewed example, but we've encountered similar problems 
with real data, particularly when grouping by high-cardinality columns like 
user_id and then applying algebraic operations to nested distincts.
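
To make the shape of this workload concrete, here is a rough Python sketch of what the script computes when the distinct is partially evaluated per input split and then merged. It is only an illustration under stated assumptions, not Pig's code and not the patch: the names partial_distinct and merge_and_sum are made up, and a dict of sets stands in for whatever structure POPartialAgg actually uses.

```python
import random
from collections import defaultdict


def partial_distinct(nums, buckets=100):
    """Map-side step (hypothetical): hash each value into its group
    (num % buckets) and keep only the distinct values seen per group."""
    partial = defaultdict(set)
    for num in nums:
        partial[num % buckets].add(num)
    return partial


def merge_and_sum(partials):
    """Reduce-side step (hypothetical): union the per-split distinct sets
    for each group, then apply SUM to the merged set."""
    merged = defaultdict(set)
    for partial in partials:
        for key, values in partial.items():
            merged[key] |= values
    return {key: sum(values) for key, values in merged.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for (int) (RANDOM() * 10000) applied to each input row.
    nums = [int(rng.random() * 10000) for _ in range(100_000)]
    # Pretend the input arrives as four splits.
    splits = [nums[i::4] for i in range(4)]
    result = merge_and_sum(partial_distinct(s) for s in splits)
    print(len(result), "groups")
```

The intermediate value per group here is a set rather than a single number, so it does not collapse the way a plain SUM or COUNT would. Operators like that are the ones the "good" PigMix cases do not exercise.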
                
> Improve performance of POPartialAgg
> -----------------------------------
>
>                 Key: PIG-2888
>                 URL: https://issues.apache.org/jira/browse/PIG-2888
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: partialagg_patch_1.patch, partialagg_patch_2.patch, 
> partialagg_patch_3.patch
>
>
> During performance testing, we found that POPartialAgg can cause performance 
> degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't 
> well suited to the operator's assumptions. Changing the implementation to a 
> more flexible hash-based model can provide significant performance 
> improvements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
