[ 
https://issues.apache.org/jira/browse/PIG-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Woodruff updated PIG-3649:
---------------------------------

    Attachment: PIG-3469.patch

Attaching a patch that updates {{aggregate()}} to count the number of result tuples.

Includes two new tests:
- One that shows that basic aggregation of multiple columns works
- Another that reproduces the issue reported here. This requires aggregating 
more than 10,000 rows, so it is a bit slow. Suggestions for alternative 
approaches are welcome.



> POPartialAgg incorrectly calculates size reduction when multiple values 
> aggregated
> ----------------------------------------------------------------------------------
>
>                 Key: PIG-3649
>                 URL: https://issues.apache.org/jira/browse/PIG-3649
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11, 0.12.0, 0.11.1
>            Reporter: Travis Woodruff
>         Attachments: PIG-3469.patch
>
>
> {{POPartialAgg.aggregate()}} counts the number of output columns 
> ({{valueTuple.size() - 1}}), but {{checkSizeReduction()}} compares this to 
> the number of input tuples. 
> When multiple columns are aggregated, the reduction factor is therefore 
> calculated as too high by a factor of the number of columns being aggregated, 
> which disables in-memory aggregation when it should stay enabled, 
> adversely affecting performance.
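
To make the arithmetic in the description concrete, here is a minimal Java sketch. It is not Pig's actual {{POPartialAgg}} code: the class name, the method names, the sample numbers, and the 0.1 threshold are all hypothetical, and the "reduction factor" is modelled as output count divided by input tuples purely for illustration.

```java
// Sketch only: models the counting described in the issue, not Pig's real code.
class SizeReductionSketch {
    // Hypothetical threshold: keep in-memory aggregation only while the
    // counted output is at most this fraction of the input tuples.
    static final double MAX_OUTPUT_FRACTION = 0.1;

    // Reported behavior: every result tuple adds its number of aggregated
    // columns (valueTuple.size() - 1) to the output count.
    static int countOutputColumns(int resultTuples, int aggregatedColumns) {
        return resultTuples * aggregatedColumns;
    }

    // Behavior described for the patch: every result tuple counts once.
    static int countResultTuples(int resultTuples) {
        return resultTuples;
    }

    // True if aggregation stays enabled for the given counts.
    static boolean keepsAggregation(int inputTuples, int countedOutput) {
        return (double) countedOutput / inputTuples <= MAX_OUTPUT_FRACTION;
    }

    public static void main(String[] args) {
        int inputTuples = 10_000;   // sample size, made up for this example
        int resultTuples = 500;     // distinct keys after aggregation
        int aggregatedColumns = 3;  // e.g. three aggregate expressions

        int byColumns = countOutputColumns(resultTuples, aggregatedColumns);
        int byTuples = countResultTuples(resultTuples);

        System.out.println("counted by columns: " + byColumns
                + ", keeps aggregation: " + keepsAggregation(inputTuples, byColumns));
        System.out.println("counted by tuples:  " + byTuples
                + ", keeps aggregation: " + keepsAggregation(inputTuples, byTuples));
    }
}
```

With these made-up numbers the column-based count is three times the tuple-based count, which is enough to push the same data over the threshold and switch aggregation off.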



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
