Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/993#issuecomment-50247741
Hey @rxin, thanks for the careful review! I think I've addressed most of
your comments. Regarding the GeneratedAggregate code, I'm happy to sit down
and explain in more detail at some point. One thing to note is that it's only
used in a few circumstances at the moment, and it's pretty easy to switch off if
we find it causes problems in the future. That said, I think there is the
possibility of a pretty huge speedup here. Here's an example that demonstrates
how the rewrite happens for a simple query:
`hql("SELECT AVG(key) + 1 FROM src GROUP BY value")`
Partial Aggregation:
```
Initial values: 0,CAST(0, LongType)
Grouping Projection: value#25
Update Expressions: if (IS NOT NULL CAST(key#24, LongType)) (currentCount#30L + 1) else currentCount#30L,Coalesce((CAST(key#24, LongType) + currentSum#31L),currentSum#31L)
Result Projection: value#25,currentCount#30L AS PartialCount#27L,currentSum#31L AS PartialSum#26L
```
The updates compute the new currentSum and currentCount given an update
buffer joined with the input row.
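As a rough illustration of the partial step, here is a minimal sketch in plain Scala. It is not the generated code: `PartialAggSketch`, `PartialBuffer`, `update` and `partialAggregate` are hypothetical names, and ordinary collections stand in for Spark rows, with `Option[Int]` modelling a nullable `key`.

```scala
object PartialAggSketch {
  // Hypothetical stand-in for the aggregation buffer (currentCount, currentSum).
  final case class PartialBuffer(currentCount: Long, currentSum: Long)

  // Mirrors the update expressions: a null key leaves both fields unchanged,
  // a non-null key bumps the count and is added to the sum as a Long.
  def update(buf: PartialBuffer, key: Option[Int]): PartialBuffer = key match {
    case Some(k) => PartialBuffer(buf.currentCount + 1, buf.currentSum + k.toLong)
    case None    => buf
  }

  // Groups rows of (key, value) by `value` and folds each row into its group's buffer,
  // starting every group from the initial values (0, 0).
  def partialAggregate(rows: Seq[(Option[Int], String)]): Map[String, PartialBuffer] =
    rows.foldLeft(Map.empty[String, PartialBuffer]) { case (acc, (key, value)) =>
      val buf = acc.getOrElse(value, PartialBuffer(0L, 0L))
      acc.updated(value, update(buf, key))
    }
}
```

Each map entry corresponds to one output row of the result projection: the grouping value plus its PartialCount and PartialSum.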
Final aggregation:
```
Initial values: CAST(0, LongType),CAST(0, LongType)
Grouping Projection: value#25
Update Expressions: Coalesce((PartialSum#26L + currentSum#28L),currentSum#28L),Coalesce((PartialCount#27L + currentSum#29L),currentSum#29L)
Result Projection: ((CAST(currentSum#28L, DoubleType) / CAST(currentSum#29L, DoubleType)) + 1.0) AS c_0#22
```
The updates sum the partial sums and the partial counts (currentSum#29L
accumulates the counts, despite its name). The result projection divides the
total sum by the total count and adds 1.
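The final step can be sketched the same way in plain Scala. Again this is only an illustration under assumptions: `FinalAggSketch`, `Partial` and `finalAggregate` are hypothetical names, and each input tuple stands for one row produced by the partial aggregation.

```scala
object FinalAggSketch {
  // One partial result row: (value, partialCount, partialSum). Hypothetical alias.
  type Partial = (String, Long, Long)

  def finalAggregate(partials: Seq[Partial]): Map[String, Double] = {
    // Update step: add each partial count and partial sum to the group's running totals.
    val merged = partials.foldLeft(Map.empty[String, (Long, Long)]) {
      case (acc, (value, partialCount, partialSum)) =>
        val (count, sum) = acc.getOrElse(value, (0L, 0L))
        acc.updated(value, (count + partialCount, sum + partialSum))
    }
    // Result projection: cast both totals to Double, divide, then add 1.0.
    merged.map { case (value, (count, sum)) =>
      value -> (sum.toDouble / count.toDouble + 1.0)
    }
  }
}
```

In this sketch a group whose total count is 0 yields NaN; the plan output above does not show how the real operator treats that case.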