GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/13018
[SPARK-15240][SQL] Use buffer variables for update/merge expressions
instead of duplicate serialization/deserialization in TungstenAggregate
## What changes were proposed in this pull request?
`TungstenAggregate` currently serializes and deserializes the aggregation
buffer once for each aggregation function. This wastes time on duplicate
serde work for rows that share the same grouping keys.
Instead of deserializing elements from the aggregation buffer, updating the
variables, and then serializing them back, we can keep using the same
variables while the grouping keys stay the same and serialize them back only
when the grouping keys change.
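The idea can be illustrated with a minimal, standalone sketch. It is not the code in this patch and does not use Spark's classes: the class `BufferReuseSketch`, the `serdeOps` counter, and the map standing in for the serialized aggregation buffer are all hypothetical names chosen for illustration. The sketch assumes two aggregate functions (a sum and a count) and compares a per-function round trip with keeping the buffer values in local variables until the grouping key changes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Spark.
class BufferReuseSketch {
    // Counts every simulated serialize/deserialize call.
    static int serdeOps = 0;

    // Stand-in for reading a serialized buffer: [sum, count] for one grouping key.
    static long[] deserialize(Map<String, long[]> store, String key) {
        serdeOps++;
        long[] buf = store.get(key);
        return buf == null ? new long[] {0L, 0L} : buf.clone();
    }

    // Stand-in for writing the buffer back in serialized form.
    static void serialize(Map<String, long[]> store, String key, long[] buf) {
        serdeOps++;
        store.put(key, buf.clone());
    }

    // Each aggregate function does its own deserialize/update/serialize per row.
    static Map<String, long[]> perFunction(String[] keys, long[] values) {
        Map<String, long[]> store = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            long[] buf = deserialize(store, keys[i]);
            buf[0] += values[i];               // sum
            serialize(store, keys[i], buf);
            buf = deserialize(store, keys[i]);
            buf[1] += 1;                       // count
            serialize(store, keys[i], buf);
        }
        return store;
    }

    // Buffer values live in local variables and are written back only
    // when the grouping key changes (and once at the end of the input).
    static Map<String, long[]> reuseVariables(String[] keys, long[] values) {
        Map<String, long[]> store = new HashMap<>();
        String currentKey = null;
        long sum = 0L;
        long count = 0L;
        for (int i = 0; i < keys.length; i++) {
            if (!keys[i].equals(currentKey)) {
                if (currentKey != null) {
                    serialize(store, currentKey, new long[] {sum, count});
                }
                long[] buf = deserialize(store, keys[i]);
                sum = buf[0];
                count = buf[1];
                currentKey = keys[i];
            }
            sum += values[i];
            count += 1;
        }
        if (currentKey != null) {
            serialize(store, currentKey, new long[] {sum, count});
        }
        return store;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "a", "a", "b", "b", "a"};
        long[] values = {1, 2, 3, 10, 20, 4};

        serdeOps = 0;
        perFunction(keys, values);
        System.out.println("per-function serde ops: " + serdeOps);

        serdeOps = 0;
        reuseVariables(keys, values);
        System.out.println("reused-variable serde ops: " + serdeOps);
    }
}
```

Both methods produce the same per-key results; the difference is only in how often the simulated buffer is read and written. The saving in this sketch depends on consecutive rows sharing a grouping key, which is the case the description above targets.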
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 remove-dup-buffer-serialization
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13018.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13018
----
commit ca88247a6aaa7592aded07cd29838601cc956aa2
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-05-09T08:43:02Z
Use buffer variables for update/merge expressions instead of duplicate
serialization.
----