hvanhovell opened a new pull request #26012: [SPARK-29346] Add Aggregating Accumulator
URL: https://github.com/apache/spark/pull/26012

### What changes were proposed in this pull request?
This PR adds an accumulator that computes a global aggregate over a number of rows. A user can define an arbitrary number of aggregate functions, which are all computed at the same time.

The accumulator uses the standard technique for implementing (interpreted) aggregation in Spark: it uses projections and manual updates for each of the aggregation steps (initialize the buffer, update the buffer with a new input row, merge two buffers, and compute the final result from the buffer). Note that two of the steps (update and merge) use the aggregation buffer as both input and output. A simplified sketch of this lifecycle is included at the end of this description.

Accumulators do not have an explicit point at which they get serialized. A somewhat surprising side effect is that the buffers of a `TypedImperativeAggregate` go over the wire as-is instead of being serialized first. The merging logic of `TypedImperativeAggregate` assumes that the input buffer contains serialized buffers; this assumption is violated by the accumulator's implicit serialization. To get around this, I have added a `mergeBuffersObjects` method to `TypedImperativeAggregate` that merges two unserialized buffers (see the second sketch below).

### Why are the changes needed?
This is the mechanism we are going to use to implement observable metrics.

### Does this PR introduce any user-facing change?
No, not yet.

### How was this patch tested?
Added an `AggregatingAccumulator` test suite.
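To make the four-step lifecycle concrete, here is a minimal plain-Scala sketch built on Spark's public `AccumulatorV2` API. It is not the PR's `AggregatingAccumulator`: it hard-codes a single count-and-sum aggregate instead of evaluating arbitrary aggregate expressions over `InternalRow`s, but it follows the same initialize/update/merge/evaluate shape. The class name is hypothetical.

```scala
import org.apache.spark.util.AccumulatorV2

// Hypothetical, simplified analogue of the PR's AggregatingAccumulator.
// The "aggregation buffer" here is just two Long fields; the real accumulator
// holds a mutable row updated via projections.
class SumCountAccumulator extends AccumulatorV2[Long, (Long, Long)] {
  // initialize step: the buffer starts at its zero values.
  private var count: Long = 0L
  private var sum: Long = 0L

  override def isZero: Boolean = count == 0L && sum == 0L

  override def copy(): SumCountAccumulator = {
    val acc = new SumCountAccumulator
    acc.count = count
    acc.sum = sum
    acc
  }

  override def reset(): Unit = { count = 0L; sum = 0L }

  // update step: fold one input value into the buffer (buffer is both
  // input and output, as noted in the description above).
  override def add(v: Long): Unit = { count += 1; sum += v }

  // merge step: combine a partial buffer from another task into this one.
  // The incoming accumulator was shipped by Java serialization as-is, which
  // is exactly the implicit serialization the PR has to account for with
  // TypedImperativeAggregate buffers.
  override def merge(other: AccumulatorV2[Long, (Long, Long)]): Unit = other match {
    case o: SumCountAccumulator => count += o.count; sum += o.sum
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${getClass.getName} with ${other.getClass.getName}")
  }

  // evaluate step: compute the final result from the buffer.
  override def value: (Long, Long) = (count, sum)
}
```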
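The second sketch illustrates why a separate object-to-object merge path is needed. `TypedAggSketch` below is a simplified stand-in, not Spark's actual `TypedImperativeAggregate` (the real class operates on `InternalRow` buffers and has different signatures); it only contrasts the existing merge path, which expects serialized input, with the `mergeBuffersObjects` path the PR describes, which merges two live buffer objects.

```scala
// Illustrative sketch only; names and signatures are simplified stand-ins
// for Spark's TypedImperativeAggregate.
abstract class TypedAggSketch[T] {
  def createAggregationBuffer(): T
  def merge(buffer: T, input: T): T
  def serialize(buffer: T): Array[Byte]
  def deserialize(bytes: Array[Byte]): T

  // Normal partial-aggregation path: the partial result arrives as
  // serialized bytes, so it must be deserialized before the typed merge.
  final def mergeSerialized(buffer: T, inputBytes: Array[Byte]): T =
    merge(buffer, deserialize(inputBytes))

  // Accumulator path (what the PR enables): Java serialization already
  // shipped the buffer object as-is, so both sides of the merge are live
  // objects and no deserialize step is needed, or even possible.
  final def mergeBuffersObjects(buffer: T, input: T): T =
    merge(buffer, input)
}
```

Defaulting `mergeBuffersObjects` to the typed `merge` keeps existing `TypedImperativeAggregate` implementations working unchanged while giving the accumulator a merge entry point that skips the deserialize step.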
