erikerlandson commented on issue #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#issuecomment-509339589

@rxin the key difference is in the `update` methods. The standard UDAF requires that the aggregator be stored in a `MutableAggregationBuffer`, so a UDAF `update` method always has this basic form:

```scala
def update(buf: MutableAggregationBuffer, input: Row): Unit = {
  val agg = buf.getAs[AggregatorType](0)  // UDT deserializes the aggregator from 'buf'
  agg.update(input)                       // update the state of your aggregation
  buf(0) = agg                            // UDT re-serializes the aggregator back into buf
}
```

The consequence is that it deserializes and re-serializes the actual aggregating structure for every single input row. If your data frame has a million rows, it does ser/de on your aggregator a million times, not just at the end of each data partition.

Compare that with the UDIA (which is driven by `TypedImperativeAggregate`):

```scala
def update(agg: AggregatorType, input: Row): AggregatorType = {
  agg.update(input)  // update the state of your aggregator from the input
  agg                // return the aggregator
}
```

Here there is no ser/de of the aggregator at all while processing input rows (which is as it should be). The `TypedImperativeAggregate` only invokes ser/de on the aggregator when it is collecting partial results across partitions (and at the end, when it is presenting final results into the output data frame).

So, imagine a data frame with 10 partitions and 1 million rows. The UDAF does ser/de on the aggregator a million (plus 10) times, while the UDIA does ser/de only 10 times.
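
To make the counting concrete, here is a minimal sketch that does not use Spark at all. It only mimics the two `update` patterns above and counts how often the aggregator gets serialized and deserialized. Everything in it (`SerDeCount`, `SumAgg`, `serialize`, `deserialize`, `udafStyle`, `udiaStyle`) is a hypothetical stand-in for illustration, and the accounting is simplified: one serialize per partition to initialize the UDAF-style buffer, and one deserialize per partial result at merge time.

```scala
import java.nio.ByteBuffer

object SerDeCount {
  // Hypothetical aggregator: a running sum standing in for a costlier structure.
  final class SumAgg(var total: Long) {
    def update(x: Long): Unit = total += x
    def merge(that: SumAgg): Unit = total += that.total
  }

  // Call counters, reset at the start of each run.
  private var serCalls = 0L
  private var deCalls = 0L

  private def serialize(agg: SumAgg): Array[Byte] = {
    serCalls += 1
    ByteBuffer.allocate(8).putLong(agg.total).array()
  }

  private def deserialize(bytes: Array[Byte]): SumAgg = {
    deCalls += 1
    new SumAgg(ByteBuffer.wrap(bytes).getLong)
  }

  // UDAF-like pattern: the state is kept in serialized form between rows,
  // so every row pays one deserialize and one serialize.
  // Returns (result, serialize calls, deserialize calls).
  def udafStyle(partitions: Seq[Seq[Long]]): (Long, Long, Long) = {
    serCalls = 0L
    deCalls = 0L
    val partials = partitions.map { rows =>
      var buf = serialize(new SumAgg(0L)) // initialize the buffer
      for (x <- rows) {
        val agg = deserialize(buf)
        agg.update(x)
        buf = serialize(agg)
      }
      buf
    }
    val merged = new SumAgg(0L)
    partials.foreach(b => merged.merge(deserialize(b)))
    (merged.total, serCalls, deCalls)
  }

  // UDIA-like pattern: the live object is updated in place, and ser/de
  // happens only when partial results leave the partition.
  def udiaStyle(partitions: Seq[Seq[Long]]): (Long, Long, Long) = {
    serCalls = 0L
    deCalls = 0L
    val partials = partitions.map { rows =>
      val agg = new SumAgg(0L)
      rows.foreach(agg.update)
      serialize(agg) // once per partition
    }
    val merged = new SumAgg(0L)
    partials.foreach(b => merged.merge(deserialize(b)))
    (merged.total, serCalls, deCalls)
  }
}
```

With 10 partitions of 100 rows each (values 0 to 999), `udafStyle` returns `(499500, 1010, 1010)` and `udiaStyle` returns `(499500, 10, 10)`: the same answer, but the first pattern scales its ser/de count with the number of rows, the second only with the number of partitions.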
