erikerlandson commented on issue #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#issuecomment-536226337

cc @rxin @cloud-fan @hvanhovell

In this latest push I added a proof-of-concept solution based on adding a `Column` generating method (apply) to `Aggregator[IN, BUF, OUT]`. It has some pros and cons relative to my previous `UserDefinedImperativeAggregator` (UDIA), which is still also in this PR.

pros:
* does not add a new aggregating class
* has comparable efficiency to UDIA (only does ser/de on partition boundaries)
* I have shown it can work with user defined types, as demonstrated in the (temporary) file `tdigest.scala`

cons:
* can only aggregate over a single value in a row, unlike UDAF and UDIA. For example, [this kind of aggregation on multiple columns of the input row](https://github.com/apache/spark/blob/72795a9a1583fc25eb0e7663771f746d4401cb5b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/UDIAQuerySuite.scala#L98) is not possible for an Aggregator based solution.
* Aggregator does not seem to have a concept of specifying whether an aggregation is deterministic or not, it assumes all aggregations are deterministic. This seems wrong to me, and either way is different than how UDAF and UDIA work.
* The processing of input rows is less flexible. For example, if an Aggregator with type IN as Double is declared, it will fail on a column of integer values. This is not necessarily true for UDAF and UDIA, if the input row values are read in the right way. There may be a way to add input casting, but I do not currently see it.

In summary, doing this with enhancements to Aggregator is definitely feasible, however I do not think it can provide total feature parity with UDAF or UDIA.
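
To make the "only does ser/de on partition boundaries" point concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is an illustration under stated assumptions, not the code in this PR: `SimpleAggregator`, `MeanBuf`, `MeanAgg`, and `SerdeDemo` are hypothetical names, the trait only mirrors the shape of the `zero`/`reduce`/`merge`/`finish` contract, and the tuple conversion is a stand-in for whatever encoder Spark would actually use.

```scala
// Simplified stand-in for the Aggregator[IN, BUF, OUT] contract (hypothetical, not Spark's class).
trait SimpleAggregator[IN, BUF, OUT] {
  def zero: BUF
  def reduce(b: BUF, a: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(b: BUF): OUT
}

// Hypothetical aggregation buffer: running sum and count for a mean.
final case class MeanBuf(sum: Double, count: Long)

object MeanAgg extends SimpleAggregator[Double, MeanBuf, Double] {
  def zero: MeanBuf = MeanBuf(0.0, 0L)
  def reduce(b: MeanBuf, a: Double): MeanBuf = MeanBuf(b.sum + a, b.count + 1)
  def merge(b1: MeanBuf, b2: MeanBuf): MeanBuf =
    MeanBuf(b1.sum + b2.sum, b1.count + b2.count)
  def finish(b: MeanBuf): Double =
    if (b.count == 0) Double.NaN else b.sum / b.count
}

object SerdeDemo {
  // Counters standing in for the cost of encoding and decoding a buffer.
  var serCount = 0
  var deCount = 0

  def serialize(b: MeanBuf): (Double, Long) = { serCount += 1; (b.sum, b.count) }
  def deserialize(r: (Double, Long)): MeanBuf = { deCount += 1; MeanBuf(r._1, r._2) }

  // Each inner Seq models one partition. Rows are folded into the live buffer
  // object, and the buffer is serialized once per partition, not once per row.
  def run(partitions: Seq[Seq[Double]]): Double = {
    val shipped = partitions.map { rows =>
      serialize(rows.foldLeft(MeanAgg.zero)(MeanAgg.reduce))
    }
    val merged = shipped.map(deserialize).foldLeft(MeanAgg.zero)(MeanAgg.merge)
    MeanAgg.finish(merged)
  }
}
```

Calling `SerdeDemo.run(Seq(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0)))` folds five rows but increments each counter only twice, once per partition. Note that `reduce` here takes a single `Double`, which is the same shape as the single-input-value limitation listed under cons.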
