erikerlandson commented on issue #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#issuecomment-545054005

@rdblue @cloud-fan, I hesitate to propose this because it would be a fourth iteration on this PR, but: if we are willing to alter the class signature of `Aggregator` for the 3.0 release, there is an opportunity to simplify things:

```scala
// note: IN is no longer contravariant
// potentially, all encoder info becomes implicit
abstract class Aggregator[IN, BUF, OUT] extends Serializable {
  def zero: BUF
  def reduce(b: BUF, a: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(reduction: BUF): OUT

  // untyped aggregator
  def apply(exprs: Column*)(implicit eIN: Encoder[IN], eBUF: Encoder[BUF], eOUT: Encoder[OUT]): Column = ???

  // typed aggregator
  def toColumn(implicit eBUF: Encoder[BUF], eOUT: Encoder[OUT]): TypedColumn[IN, OUT] = ???
}
```

The above should allow us to get rid of the intermediate `UserDefinedAggregator`. It could be combined with a modernizing refactor that moves the `Encoder` implicits into the `Encoder` companion object.

Another option would be to default `bufferEncoder` and `outputEncoder` to implicits, and allow them to be overridden for cases that implicits might not cover (like TDigest). This might be safer from an "escape hatch" perspective.
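To make the ergonomics concrete, here is a hypothetical sketch of what a user-defined aggregator would look like if the proposed signature were adopted. `LongSum` is an illustrative name, not anything in the PR; the point is that the subclass only implements the four reduction methods, while encoder resolution moves to the call site:

```scala
// Hypothetical example, assuming the proposed Aggregator signature above.
// The user no longer overrides bufferEncoder/outputEncoder; the implicit
// Encoder instances are resolved where the aggregator is applied.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: Long): Long = b + a
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
}

// Untyped use on a DataFrame (encoders supplied implicitly at the call site):
//   df.agg(LongSum($"value"))
// Typed use on a Dataset[Long]:
//   ds.select(LongSum.toColumn)
```

For an aggregator like TDigest, whose buffer type has no implicit `Encoder`, this shape would force the encoder to be supplied (or imported) at every call site, which is what motivates the alternative of keeping overridable `bufferEncoder`/`outputEncoder` defaults.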
