erikerlandson commented on issue #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#issuecomment-509357118

With respect to "fixing UDAF" (instead of creating a new UDIA), I have convinced myself there is no path there, but here is where I went with that:

As described above, the basic pattern of a UDAF `update` method is:

```scala
def update(buf: MutableAggregationBuffer, input: Row): Unit = {
  val agg = buf.getAs[AggregatorType](0)  // UDT deserializes the aggregator from 'buf'
  agg.update(input)                       // update the state of your aggregation
  buf(0) = agg                            // UDT re-serializes the aggregator back into buf
}
```

So, the problem arises out of the UDT, which does ser/de. In theory, IF you could just store `agg` directly into the `buf`, as a raw object reference, then this would not require any actual ser/de, and the UDAF would almost certainly be efficient.

However, if you try this trick (and I did), Spark will crash with an "unrecognized data type" exception, because it only allows defined subclasses of `DataType` to be stored in `Row`s. It also allows UDTs, but of course these are required to encode the user's custom type in terms of defined subclasses of `DataType`.

I do not think Spark/Catalyst can be made to cope with raw object references in Row objects, as it needs to know how to operate on whatever objects live in Rows. It requires a "closed universe" of possible DataTypes. Even if you disabled the enforcement and allowed arbitrary object references in Rows, it would break Spark.

As an aside, Spark arguably _already_ has two parallel aggregator interfaces: UDAF and TypedImperativeAggregate (which is what all the predefined aggregators use). This PR is exposing that second one to users.
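
To make the per-row ser/de cost described above concrete, here is a minimal toy model in plain Scala. It is only a sketch and does not use Spark: `MeanAgg`, `Codec`, `udafStyle` and `refStyle` are hypothetical names invented for illustration, and `Codec` merely stands in for the role a UDT plays. The first function mimics the UDAF `update` pattern, where the buffer slot may only hold an encoded form; the second keeps a live object reference and encodes once at the end, which is roughly the idea behind the TypedImperativeAggregate style.

```scala
import java.nio.ByteBuffer

// Toy model only: none of these names are Spark APIs.
object SerDeSketch {
  // Hypothetical aggregator state: a running sum and count.
  final class MeanAgg(var sum: Double = 0.0, var count: Long = 0L) {
    def update(x: Double): Unit = { sum += x; count += 1 }
  }

  // Stand-in for a UDT: encodes MeanAgg as bytes, a type the "buffer" accepts.
  // The counters record how often each direction is invoked.
  object Codec {
    var serCalls = 0
    var deCalls = 0
    def reset(): Unit = { serCalls = 0; deCalls = 0 }
    def serialize(a: MeanAgg): Array[Byte] = {
      serCalls += 1
      ByteBuffer.allocate(16).putDouble(a.sum).putLong(a.count).array()
    }
    def deserialize(bytes: Array[Byte]): MeanAgg = {
      deCalls += 1
      val bb = ByteBuffer.wrap(bytes)
      new MeanAgg(bb.getDouble(), bb.getLong())
    }
  }

  // UDAF-style: the buffer slot holds only the encoded form,
  // so every input row pays one decode and one encode.
  def udafStyle(inputs: Seq[Double]): MeanAgg = {
    var slot: Array[Byte] = Codec.serialize(new MeanAgg())
    inputs.foreach { x =>
      val agg = Codec.deserialize(slot) // decode on every row
      agg.update(x)
      slot = Codec.serialize(agg)       // re-encode on every row
    }
    Codec.deserialize(slot)
  }

  // Object-reference style: the state stays live across rows
  // and is encoded once, after the last row.
  def refStyle(inputs: Seq[Double]): Array[Byte] = {
    val agg = new MeanAgg()
    inputs.foreach(x => agg.update(x))
    Codec.serialize(agg)
  }
}
```

In this model the number of codec calls in `udafStyle` grows linearly with the number of input rows, while `refStyle` makes a single `serialize` call regardless of input size. Real Spark buffers and UDTs are more involved, so treat the counts as an illustration of the shape of the cost, not a measurement.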
