erikerlandson commented on issue #25024: [SPARK-27296][SQL] User Defined 
Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#issuecomment-509357118
 
 
   With respect to "fixing UDAF" (instead of creating a new UDIA), I have 
convinced myself there is no path there, but here is where I went with that: as 
described above, the basic pattern of a UDAF `update` method is:
   ```scala
   def update(buf: MutableAggregationBuffer, input: Row): Unit = {
     val agg = buf.getAs[AggregatorType](0)  // UDT deserializes the aggregator from 'buf'
     agg.update(input)    // update the state of your aggregation
     buf(0) = agg    // UDT re-serializes the aggregator back into buf
   }
   ```
   So, the problem arises out of the UDT, which does ser/de.  In theory, IF you 
could just store `agg` directly into the `buf`, as a raw object reference, then 
this would not require any actual ser/de, and the UDAF would almost certainly 
be efficient.
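   
   To make the cost concrete, here is a toy model in plain Scala with no Spark dependency. It is only a sketch: `SumAgg`, `SerdeCost`, and the 8-byte encoding are hypothetical stand-ins for a real aggregator and its UDT, and the counters exist purely to show how often each path pays for ser/de.
   ```scala
   import java.nio.ByteBuffer

   // Hypothetical aggregator state; stands in for AggregatorType above.
   final class SumAgg(var total: Long) {
     def update(x: Long): Unit = total += x
   }

   object SerdeCost {
     // Counters that record how many ser/de calls each strategy makes.
     var serCalls = 0
     var deCalls = 0

     // Stand-in for the UDT's serialize step: encode the state as 8 bytes.
     def serialize(agg: SumAgg): Array[Byte] = {
       serCalls += 1
       ByteBuffer.allocate(8).putLong(agg.total).array()
     }

     // Stand-in for the UDT's deserialize step: rebuild the object from bytes.
     def deserialize(bytes: Array[Byte]): SumAgg = {
       deCalls += 1
       new SumAgg(ByteBuffer.wrap(bytes).getLong())
     }

     // UDAF/UDT-style: the buffer only holds the encoded form, so every
     // input row pays one deserialize and one serialize.
     def viaEncodedBuffer(rows: Seq[Long]): Long = {
       var buf = serialize(new SumAgg(0L))
       for (x <- rows) {
         val agg = deserialize(buf)
         agg.update(x)
         buf = serialize(agg)
       }
       deserialize(buf).total
     }

     // Raw-reference style: the object stays live for the whole pass and
     // no ser/de happens while rows are being consumed.
     def viaObjectReference(rows: Seq[Long]): Long = {
       val agg = new SumAgg(0L)
       rows.foreach(agg.update)
       agg.total
     }
   }
   ```
   Both methods return the same sum. The difference is that `viaEncodedBuffer` grows `serCalls` and `deCalls` linearly with the number of rows, while `viaObjectReference` leaves them untouched.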
   
   However, if you try this trick (and I did), Spark will crash with an 
"unrecognized data type" exception, because it only allows defined subclasses 
of `DataType` to be stored in `Row`s. It also allows UDTs, but of course these 
are required to encode the user's custom type in terms of defined subclasses of 
`DataType`. 
   
   I do not think Spark/Catalyst can be made to cope with raw object references 
in Row objects, as it needs to know how to operate on whatever objects live in 
Rows. It requires a "closed universe" of possible DataTypes. Even if you 
disabled the enforcement and allowed arbitrary object references in Rows, it 
would break Spark.
   
   As an aside, Spark arguably _already_ has two parallel aggregator 
interfaces: UDAF and TypedImperativeAggregate (which is what all the predefined 
aggregators use). This PR is exposing that second one to users.
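   
   For readers who have not looked at that second interface, its general shape can be sketched in plain Scala. This is a simplified stand-in, not Spark's actual class: the method names mirror `TypedImperativeAggregate`, but the input is a plain `Long` rather than an `InternalRow`, and `ToyImperativeAggregate`, `ToyMean`, and `ToyDriver` are hypothetical names invented for illustration. The point is that `update` works on a live object, and, as I understand it, `serialize`/`deserialize` are only needed when a buffer has to leave the task (the toy driver models that as once per partition).
   ```scala
   import java.nio.ByteBuffer

   // Toy contract: same method names as TypedImperativeAggregate, simplified types.
   trait ToyImperativeAggregate[T] {
     def createAggregationBuffer(): T
     def update(buffer: T, input: Long): T
     def merge(buffer: T, other: T): T
     def eval(buffer: T): Any
     def serialize(buffer: T): Array[Byte]
     def deserialize(bytes: Array[Byte]): T
   }

   // Hypothetical mutable state for a running mean.
   final class MeanState(var sum: Long, var count: Long)

   object ToyMean extends ToyImperativeAggregate[MeanState] {
     def createAggregationBuffer(): MeanState = new MeanState(0L, 0L)
     // Mutates the live object in place; no encoding involved.
     def update(b: MeanState, x: Long): MeanState = { b.sum += x; b.count += 1; b }
     def merge(b: MeanState, o: MeanState): MeanState = {
       b.sum += o.sum; b.count += o.count; b
     }
     def eval(b: MeanState): Any =
       if (b.count == 0) Double.NaN else b.sum.toDouble / b.count
     def serialize(b: MeanState): Array[Byte] =
       ByteBuffer.allocate(16).putLong(b.sum).putLong(b.count).array()
     def deserialize(bytes: Array[Byte]): MeanState = {
       val bb = ByteBuffer.wrap(bytes)
       val sum = bb.getLong()
       val count = bb.getLong()
       new MeanState(sum, count)
     }
   }

   object ToyDriver {
     // One partition: update per row on the live buffer, serialize once at the end.
     def runPartition[T](agg: ToyImperativeAggregate[T], rows: Seq[Long]): Array[Byte] = {
       val buf = rows.foldLeft(agg.createAggregationBuffer())(agg.update)
       agg.serialize(buf)
     }

     // Final stage: deserialize each partial buffer once, merge, then evaluate.
     def mergePartitions[T](agg: ToyImperativeAggregate[T], parts: Seq[Array[Byte]]): Any = {
       val merged = parts.map(agg.deserialize)
         .foldLeft(agg.createAggregationBuffer())(agg.merge)
       agg.eval(merged)
     }
   }
   ```
   Usage would look like `ToyDriver.mergePartitions(ToyMean, Seq(ToyDriver.runPartition(ToyMean, Seq(1L, 2L)), ToyDriver.runPartition(ToyMean, Seq(3L, 4L))))`, which does one serialize and one deserialize per partition regardless of row count.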

