erikerlandson commented on issue #25024: [SPARK-27296][SQL] User Defined 
Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#issuecomment-536226337
 
 
   cc @rxin @cloud-fan @hvanhovell 
   In this latest push I added a proof-of-concept solution that adds a 
`Column`-generating method (`apply`) to `Aggregator[IN, BUF, OUT]`.  It has some 
pros and cons relative to my previous `UserDefinedImperativeAggregator` (UDIA), 
which is also still in this PR.
   
   pros:
   * does not add a new aggregating class
   * has comparable efficiency to UDIA (only does ser/de on partition 
boundaries)
   * I have shown it can work with user defined types, as demonstrated in the 
(temporary) file `tdigest.scala`
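   
   To make the "ser/de only on partition boundaries" point concrete, here is a toy model in plain Scala. It is a sketch under stated assumptions, not code from this PR: `SimpleAggregator` is a hypothetical stand-in that only mirrors the `zero`/`reduce`/`merge`/`finish` shape of `Aggregator`, and `roundTrip` merely counts where a buffer serialize/deserialize cycle would happen.

```scala
// Hypothetical stand-in mirroring the zero/reduce/merge/finish shape of
// Aggregator[IN, BUF, OUT]. This is NOT Spark code and not code from the PR.
trait SimpleAggregator[IN, BUF, OUT] {
  def zero: BUF
  def reduce(b: BUF, a: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(b: BUF): OUT
}

// Mean of doubles, kept as a running (sum, count) buffer.
object MeanAgg extends SimpleAggregator[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)
  def reduce(b: (Double, Long), a: Double): (Double, Long) = (b._1 + a, b._2 + 1)
  def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) =
    (b1._1 + b2._1, b1._2 + b2._2)
  def finish(b: (Double, Long)): Double =
    if (b._2 == 0L) Double.NaN else b._1 / b._2
}

object SerDeCount {
  // Number of simulated serialize/deserialize cycles (illustrative only).
  var roundTrips = 0

  // Placeholder for a real ser/de cycle: it only counts the call.
  def roundTrip[B](b: B): B = { roundTrips += 1; b }

  // Buffer goes through ser/de after every input row.
  def perRow[IN, BUF, OUT](agg: SimpleAggregator[IN, BUF, OUT],
                           partitions: Seq[Seq[IN]]): OUT = {
    val partials = partitions.map { p =>
      p.foldLeft(agg.zero)((b, a) => roundTrip(agg.reduce(b, a)))
    }
    agg.finish(partials.foldLeft(agg.zero)(agg.merge))
  }

  // Buffer goes through ser/de once per partition, when it is handed to merge.
  def perPartition[IN, BUF, OUT](agg: SimpleAggregator[IN, BUF, OUT],
                                 partitions: Seq[Seq[IN]]): OUT = {
    val partials = partitions.map(p => roundTrip(p.foldLeft(agg.zero)(agg.reduce)))
    agg.finish(partials.foldLeft(agg.zero)(agg.merge))
  }
}
```

   For two partitions holding five rows in total, `perRow` counts five round trips and `perPartition` counts two, while both return the same mean. The gap grows with the number of rows per partition.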
   
   cons:
   * can only aggregate over a single value in a row, unlike UDAF and UDIA. For 
example, [this kind of aggregation on multiple columns of the input 
row](https://github.com/apache/spark/blob/72795a9a1583fc25eb0e7663771f746d4401cb5b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/UDIAQuerySuite.scala#L98)
 is not possible for an Aggregator-based solution.
   * Aggregator does not seem to have a way to specify whether an 
aggregation is deterministic; it assumes all aggregations are 
deterministic. This seems wrong to me, and either way it differs from how 
UDAF and UDIA work.
   * The processing of input rows is less flexible.  For example, if an 
Aggregator is declared with input type `IN` as `Double`, it will fail on a 
column of integer values. This is not necessarily true for UDAF and UDIA, if 
the input row values are read in the right way. There may be a way to add 
input casting, but I do not currently see it.
   
   In summary, doing this with enhancements to Aggregator is definitely 
feasible; however, I do not think it can provide full feature parity with UDAF 
or UDIA.
   
   

