[GitHub] [spark] erikerlandson commented on issue #25024: [SPARK-27296][SQL] User Defined Aggregators that do not ser/de on each input row

GitBox Sat, 28 Sep 2019 14:19:43 -0700

erikerlandson commented on issue #25024: [SPARK-27296][SQL] User Defined
Aggregators that do not ser/de on each input row
URL: https://github.com/apache/spark/pull/25024#issuecomment-536226337

cc @rxin @cloud-fan @hvanhovell
In this latest push I added a proof-of-concept solution that adds a
`Column`-generating method (`apply`) to `Aggregator[IN, BUF, OUT]`. It has some
pros and cons relative to my previous `UserDefinedImperativeAggregator` (UDIA),
which is also still in this PR.

pros:
* does not add a new aggregating class
* has comparable efficiency to UDIA (only does ser/de on partition
boundaries)
* I have shown it can work with user defined types, as demonstrated in the
(temporary) file `tdigest.scala`
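
To illustrate the efficiency point, here is a minimal, Spark-free sketch of why keeping the buffer as a live object within a partition matters. Everything in it is hypothetical and for illustration only: `SimpleAggregator` merely mirrors the `zero`/`reduce`/`merge`/`finish` shape of `Aggregator`, and `CountingCodec` is a stand-in for a buffer encoder that counts its own calls. It is not the code in this PR.

```scala
// Hypothetical, Spark-free model; none of these names exist in Spark.
object SerDeBoundarySketch {
  // Mirrors the shape of Aggregator[IN, BUF, OUT]: zero, reduce, merge, finish.
  trait SimpleAggregator[IN, BUF, OUT] {
    def zero: BUF
    def reduce(b: BUF, a: IN): BUF
    def merge(b1: BUF, b2: BUF): BUF
    def finish(b: BUF): OUT
  }

  // Running mean with a (sum, count) buffer.
  object MeanAgg extends SimpleAggregator[Double, (Double, Long), Double] {
    def zero: (Double, Long) = (0.0, 0L)
    def reduce(b: (Double, Long), a: Double): (Double, Long) = (b._1 + a, b._2 + 1L)
    def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) =
      (b1._1 + b2._1, b1._2 + b2._2)
    def finish(b: (Double, Long)): Double = if (b._2 == 0L) Double.NaN else b._1 / b._2
  }

  // Stand-in for a buffer encoder; counts how often ser and de are invoked.
  final class CountingCodec {
    var serCount = 0
    var deCount = 0
    def ser(b: (Double, Long)): String = { serCount += 1; s"${b._1},${b._2}" }
    def de(s: String): (Double, Long) = {
      deCount += 1
      val Array(sum, n) = s.split(",")
      (sum.toDouble, n.toLong)
    }
  }

  // Per-row strategy: the buffer is stored serialized and round-tripped on every input.
  def perRow(parts: Seq[Seq[Double]], codec: CountingCodec): Double = {
    val partials = parts.map { p =>
      p.foldLeft(codec.ser(MeanAgg.zero)) { (s, x) =>
        codec.ser(MeanAgg.reduce(codec.de(s), x))
      }
    }
    MeanAgg.finish(partials.map(codec.de).foldLeft(MeanAgg.zero)(MeanAgg.merge))
  }

  // Boundary strategy: the buffer stays a live object and is serialized once per partition.
  def perPartition(parts: Seq[Seq[Double]], codec: CountingCodec): Double = {
    val partials = parts.map(p => codec.ser(p.foldLeft(MeanAgg.zero)(MeanAgg.reduce)))
    MeanAgg.finish(partials.map(codec.de).foldLeft(MeanAgg.zero)(MeanAgg.merge))
  }
}
```

In this toy model the per-row strategy pays one ser and one de per input row, plus one initial ser and one final de per partition, while the boundary strategy pays only one ser and one de per partition. For a buffer as heavy as a t-digest, that difference is the whole point.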

cons:
* can only aggregate over a single value in a row, unlike UDAF and UDIA. For
example, [this kind of aggregation on multiple columns of the input
row](https://github.com/apache/spark/blob/72795a9a1583fc25eb0e7663771f746d4401cb5b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/UDIAQuerySuite.scala#L98)
is not possible for an Aggregator-based solution.
* Aggregator does not seem to have a concept of specifying whether an
aggregation is deterministic or not; it assumes all aggregations are
deterministic. This seems wrong to me, and either way it differs from how
UDAF and UDIA work.
* The processing of input rows is less flexible. For example, if an
Aggregator is declared with `IN` as `Double`, it will fail on a column of
integer values. This is not necessarily true for UDAF and UDIA, if the input
row values are read in the right way. There may be a way to add input casting,
but I do not currently see it.
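
As a rough analogy for the last point, the sketch below contrasts a rigid read that assumes the input is exactly a `Double` with a lenient read that widens any boxed numeric value. It is plain Scala with a "row" modeled as `Seq[Any]`. The names `readStrict`, `readLenient` and `sumWith` are hypothetical, and the sketch does not reproduce Spark's actual failure mode or code path. It only shows the kind of freedom a UDAF or UDIA author has when reading values out of the input row by hand.

```scala
// Hypothetical, Spark-free analogy; a "row" here is just a Seq[Any] of boxed values.
object InputReadSketch {
  // Rigid read: assumes the value is exactly a Double; a boxed Int fails at runtime.
  def readStrict(row: Seq[Any], i: Int): Double = row(i).asInstanceOf[Double]

  // Lenient read: accepts any boxed numeric type and widens it to Double.
  def readLenient(row: Seq[Any], i: Int): Double = row(i) match {
    case n: java.lang.Number => n.doubleValue()
    case other => throw new IllegalArgumentException(s"non-numeric input: $other")
  }

  // Sums column 0 of every row using the supplied read strategy.
  def sumWith(read: (Seq[Any], Int) => Double)(rows: Seq[Seq[Any]]): Double =
    rows.foldLeft(0.0)((acc, r) => acc + read(r, 0))
}
```

With rows of integers, `sumWith(readLenient)` succeeds while `sumWith(readStrict)` throws a `ClassCastException`. A fixed `IN` type behaves like the strict read unless a cast is inserted somewhere upstream.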

In summary, doing this with enhancements to Aggregator is definitely
feasible; however, I do not think it can provide full feature parity with UDAF
or UDIA.


