after working with the Dataset and Aggregator APIs for a few weeks, porting
some fairly complex RDD algos (an overall pleasant experience), i wanted to
summarize the pain points and some suggestions for improvement based on that
experience. all of these are already mentioned on the mailing list or in
jira, but i figured it's good to put them in one place.
see below.
best,
koert

*) a lot of practical aggregation functions do not have a zero. this can be
dealt with correctly by using null or None as the zero for Aggregator. in
algebird, for example, this is expressed by lifting an algebird.Aggregator
(which does not have a zero) into an algebird.MonoidAggregator (which does
have a zero, and so is similar to spark's Aggregator). see:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L420
something similar should be possible in spark. however, Aggregator currently
does not accept null or an Option as its zero, making this approach
difficult. see:
https://www.mail-archive.com/user@spark.apache.org/msg53106.html
https://issues.apache.org/jira/browse/SPARK-15810
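to illustrate, here is a rough sketch of what such a lifting could look like
in spark, using None as the zero. the liftReduce name and the implicit
encoder parameters are my own, and this is exactly the kind of code that
SPARK-15810 currently gets in the way of, so treat it as a sketch rather
than working code:

  import org.apache.spark.sql.Encoder
  import org.apache.spark.sql.expressions.Aggregator

  // lift a reduce function without a zero into an Aggregator by using
  // Option[A] as the buffer type, with None as the zero
  def liftReduce[A](f: (A, A) => A)(
      implicit aEnc: Encoder[A],
      optEnc: Encoder[Option[A]]): Aggregator[A, Option[A], A] =
    new Aggregator[A, Option[A], A] {
      def zero: Option[A] = None
      def reduce(b: Option[A], a: A): Option[A] = b.map(f(_, a)).orElse(Some(a))
      def merge(b1: Option[A], b2: Option[A]): Option[A] = (b1, b2) match {
        case (Some(x), Some(y)) => Some(f(x, y))
        case _                  => b1.orElse(b2)
      }
      def finish(b: Option[A]): A =
        b.getOrElse(sys.error("reduce of empty group"))
      def bufferEncoder: Encoder[Option[A]] = optEnc
      def outputEncoder: Encoder[A] = aEnc
    }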

*) KeyValueGroupedDataset.reduceGroups needs to be efficient, probably by
using an Aggregator (with null or None as the zero) under the hood. the
current implementation does a flatMapGroups, which is suboptimal since it
materializes every group instead of allowing partial aggregation.
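as a sketch (reusing the hypothetical liftReduce from the previous point),
reduceGroups could then be a single typed aggregation:

  import org.apache.spark.sql.{Dataset, Encoder, KeyValueGroupedDataset}

  // sketch: reduceGroups expressed as an Aggregator-based agg
  def reduceGroupsViaAgg[K, V](grouped: KeyValueGroupedDataset[K, V])(
      f: (V, V) => V)(
      implicit vEnc: Encoder[V],
      optEnc: Encoder[Option[V]]): Dataset[(K, V)] =
    grouped.agg(liftReduce(f).toColumn)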

*) KeyValueGroupedDataset needs mapValues. without this, porting many algos
from RDD to Dataset is difficult and clumsy. see:
https://issues.apache.org/jira/browse/SPARK-15780
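to make the clumsiness concrete, here is a word-count style example (the
dataset and names are made up): without mapValues the only option is to
rebuild tuples and drag tuple projections through every step, whereas with
mapValues the values could be transformed directly after grouping:

  import org.apache.spark.sql.Dataset

  // today: re-key into tuples and project out _._2 inside the groups
  def wordCount(words: Dataset[String]): Dataset[(String, Int)] = {
    import words.sparkSession.implicits._
    words.map(w => (w, 1))
      .groupByKey(_._1)
      .mapGroups { (word, pairs) => (word, pairs.map(_._2).sum) }
  }

  // desired (sketch, per SPARK-15780), in the style of RDD's
  // PairRDDFunctions.mapValues:
  //   words.groupByKey(identity).mapValues(_ => 1).reduceGroups(_ + _)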

*) Aggregators need to also work within DataFrames (so with
RelationalGroupedDataset) without having to fall back on using Row objects
as input. otherwise all aggregation code ends up being written twice, once
as an Aggregator and once as a UserDefinedAggregateFunction (UDAF). this
doesn't make sense to me. my attempt at addressing this:
https://issues.apache.org/jira/browse/SPARK-15769
https://github.com/apache/spark/pull/13512
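to show the duplication concretely (a sketch; the column name and object
names are made up): the same trivial sum logic written once as a typed
Aggregator for Datasets has to be written a second time against Row before
it can be used on a DataFrame:

  import org.apache.spark.sql.{Encoder, Encoders, Row}
  import org.apache.spark.sql.expressions.Aggregator

  // typed version, usable on a Dataset[Long]
  object SumLong extends Aggregator[Long, Long, Long] {
    def zero: Long = 0L
    def reduce(b: Long, a: Long): Long = b + a
    def merge(b1: Long, b2: Long): Long = b1 + b2
    def finish(b: Long): Long = b
    def bufferEncoder: Encoder[Long] = Encoders.scalaLong
    def outputEncoder: Encoder[Long] = Encoders.scalaLong
  }

  // the same logic again, written against Row so it can run on a DataFrame.
  // this second copy is what SPARK-15769 / pull 13512 tries to make unnecessary
  object SumAmount extends Aggregator[Row, Long, Long] {
    def zero: Long = 0L
    def reduce(b: Long, r: Row): Long = b + r.getAs[Long]("amount")
    def merge(b1: Long, b2: Long): Long = b1 + b2
    def finish(b: Long): Long = b
    def bufferEncoder: Encoder[Long] = Encoders.scalaLong
    def outputEncoder: Encoder[Long] = Encoders.scalaLong
  }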

