after working with the Dataset and Aggregator APIs for a few weeks, porting some fairly complex RDD algos (an overall pleasant experience), i wanted to summarize the pain points and some suggestions for improvement. all of these have already been mentioned on the mailing list or in jira, but i figured it's good to put them in one place. see below.
*) a lot of practical aggregation functions do not have a zero. this can be dealt with correctly by using null or None as the zero for Aggregator. in algebird, for example, this is expressed by lifting an algebird.Aggregator (which does not have a zero) into an algebird.MonoidAggregator (which does have a zero, so it is similar to spark's Aggregator). see:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L420
something similar should be possible in spark. however, Aggregator currently does not accept null or an Option as its zero, which makes this approach difficult. see:
https://www.mail-archive.com/user@spark.apache.org/msg53106.html
https://issues.apache.org/jira/browse/SPARK-15810

*) KeyValueGroupedDataset.reduceGroups needs to be efficient, probably by using an Aggregator (with null or None as the zero) under the hood. the current implementation does a flatMapGroups, which is suboptimal.

*) KeyValueGroupedDataset needs mapValues. without it, porting many algos from RDD to Dataset is difficult and clumsy. see:
https://issues.apache.org/jira/browse/SPARK-15780

*) Aggregators also need to work within DataFrames (so RelationalGroupedDataset) without having to fall back on Row objects as input. otherwise all code ends up being written twice, once as an Aggregator and once as a UserDefinedAggregateFunction/UDAF, which doesn't make sense to me. my attempt at addressing this:
https://issues.apache.org/jira/browse/SPARK-15769
https://github.com/apache/spark/pull/13512

best, koert
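to make the Option-as-zero lifting from the first point concrete, here is a minimal pure-scala sketch (no spark dependency; the trait and method names are illustrative, not an existing API — it just mirrors what algebird's lift does):

```scala
// an aggregator without a zero: only meaningful on at least one element
trait SemigroupAggregator[A, B] {
  def prepare(a: A): B
  def reduce(l: B, r: B): B
}

// an aggregator with a zero, shaped like spark's Aggregator
trait MonoidAggregator[A, B] {
  def zero: B
  def prepare(a: A): B
  def reduce(l: B, r: B): B
}

// lift a zero-less aggregator into one with a zero by wrapping the
// buffer in Option and using None as the zero
def lift[A, B](agg: SemigroupAggregator[A, B]): MonoidAggregator[A, Option[B]] =
  new MonoidAggregator[A, Option[B]] {
    def zero: Option[B] = None // None stands in for the missing zero
    def prepare(a: A): Option[B] = Some(agg.prepare(a))
    def reduce(l: Option[B], r: Option[B]): Option[B] = (l, r) match {
      case (Some(x), Some(y)) => Some(agg.reduce(x, y))
      case (x, None)          => x
      case (None, y)          => y
    }
  }
```

for example, max has no sensible zero for an empty input; after lifting, an empty group simply aggregates to None instead of requiring a fake sentinel value. this is exactly the shape a spark Aggregator would need, if only its zero were allowed to be None.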
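and as an illustration of the reduceGroups point: a reduce without a zero can always be run as a fold with None as the zero, which is the trick an Aggregator-backed reduceGroups could use under the hood. a pure-scala sketch over in-memory pairs (illustrative only, not the actual spark implementation):

```scala
// reduce values per key using a fold over Option, with None as the zero --
// no sentinel value is needed even though f has no identity element
def reduceGroups[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, vs) =>
    val reduced = vs.iterator
      .map(v => Option(v._2))
      .foldLeft(Option.empty[V]) {
        case (Some(a), Some(b)) => Some(f(a, b))
        case (acc, None)        => acc
        case (None, x)          => x
      }
    k -> reduced.get // safe: groupBy never produces an empty group
  }
```

the per-group work is a single fold, i.e. the same cost profile as an Aggregator's reduce, rather than materializing an iterator-to-iterator function the way flatMapGroups does.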