Having aggregation functions that only return a single value is not very helpful IMO. First, an aggregation function should also work on grouped data sets, i.e., return one aggregate for each group. Hence, the grouping keys must be included in the result somehow. Second, imagine a use case where the min, max, and avg of some fields of a tuple are needed. If this were computed with multiple independent aggregation functions, the data set would be shuffled and reduced three times and possibly joined again.
I think it should be possible to combine multiple aggregation functions, e.g., compute a result with field 2 as grouping key, the minimum and maximum of field 3, and the average of field 5. Basically, have something like the project operator, but with aggregation functions and keys (I put a rough sketch of what I mean below the quoted mail). This is also what I sketched in my proposal.

@Hermann: Regarding the reduce function with a custom return type, do you have a concrete use case in mind for that?

Cheers,
Fabian

2014-09-08 14:20 GMT+02:00 Hermann Gábor <[email protected]>:

> I also agree on using minBy as the default mechanism.
>
> If both min and minBy are needed, it would seem more natural for min (and
> also for sum) to return only the given field of the tuple, in my opinion.
>
> More generally, a reduce function with a custom return type would also be
> useful in my view. In that case the user would also give a value of type T
> to begin the reduction with, and implement a function which reduces a value
> and a value of type T and returns a value of type T. Would that make sense?
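
Here is the rough sketch of the combined aggregation I mentioned above. It is purely hypothetical: the aggregate()/and() methods and an AVG aggregation function are made up for illustration, and I simply assume an input DataSet of tuples with a String key in field 2, int values in field 3, and double values in field 5.

  // one shuffle + one reduce, one output tuple per group:
  // (key from field 2, min of field 3, max of field 3, avg of field 5)
  DataSet<Tuple4<String, Integer, Integer, Double>> result =
      input.groupBy(2)
           .aggregate(MIN, 3)
           .and(MAX, 3)
           .and(AVG, 5);

That would cover the min/max/avg use case from above with a single shuffle and reduce instead of three independent aggregations plus joins, and the grouping key is part of every result tuple.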

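PS @Hermann: just to make sure I understand the reduce with a custom return type correctly, the user-facing part would look roughly like this? (All names are made up, this is only a sketch; Tuple2 is Flink's org.apache.flink.api.java.tuple.Tuple2.)

  // fold-style reduce: start from an initial value of type T and combine it
  // with one input value of type IN at a time
  public interface FoldFunction<IN, T> {
      T fold(T aggregate, IN value);
  }

  // example: concatenate the String field of all tuples in a group
  FoldFunction<Tuple2<Integer, String>, String> concat =
      new FoldFunction<Tuple2<Integer, String>, String>() {
          public String fold(String aggregate, Tuple2<Integer, String> value) {
              return aggregate + value.f1;
          }
      };

  // hypothetical call: initial value "" plus the function, applied per group
  // ds.groupBy(0).fold("", concat)

If that is the idea, then the result type T can be completely different from the element type of the data set.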