Having aggregation functions that only return a single value is not very helpful IMO. First, an aggregation function should also work on grouped data sets, i.e., return one aggregate for each group. Hence, the grouping keys must be included in the result somehow. Second, imagine a use case where the min, max, and avg of some fields of a tuple are needed. If this were computed with multiple independent aggregation functions, the data set would be shuffled and reduced three times and possibly joined again.
I think it should be possible to combine multiple aggregation functions, e.g., compute a result with field 2 as grouping key, the minimum and maximum of field 3, and the average of field 5. Basically, have something like the project operator, but with aggregation functions and keys (I put a rough sketch of what I mean below the quoted mail). This is also what I sketched in my proposal.

@Hermann: Regarding the reduce function with a custom return type, do you have a concrete use case in mind for that?

Cheers,
Fabian

2014-09-08 14:20 GMT+02:00 Hermann Gábor <[email protected]>:

> I also agree on using minBy as the default mechanism.
>
> If both min and minBy are needed, it would seem more natural for min (and
> also for sum) to return only the given field of the tuple, in my opinion.
>
> More generally, a reduce function with a custom return type would also be
> useful in my view. In that case the user would also give a value of type T
> to begin the reduction with, and implement a function which reduces a value
> and a value of type T and returns a value of type T. Would that make sense?
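
Here is the rough sketch of the combined aggregation I mentioned above. It is purely hypothetical: the aggregate()/and() methods and an AVG aggregation function are made up for illustration, and I simply assume an input DataSet of tuples with a String key in field 2, int values in field 3, and double values in field 5.

  // one shuffle + one reduce, one output tuple per group:
  // (key from field 2, min of field 3, max of field 3, avg of field 5)
  DataSet<Tuple4<String, Integer, Integer, Double>> result =
      input.groupBy(2)
           .aggregate(MIN, 3)
           .and(MAX, 3)
           .and(AVG, 5);

That would cover the min/max/avg use case from above with a single shuffle and reduce instead of three independent aggregations plus joins, and the grouping key is part of every result tuple.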

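PS @Hermann: just to make sure I understand the reduce with a custom return type correctly, the user-facing part would look roughly like this? (All names are made up, this is only a sketch; Tuple2 is Flink's org.apache.flink.api.java.tuple.Tuple2.)

  // fold-style reduce: start from an initial value of type T and combine it
  // with one input value of type IN at a time
  public interface FoldFunction<IN, T> {
      T fold(T aggregate, IN value);
  }

  // example: concatenate the String field of all tuples in a group
  FoldFunction<Tuple2<Integer, String>, String> concat =
      new FoldFunction<Tuple2<Integer, String>, String>() {
          public String fold(String aggregate, Tuple2<Integer, String> value) {
              return aggregate + value.f1;
          }
      };

  // hypothetical call: initial value "" plus the function, applied per group
  // ds.groupBy(0).fold("", concat)

If that is the idea, then the result type T can be completely different from the element type of the data set.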