Let's come up with a comprehensive design that works for both the Batch and the Streaming API.
It would be good to include aggregation functions that internally break down into multiple aggregations (like AVG breaking down into a count and a sum), while ensuring that no sub-aggregate is computed twice unnecessarily.

On Tue, Sep 9, 2014 at 12:30 AM, Fabian Hueske <[email protected]> wrote:

> Having aggregation functions return only a single value is not very
> helpful IMO.
> First, an aggregation function should also work on grouped data sets, i.e.,
> return one aggregate for each group. Hence, the grouping keys must be
> included in the result somehow.
> Second, imagine a use case where the min, max, and avg values of some fields
> of a tuple are needed. If this were computed with multiple independent
> aggregation functions, the data set would be shuffled and reduced three
> times and possibly joined again.
>
> I think it should be possible to combine multiple aggregation functions,
> e.g., compute a result with field 2 as the grouping key, the minimum and
> maximum of field 3, and the average of field 5.
> Basically, have something like the project operator, but with aggregation
> functions and keys. This is also what I sketched in my proposal.
>
> @Hermann: Regarding the reduce function with a custom return type, do you
> have a concrete use case in mind for that?
>
> Cheers, Fabian
>
> 2014-09-08 14:20 GMT+02:00 Hermann Gábor <[email protected]>:
>
> > I also agree on using minBy as the default mechanism.
> >
> > If both min and minBy are needed, it would seem more natural to me for
> > min (and also for sum) to return only the given field of the tuple.
> >
> > More generally, a reduce function with a custom return type would also
> > be useful in my view. In that case the user would also provide a value
> > of type T to begin the reduction with, and implement a function which
> > reduces a value and a value of type T and returns a value of type T.
> > Would that make sense?
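To make the design points above concrete, here is a minimal, self-contained sketch (not the actual Flink API; all names are illustrative) of how composite aggregates like AVG can be expanded into primitive sub-aggregates (SUM, COUNT) that are deduplicated and computed in a single pass, so requesting MIN, MAX, and AVG together never shuffles or reduces the data more than once and never computes a sub-aggregate twice:

```java
import java.util.*;

// Hypothetical sketch: composite aggregations that share sub-aggregates,
// so AVG reuses SUM and COUNT instead of triggering separate passes.
public class CombinedAggregation {

    // One pass over the values of a group, maintaining each distinct
    // primitive sub-aggregate exactly once.
    static Map<String, Double> aggregate(List<Double> values, List<String> requested) {
        // Expand composite aggregates into their primitive parts, deduplicated.
        Set<String> primitives = new LinkedHashSet<>();
        for (String agg : requested) {
            if (agg.equals("AVG")) {
                primitives.add("SUM");
                primitives.add("COUNT");
            } else {
                primitives.add(agg);
            }
        }
        Map<String, Double> acc = new HashMap<>();
        for (double v : values) {
            for (String p : primitives) {
                switch (p) {
                    case "SUM":   acc.merge("SUM", v, Double::sum); break;
                    case "COUNT": acc.merge("COUNT", 1.0, Double::sum); break;
                    case "MIN":   acc.merge("MIN", v, Math::min); break;
                    case "MAX":   acc.merge("MAX", v, Math::max); break;
                }
            }
        }
        // Derive composite results from the shared primitives.
        if (requested.contains("AVG")) {
            acc.put("AVG", acc.get("SUM") / acc.get("COUNT"));
        }
        return acc;
    }

    public static void main(String[] args) {
        Map<String, Double> r = aggregate(
                Arrays.asList(1.0, 2.0, 3.0, 4.0),
                Arrays.asList("MIN", "MAX", "AVG"));
        // MIN=1.0, MAX=4.0, AVG=2.5; SUM and COUNT were each computed once.
        System.out.println(r);
    }
}
```

In a real grouped execution this per-group accumulator map would be keyed by the grouping fields, so the result naturally carries one row per group with the grouping key included, as Fabian describes.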
