Yes, I guess we could infer the types from the input or from the aggregation function used (Avg -> Double, Cnt -> Long). We also thought about removing the explicit types from the project() operator (FLINK-1040).
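For illustration, such an inference rule could be sketched like this. This is only a minimal sketch of the idea, not existing Flink API; the `Aggregation` enum and `resultType` helper are invented names:

```java
// Hypothetical sketch: inferring the result type of an aggregation
// from the aggregation kind and the input field type.
// None of these names exist in Flink; they only illustrate the rule.
public class AggTypeInference {

    enum Aggregation { SUM, MIN, MAX, AVG, CNT }

    // AVG is fractional, CNT may exceed the int range,
    // all other aggregations keep the input field type.
    static Class<?> resultType(Aggregation agg, Class<?> fieldType) {
        switch (agg) {
            case AVG: return Double.class;
            case CNT: return Long.class;
            default:  return fieldType;
        }
    }

    public static void main(String[] args) {
        assert resultType(Aggregation.AVG, Integer.class) == Double.class;
        assert resultType(Aggregation.CNT, String.class) == Long.class;
        assert resultType(Aggregation.MIN, Integer.class) == Integer.class;
        System.out.println("ok");
    }
}
```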
I see, the custom type reduce might be useful. However, we should be
careful not to bloat the API too much. Not sure if it is useful/important
enough. Opinions?

2014-09-09 11:42 GMT+02:00 Hermann Gábor <[email protected]>:

> The only advantage of returning a single value instead of the whole tuple
> would be having smaller data. I agree, it is not that useful, and the
> logic that you proposed earlier could simply provide this with a single
> aggregation.
>
> In addition, isn't it possible to provide the mechanism in your proposal
> without the user needing to set the return types? Can the types be
> extracted from the tuple and the aggregation (e.g. average should be a
> Double)?
>
> A simple example of the custom return type reduce function is a modified
> WordCount:
>
>     public class WC {
>         public String word;
>         public int count;
>         // [...]
>     }
>
>     public class WordCounter implements ReduceFunction<String, WC> {
>
>         @Override
>         public WC reduce(String word, WC reductionValue) {
>             return new WC(word, 1 + reductionValue.count);
>         }
>     }
>
>     groupedWords.reduce(new WordCounter(), new WC(null, 0));
>
> (Of course this can easily be done with an aggregation, but this was the
> simplest use case I could come up with.)
>
> The only advantage here is also the smaller/clearer value and maybe
> generality. Functional languages like Haskell support this kind of
> reduction on collections (that is the reason I thought about it). On the
> other hand, there are many drawbacks to a reduce function like this: it
> cannot combine two separately reduced sets of data, the user must provide
> an initial value, and every reduction like this can be done with larger
> tuples. It is not clear to me whether it would be better or not, but I
> thought it was worth considering.
>
> Cheers,
> Gabor
>
>
> On Tue, Sep 9, 2014 at 12:30 AM, Fabian Hueske <[email protected]> wrote:
>
> > Having aggregation functions only return a single value is not very
> > helpful, IMO.
> > First, an aggregation function should also work on grouped data sets,
> > i.e., return one aggregate for each group. Hence, the grouping keys
> > must be included in the result somehow.
> > Second, imagine a use case where the min, max, and avg value of some
> > fields of a tuple are needed. If this were computed with multiple
> > independent aggregation functions, the data set would be shuffled and
> > reduced three times and possibly joined again.
> >
> > I think it should be possible to combine multiple aggregation
> > functions, e.g., compute a result with field 2 as grouping key, the
> > minimum and maximum of field 3, and the average of field 5.
> > Basically, have something like the project operator but with
> > aggregation functions and keys. This is also what I sketched in my
> > proposal.
> >
> > @Hermann: Regarding the reduce function with custom return type, do
> > you have some concrete use case in mind for that?
> >
> > Cheers, Fabian
> >
> > 2014-09-08 14:20 GMT+02:00 Hermann Gábor <[email protected]>:
> >
> > > I also agree on using minBy as the default mechanism.
> > >
> > > If both min and minBy are needed, it would seem more natural for min
> > > (and also for sum) to return only the given field of the tuple, in
> > > my opinion.
> > >
> > > More generally, a reduce function with a custom return type would
> > > also be useful in my view. In that case the user would also give a
> > > value of type T to begin the reduction with, and implement a
> > > function which reduces a value and a value of type T and returns a
> > > value of type T. Would that make sense?
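To make the semantics of the proposed custom-return-type reduce concrete, here is a minimal standalone sketch of the fold-like reduction Gábor describes (an initial value of type T, combined element by element). The `ReduceFunction<IN, OUT>` interface and `fold` helper below are the proposal's shape, not Flink's existing `ReduceFunction`:

```java
// Sketch of the discussed fold-like reduction: start from an initial
// value and combine each group element with the running result.
// Plain Java stands in for a grouped DataSet; these names are not Flink API.
import java.util.Arrays;
import java.util.List;

public class FoldSketch {

    // The two-type reduce interface from the WordCount example above.
    interface ReduceFunction<IN, OUT> {
        OUT reduce(IN value, OUT reductionValue);
    }

    static class WC {
        String word;
        int count;
        WC(String word, int count) { this.word = word; this.count = count; }
    }

    // Sequentially reduce one group, starting from the user-supplied
    // initial value. Note this cannot merge two partial results of type
    // OUT -- exactly the combinability drawback mentioned in the thread.
    static <IN, OUT> OUT fold(List<IN> group, ReduceFunction<IN, OUT> f, OUT initial) {
        OUT acc = initial;
        for (IN v : group) {
            acc = f.reduce(v, acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("flink", "flink", "flink");
        WC result = fold(words,
                (word, acc) -> new WC(word, 1 + acc.count),
                new WC(null, 0));
        System.out.println(result.word + ": " + result.count); // prints "flink: 3"
    }
}
```

Under this sketch it is easy to see why a combiner cannot be derived automatically: two partial `WC` results cannot be merged by the same `reduce(String, WC)` function.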
