Yes, I guess we could infer the types from the input or from the
aggregation function used (Avg -> Double, Cnt -> Long).
We also thought about removing the explicit types from the project()
operator (FLINK-1040).
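As a rough illustration of what such inference could look like (names like
Aggregation and ResultTypes are made up for this sketch, not Flink API):
aggregations with a fixed result type are looked up in a table, the rest
keep the input field's type.

```java
import java.util.EnumMap;
import java.util.Map;

enum Aggregation { SUM, MIN, MAX, AVG, CNT }

class ResultTypes {
    private static final Map<Aggregation, Class<?>> RESULT_TYPES =
            new EnumMap<>(Aggregation.class);
    static {
        // AVG always yields a Double and CNT a Long, regardless of
        // the input field's type.
        RESULT_TYPES.put(Aggregation.AVG, Double.class);
        RESULT_TYPES.put(Aggregation.CNT, Long.class);
    }

    // SUM/MIN/MAX preserve the input field's type, so fall back to it.
    static Class<?> resultType(Aggregation agg, Class<?> inputType) {
        return RESULT_TYPES.getOrDefault(agg, inputType);
    }
}
```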

I see, a reduce with a custom return type might be useful.
However, we should be careful not to bloat the API too much.
I am not sure it is useful/important enough.
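For comparison, plain java.util.stream already offers this fold-style
shape, and it shows where the friction is: the three-argument
Stream.reduce needs a combiner precisely because two partially reduced
accumulators must be mergeable, which is the combinability drawback
mentioned below. (FoldSketch/countWords are hypothetical names for this
sketch.)

```java
import java.util.List;

class FoldSketch {
    static int countWords(List<String> words) {
        return words.stream().reduce(
                0,                          // initial value, supplied by the user
                (count, word) -> count + 1, // fold one element into the accumulator
                Integer::sum);              // combiner for partial results
    }
}
```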

Opinions?

2014-09-09 11:42 GMT+02:00 Hermann Gábor <[email protected]>:

> The only advantage of returning a single value instead of the whole tuple
> would be having smaller data. I agree, it is not that useful, and the logic
> that you proposed earlier could simply provide this with a single
> aggregation.
>
> In addition, isn't it possible to provide the mechanism in your proposal
> without the user needing to set the return types? Can the types be
> extracted from the tuple and the aggregation (e.g. average should be a
> Double)?
>
> A simple example of the custom return type reduce function is a modified
> WordCount:
>
>         public class WC {
>                 public String word;
>                 public int count;
>                 // [...]
>         }
>
>         // Sketch of the proposed two-type reduce: the first type is the
>         // input element, the second the accumulator/return type.
>         public class WordCounter implements ReduceFunction<String, WC> {
>
>                 @Override
>                 public WC reduce(String word, WC reductionValue) {
>                         return new WC(word, 1 + reductionValue.count);
>                 }
>         }
>
>         groupedWords.reduce(new WordCounter(), new WC(null, 0));
>
>
> (Of course this can be easily done with an aggregation, but this was
> the simplest use case I could come up with.)
>
> The only advantages here are, again, the smaller/clearer value and maybe
> generality.
> Functional languages like Haskell support this kind of reduction on
> collections (that is why I thought about it). On the other hand, a reduce
> function like this has many drawbacks: it cannot combine two separately
> reduced sets of data, the user must provide an initial value, and every
> reduction like this can be done with larger tuples. It is not clear to me
> whether it would be better or not, but I thought it was worth considering.
>
> Cheers,
> Gabor
>
>
>
> On Tue, Sep 9, 2014 at 12:30 AM, Fabian Hueske <[email protected]> wrote:
>
> > Having aggregation functions return only a single value is not very
> > helpful IMO.
> > First, an aggregation function should also work on grouped data sets,
> > i.e., return one aggregate for each group. Hence, the grouping keys must
> > be included in the result somehow.
> > Second, imagine a use case where the min, max, and avg values of some
> > fields of a tuple are needed. If these were computed with multiple
> > independent aggregation functions, the data set would be shuffled and
> > reduced three times and possibly joined again.
> >
> > I think it should be possible to combine multiple aggregation functions,
> > e.g., compute a result with field 2 as grouping key, the minimum and
> > maximum of field 3 and the average of field 5.
> > Basically, have something like the project operator but with aggregation
> > functions and keys. This is also what I sketched in my proposal.
> >
> > @Hermann: Regarding the reduce function with custom return type, do you
> > have some concrete use case in mind for that?
> >
> > Cheers, Fabian
> >
> > 2014-09-08 14:20 GMT+02:00 Hermann Gábor <[email protected]>:
> >
> > > I also agree on using minBy as the default mechanism.
> > >
> > > If both min and minBy are needed, it would seem more natural to me for
> > > min (and also for sum) to return only the given field of the tuple.
> > >
> > > More generally, a reduce function with a custom return type would also
> > > be useful in my view. In that case the user would also give a value of
> > > type T to begin the reduction with, and implement a function which
> > > reduces a value and a value of type T and returns a value of type T.
> > > Would that make sense?
> > >
> >
>
