Thanks for the thorough explanation. I see the benefits for such a function.
My follow-up question is whether this is a hard requirement.
There are computations that don't satisfy this (I think it's monoid rule)
but possible and easier to write with
Combine.perKey(SerializableFunction<Iterable<InputT>, OutputT>). It's not
difficult to provide an underlying CombineFn.


On Thu, Oct 27, 2016 at 9:47 AM Lukasz Cwik <lc...@google.com.invalid>
wrote:

> Combine.perKey takes a single SerializableFunction which knows how to
> convert from Iterable<V> to V.
>
> It turns out that many runners implement optimizations which allow them to
> run the combine operation across several machines to parallelize the work
> and potentially reduce the amount of data they store during a GBK.
> To be able to do such an optimization, it requires you to actually have
> three functions:
> InputT -> AccumulatorT : Creates the intermediate representation which
> allows for associative combining
> Iterable<AccumulatorT> -> AccumulatorT: Performs the actual combining
> AccumT -> OutputT: Extracts the output
>
> In the case of Combine.perKey with a SerializableFunction, your providing
> Iterable<AccumulatorT> -> AccumulatorT and the other two functions are the
> identity functions.
>
> To be able to support a Combine.perKey which can go from Iterable<InputT>
> -> OutputT would require that this occurred within a single machine
> removing the parallelization benefits that runners provide and for almost
> all cases is not a good idea.
>
> On Wed, Oct 26, 2016 at 6:23 PM, Manu Zhang <owenzhang1...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I'm wondering why `Combine.perKey(SerializableFunction)` requires input
> > and
> > output to be of the same type while `Combine.PerKey` doesn't have this
> > restriction.
> >
> > Thanks,
> > Manu
> >
>

Reply via email to