Combine.perKey takes a single SerializableFunction which knows how to
convert from Iterable<V> to V.

It turns out that many runners implement optimizations which allow them to
run the combine operation across several machines to parallelize the work
and potentially reduce the amount of data they store during a GBK.
To be able to do such an optimization, it requires you to actually have
three functions:
InputT -> AccumulatorT : Creates the intermediate representation which
allows for associative combining
Iterable<AccumulatorT> -> AccumulatorT: Performs the actual combining
AccumT -> OutputT: Extracts the output

In the case of Combine.perKey with a SerializableFunction, your providing
Iterable<AccumulatorT> -> AccumulatorT and the other two functions are the
identity functions.

To be able to support a Combine.perKey which can go from Iterable<InputT>
-> OutputT would require that this occurred within a single machine
removing the parallelization benefits that runners provide and for almost
all cases is not a good idea.

On Wed, Oct 26, 2016 at 6:23 PM, Manu Zhang <owenzhang1...@gmail.com> wrote:

> Hi all,
>
> I'm wondering why `Combine.perKey(SerializableFunction)` requires input
> and
> output to be of the same type while `Combine.PerKey` doesn't have this
> restriction.
>
> Thanks,
> Manu
>

Reply via email to