Combine.perKey takes a single SerializableFunction which knows how to convert from Iterable<V> to V.
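
As a rough illustration of that shape, here is a minimal plain-Java sketch. It does not use Beam: java.util.function.Function stands in for SerializableFunction, and the class name SumSketch and the constant SUM are made up for this example. Because the function maps Iterable<Integer> to Integer, it can be applied to its own partial results:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SumSketch {
    // Hypothetical stand-in for a SerializableFunction<Iterable<Integer>, Integer>.
    static final Function<Iterable<Integer>, Integer> SUM = values -> {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    };

    public static void main(String[] args) {
        // Combining every value for a key in one call.
        List<Integer> all = Arrays.asList(1, 2, 3, 4);
        int direct = SUM.apply(all);

        // Combining two partial groups separately, then combining the partial results.
        int left = SUM.apply(Arrays.asList(1, 2));
        int right = SUM.apply(Arrays.asList(3, 4));
        int combined = SUM.apply(Arrays.asList(left, right));

        System.out.println(direct + " " + combined); // prints "10 10"
    }
}
```

Reapplying the function to partial results only type-checks because the input element type and the output type are the same.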
It turns out that many runners implement optimizations that let them run the combine operation across several machines, which parallelizes the work and can reduce the amount of data they store during a GroupByKey (GBK). Such an optimization needs three functions:

  InputT -> AccumulatorT: creates the intermediate representation that allows for associative combining
  Iterable<AccumulatorT> -> AccumulatorT: performs the actual combining
  AccumulatorT -> OutputT: extracts the output

In the case of Combine.perKey with a SerializableFunction, you're providing Iterable<AccumulatorT> -> AccumulatorT, and the other two functions are the identity function. That is why the input and output types must be the same. Supporting a Combine.perKey that goes from Iterable<InputT> -> OutputT would require all of the combining to occur on a single machine, which removes the parallelization benefits that runners provide and is not a good idea in almost all cases.

On Wed, Oct 26, 2016 at 6:23 PM, Manu Zhang <owenzhang1...@gmail.com> wrote:

> Hi all,
>
> I'm wondering why `Combine.perKey(SerializableFunction)` requires input and
> output to be of the same type while `Combine.PerKey` doesn't have this
> restriction.
>
> Thanks,
> Manu
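
The three functions listed in the reply can be sketched in plain Java, here for a mean, where the input, accumulator, and output types all differ. This is only an illustration under stated assumptions: the names MeanSketch, Accum, toAccumulator, merge, and extract are hypothetical and are not Beam APIs. Beam's Combine.CombineFn exposes a similar decomposition.

```java
import java.util.Arrays;

public class MeanSketch {
    // AccumulatorT: a running sum and count (hypothetical helper type).
    static final class Accum {
        final long sum;
        final long count;

        Accum(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // InputT -> AccumulatorT: wrap one input in the intermediate representation.
    static Accum toAccumulator(int input) {
        return new Accum(input, 1);
    }

    // Iterable<AccumulatorT> -> AccumulatorT: the associative combining step.
    static Accum merge(Iterable<Accum> accums) {
        long sum = 0;
        long count = 0;
        for (Accum a : accums) {
            sum += a.sum;
            count += a.count;
        }
        return new Accum(sum, count);
    }

    // AccumulatorT -> OutputT: extract the final result.
    static double extract(Accum a) {
        return a.count == 0 ? Double.NaN : (double) a.sum / a.count;
    }

    public static void main(String[] args) {
        // Pretend the values 1, 2 sit on one machine and 3, 4, 5 on another.
        Accum machineA = merge(Arrays.asList(toAccumulator(1), toAccumulator(2)));
        Accum machineB = merge(Arrays.asList(
            toAccumulator(3), toAccumulator(4), toAccumulator(5)));

        // Only the two small accumulators need to be brought together.
        double mean = extract(merge(Arrays.asList(machineA, machineB)));
        System.out.println(mean); // prints "3.0"
    }
}
```

With a single Iterable<V> -> V function, toAccumulator and extract would both be the identity, which forces all three types to coincide.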