There are two ways to emit multiple outputs: either to multiple distinct
PCollections (e.g. withOutputTags) or as multiple (including 0) elements
in a single PCollection (the difference between Map and FlatMap). In full
generality, one can always have a CombineFn that outputs a list (say
<tag, result>*), followed by a DoFn that emits to multiple places based
on the tags in that list.
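To make the "CombineFn that outputs a list, then a DoFn that routes"
idea concrete, here is a minimal sketch in plain Python. It does not
import Beam: MinMaxCombineFn only mirrors the four CombineFn methods
(create_accumulator, add_input, merge_accumulators, extract_output), and
route() and run() are hypothetical helpers standing in for the
downstream multi-output DoFn and for the runner. The min/max use case is
also just an assumption chosen for illustration.

```python
from collections import defaultdict


class MinMaxCombineFn:
    """Plain-Python stand-in shaped like a Beam CombineFn (assumption:
    not a real beam.CombineFn subclass, just the same four methods)."""

    def create_accumulator(self):
        # (smallest value seen so far, largest value seen so far)
        return (None, None)

    def add_input(self, acc, value):
        lo, hi = acc
        lo = value if lo is None else min(lo, value)
        hi = value if hi is None else max(hi, value)
        return (lo, hi)

    def merge_accumulators(self, accs):
        merged = self.create_accumulator()
        for lo, hi in accs:
            if lo is not None:  # skip accumulators that saw no input
                merged = self.add_input(merged, lo)
                merged = self.add_input(merged, hi)
        return merged

    def extract_output(self, acc):
        # Single output value: a list of (tag, result) pairs.
        lo, hi = acc
        if lo is None:
            return []
        return [("min", lo), ("max", hi)]


def route(tagged_results):
    """Hypothetical stand-in for the DoFn that fans the list out by tag."""
    outputs = defaultdict(list)
    for tag, result in tagged_results:
        outputs[tag].append(result)
    return dict(outputs)


def run(bundles):
    """Hypothetical driver: one accumulator per bundle, then merge."""
    fn = MinMaxCombineFn()
    accs = []
    for bundle in bundles:
        acc = fn.create_accumulator()
        for value in bundle:
            acc = fn.add_input(acc, value)
        accs.append(acc)
    return route(fn.extract_output(fn.merge_accumulators(accs)))
```

In an actual pipeline the routing step would presumably be a ParDo with
tagged outputs rather than a dict, but the split of responsibilities is
the same: the CombineFn still has one output, and only the DoFn knows
about multiple destinations.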

One other con of emitting multiple values from a CombineFn is that
CombineFns are used in other contexts as well, e.g. combining state, and
making sense of a multi-output CombineFn in those contexts is trickier.

Note that Sample in particular works as a CombineFn because we throw
most of the data away. If we kept most of the data, it likely wouldn't
fit on one machine for the final sampling. The idea of using a side
input to filter after the fact should work well (unless there are
duplicate elements, in which case you'd have to uniquify them somehow
to filter out only the "right" copies).
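One way to picture the "uniquify, then filter" approach is the
single-process sketch below. It is plain Python rather than a Beam
pipeline, and split_sample is a hypothetical helper: the random UUID
attached to each element is an assumed way of telling duplicates apart,
and the set of sampled ids plays the role of the small side input.

```python
import random
import uuid


def split_sample(elements, n, seed=None):
    """Return (sample, rest) so that duplicates are split correctly.

    Assumption: tagging each element with a random UUID is an acceptable
    way to make copies of the same value distinguishable.
    """
    rng = random.Random(seed)
    # Pair every element with its own id, so equal values stay distinct.
    keyed = [(uuid.uuid4().hex, element) for element in elements]
    # Uniform sample without replacement over the (id, element) pairs.
    sample = rng.sample(keyed, min(n, len(keyed)))
    # The small set that would be passed around as a side input.
    sampled_ids = {uid for uid, _ in sample}
    # Second pass over the data: keep only what was not sampled.
    rest = [element for uid, element in keyed if uid not in sampled_ids]
    return [element for _, element in sample], rest
```

Filtering by id rather than by value is what keeps the counts right: if
the input has three copies of "a" and one is sampled, exactly two remain
in the rest.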

- Robert



On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía <ieme...@gmail.com> wrote:

> I had a question today from one of our users about Beam’s Sample
> transform (a Combine with an internal top-like function to produce a
> uniform sample of size n of a PCollection). They also wanted to obtain
> the rest of the PCollection as an output (the non-sampled elements).
>
> My suggestion was to use the sample (since it was small) as a side
> input and then reprocess the collection to filter its elements.
> However, I wonder if this is the ‘best’ solution.
>
> I was also wondering: if Combine is essentially GbK + ParDo, why don’t
> we have a Combine function with multiple outputs (maybe an evolution of
> CombineWithContext)? I know this sounds weird, and I have probably not
> thought much about the issues or the performance of the translation,
> but I wanted to see what others think. Does this make sense? Do you
> see any pros/cons, or have other ideas?
>
> Thanks,
> Ismaël
>
