There are two ways to emit multiple outputs: either to multiple distinct PCollections (e.g. withOutputTags) or multiple (including 0) outputs to a single PCollection (the difference between Map and FlatMap). In full generality, one can always have a CombineFn that outputs lists (say <tag, result>*) followed by a DoFn that emits to multiple places based on this result.
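To make the "CombineFn that outputs lists, followed by a DoFn" pattern concrete, here is a minimal plain-Python sketch. It deliberately does not use the Beam API: `MinMaxCombineFn`, `run_combine` and `route_by_tag` are hypothetical names that only mimic the shape of a CombineFn and a multi-output DoFn. In a real pipeline the routing step would presumably be a ParDo with tagged outputs.

```python
from collections import defaultdict


class MinMaxCombineFn:
    """Stand-in for a CombineFn whose single output is a list of (tag, result) pairs."""

    def create_accumulator(self):
        return (None, None)  # (smallest seen, largest seen)

    def add_input(self, acc, x):
        lo, hi = acc
        return (x if lo is None else min(lo, x),
                x if hi is None else max(hi, x))

    def merge_accumulators(self, accs):
        merged = self.create_accumulator()
        for lo, hi in accs:
            for v in (lo, hi):
                if v is not None:  # skip accumulators of empty bundles
                    merged = self.add_input(merged, v)
        return merged

    def extract_output(self, acc):
        lo, hi = acc
        # One output value, but it carries several tagged results.
        return [("min", lo), ("max", hi)]


def run_combine(fn, bundles):
    """Hypothetical driver: combine each bundle separately, then merge."""
    accs = []
    for bundle in bundles:
        acc = fn.create_accumulator()
        for x in bundle:
            acc = fn.add_input(acc, x)
        accs.append(acc)
    return fn.extract_output(fn.merge_accumulators(accs))


def route_by_tag(tagged_results):
    """Stand-in for the follow-up DoFn: send each result to the output named by its tag."""
    outputs = defaultdict(list)
    for tag, result in tagged_results:
        outputs[tag].append(result)
    return dict(outputs)


tagged = run_combine(MinMaxCombineFn(), [[3, 1, 4], [1, 5, 9], [2, 6]])
outputs = route_by_tag(tagged)
```

The combiner itself stays single-output, so it can still be reused wherever a regular CombineFn is expected. Only the routing step knows about multiple destinations.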

One other con of emitting multiple values from a CombineFn is that CombineFns are used in other contexts as well, e.g. combining state, and trying to make sense of a multi-outputting CombineFn in that context is trickier.

Note that for Sample in particular, it works as a CombineFn because we throw most of the data away. If we kept most of the data, it likely wouldn't fit on one machine to do the final sampling. The idea of using a side input to filter after the fact should work well (unless there are duplicate elements, in which case you'd have to uniquify them somehow to filter out only the "right" copies).

- Robert

On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> I had a question today from one of our users about Beam’s Sample
> transform (a Combine with an internal top-like function to produce a
> uniform sample of size n of a PCollection). They wanted to obtain also
> the rest of the PCollection as an output (the non sampled elements).
>
> My suggestion was to use the sample (since it was little) as a side
> input and then reprocess the collection to filter its elements,
> however I wonder if this is the ‘best’ solution.
>
> I was thinking also if Combine is essentially GbK + ParDo why we don’t
> have a Combine function with multiple outputs (maybe an evolution of
> CombineWithContext). I know this sounds weird and I have probably not
> thought much about issues or the performance of the translation but I
> wanted to see what others thought, does this make sense, do you see
> some pros/cons or other ideas.
>
> Thanks,
> Ismaël
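
On the duplicate-elements caveat for the side-input filter discussed in this thread: a minimal single-process sketch of the counting logic, in plain Python, not Beam code. The name `remove_sampled_copies` is hypothetical. It assumes the sample is small enough to hold in memory, and it relies on one mutable counter, so in a parallel pipeline you would likely need to group equal elements together first or attach unique ids before filtering.

```python
from collections import Counter


def remove_sampled_copies(elements, sample):
    """Return the elements not accounted for by the sample.

    For each value, exactly as many copies are dropped as appear in the
    sample; any further duplicates are kept in the result.
    """
    remaining = Counter(sample)  # copies of each value still owed to the sample
    rest = []
    for x in elements:
        if remaining[x] > 0:
            remaining[x] -= 1  # this copy is the sampled one, so drop it
        else:
            rest.append(x)
    return rest


elements = ["a", "b", "a", "c", "a"]
sample = ["a", "c"]
rest = remove_sampled_copies(elements, sample)
```

A plain membership test against the sample would instead drop every copy of "a", which is the problem the "right copies" remark points at.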