Thanks for the answer, Robert. Producing a combiner with two lists as outputs was one idea I was considering too, but I was afraid of OutOfMemory issues. I had not thought much about the consequences for combining state; thanks for pointing that out. For the particular sampling use case it might not be an issue, or am I missing something?
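
To make the memory concern concrete, here is a minimal sketch in plain Python of what such a two-list combiner could look like. It only mimics the four-method shape of a CombineFn and does not use the Beam API; the class name `SampleAndRestCombiner` and the "sample"/"rest" tags are hypothetical. The sample is chosen by giving every element a random key and keeping the n smallest keys, which I assume is close in spirit to the top-like function Sample uses internally.

```python
import random


class SampleAndRestCombiner:
    """Plain-Python mimic of a CombineFn's methods (hypothetical, not Beam API)."""

    def __init__(self, n, seed=None):
        self.n = n
        self.rng = random.Random(seed)

    def create_accumulator(self):
        # sample: list of (random_key, element); rest: list of evicted elements
        return [], []

    def add_input(self, acc, element):
        sample, rest = acc
        sample.append((self.rng.random(), element))
        return self._trim(sample, rest)

    def merge_accumulators(self, accs):
        sample, rest = [], []
        for s, r in accs:
            sample.extend(s)
            rest.extend(r)
        return self._trim(sample, rest)

    def _trim(self, sample, rest):
        # Keep the n entries with the smallest random keys; evict the others.
        sample.sort(key=lambda kv: kv[0])
        rest.extend(elem for _, elem in sample[self.n:])
        return sample[:self.n], rest

    def extract_output(self, acc):
        sample, rest = acc
        # A list of (tag, result) pairs that a later step could route apart.
        return [("sample", [e for _, e in sample]), ("rest", rest)]


combiner = SampleAndRestCombiner(n=3, seed=0)
acc_a = combiner.create_accumulator()
for x in range(0, 50):
    acc_a = combiner.add_input(acc_a, x)
acc_b = combiner.create_accumulator()
for x in range(50, 100):
    acc_b = combiner.add_input(acc_b, x)
merged = combiner.merge_accumulators([acc_a, acc_b])
output = dict(combiner.extract_output(merged))
```

The "rest" list in the accumulator ends up holding N - n elements, so the accumulator grows with the input instead of staying at size n. That is the OutOfMemory risk I had in mind.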
I am still curious whether, for Sample, there could be another approach that produces the same result (uniform sample + the rest) without the issues of combining.

On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw <[email protected]> wrote:
>
> There are two ways to emit multiple outputs: either to multiple distinct
> PCollections (e.g. withOutputTags) or multiple (including 0) outputs to a
> single PCollection (the difference between Map and FlatMap). In full
> generality, one can always have a CombineFn that outputs lists (say <tag,
> result>*) followed by a DoFn that emits to multiple places based on this
> result.
>
> One other cons of emitting multiple values from a CombineFn is that they are
> used in other contexts as well, e.g. combining state, and trying to make
> sense of a multi-outputting CombineFn in that context is trickier.
>
> Note that for Sample in particular, it works as a CombineFn because we throw
> most of the data away. If we kept most of the data, it likely wouldn't fit
> into one machine to do the final sampling. The idea of using a side input to
> filter after the fact should work well (unless there's duplicate elements, in
> which case you'd have to uniquify them somehow to filter out only the "right"
> copies).
>
> - Robert
>
>
> On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía <[email protected]> wrote:
>>
>> I had a question today from one of our users about Beam’s Sample
>> transform (a Combine with an internal top-like function to produce a
>> uniform sample of size n of a PCollection). They wanted to obtain also
>> the rest of the PCollection as an output (the non sampled elements).
>>
>> My suggestion was to use the sample (since it was little) as a side
>> input and then reprocess the collection to filter its elements,
>> however I wonder if this is the ‘best’ solution.
>>
>> I was thinking also if Combine is essentially GbK + ParDo why we don’t
>> have a Combine function with multiple outputs (maybe an evolution of
>> CombineWithContext). I know this sounds weird and I have probably not
>> thought much about issues or the performance of the translation but I
>> wanted to see what others thought, does this make sense, do you see
>> some pros/cons or other ideas.
>>
>> Thanks,
>> Ismaël

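For reference, the side-input idea from the thread above, including the uniquifying step needed for duplicate elements, can be sketched in plain Python as follows. This is only an illustration of the logic, not Beam code: the function name `split_sample_and_rest` is hypothetical, and using the list position as the unique id is an assumption that only works locally. In a distributed pipeline the id would have to come from somewhere else.

```python
import random


def split_sample_and_rest(elements, n, seed=None):
    """Return (sample, rest); duplicates are told apart by their position."""
    rng = random.Random(seed)
    # Step 1: uniquify by pairing each element with an id (here, its index).
    indexed = list(enumerate(elements))
    # Step 2: draw a uniform sample of the (id, element) pairs.
    sample = rng.sample(indexed, min(n, len(indexed)))
    # Step 3: the small set of sampled ids plays the role of the side input.
    sampled_ids = {idx for idx, _ in sample}
    # Step 4: a second pass keeps every element whose id was not sampled.
    rest = [elem for idx, elem in indexed if idx not in sampled_ids]
    return [elem for _, elem in sample], rest


data = ["a", "a", "b", "a", "c"]
sample, rest = split_sample_and_rest(data, n=2, seed=1)
```

Because the filter compares ids and not values, sampling one "a" removes exactly one copy from the rest, which is the "right copies" behaviour mentioned above.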