Perhaps something based on a stateful DoFn, so that there is a single decision point at which each element is either sampled or not and can be output to one PCollection or the other. Without doing a little research, I don't recall whether this is doable in the way you need.
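A minimal sketch of such a decision point, in plain Python rather than Beam, assuming reservoir sampling is an acceptable way to pick the sample. The function name `split_sample`, the "sample"/"rest" tags, and the `rng` parameter are hypothetical. The list and counter stand in for per-key state, and the final loop stands in for an end-of-window timer. One caveat: an element is only known to be in the sample once all input has been seen, so the sketch emits rejected or evicted elements right away and holds the sample until the end.

```python
import random
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def split_sample(
    elements: Iterable[T], n: int, rng: random.Random
) -> Iterator[Tuple[str, T]]:
    """Yield ("rest", x) once x is known to be outside the sample,
    and ("sample", x) for the n survivors after all input is consumed."""
    reservoir: List[T] = []  # hypothetical stand-in for per-key state
    seen = 0                 # number of elements processed so far

    for x in elements:
        seen += 1
        if len(reservoir) < n:
            # The first n elements always enter the reservoir.
            reservoir.append(x)
            continue
        # Reservoir sampling: keep x with probability n / seen.
        j = rng.randrange(seen)
        if j < n:
            # x replaces a resident element; the evicted one is final "rest".
            evicted = reservoir[j]
            reservoir[j] = x
            yield ("rest", evicted)
        else:
            yield ("rest", x)

    # Stand-in for an end-of-window timer that flushes the state.
    for x in reservoir:
        yield ("sample", x)
```

Every input element comes out exactly once, tagged either "sample" or "rest". Since state in a stateful DoFn is scoped per key and window, a global sample along these lines would presumably need all elements under a single key, which limits parallelism.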
Kenn

On Wed, Dec 23, 2020 at 3:12 PM Ismaël Mejía <[email protected]> wrote:
> Thanks for the answer Robert. Producing a combiner with two lists as
> outputs was one idea I was considering too, but I was afraid of
> OutOfMemory issues. I had not thought much about the consequences for
> combining state, thanks for pointing that out. For the particular sampling
> use case it might not be an issue, or am I missing something?
>
> I am still curious whether for Sampling there could be another approach to
> achieve the same goal of producing the same result (uniform sample +
> the rest) but without the issues of combining.
>
> On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw <[email protected]>
> wrote:
> >
> > There are two ways to emit multiple outputs: either to multiple distinct
> > PCollections (e.g. withOutputTags) or multiple (including 0) outputs to a
> > single PCollection (the difference between Map and FlatMap). In full
> > generality, one can always have a CombineFn that outputs lists (say
> > <tag, result>*) followed by a DoFn that emits to multiple places based
> > on this result.
> >
> > One other con of emitting multiple values from a CombineFn is that
> > CombineFns are used in other contexts as well, e.g. combining state, and
> > trying to make sense of a multi-outputting CombineFn in that context is
> > trickier.
> >
> > Note that for Sample in particular, it works as a CombineFn because we
> > throw most of the data away. If we kept most of the data, it likely
> > wouldn't fit into one machine to do the final sampling. The idea of using
> > a side input to filter after the fact should work well (unless there are
> > duplicate elements, in which case you'd have to uniquify them somehow to
> > filter out only the "right" copies).
> >
> > - Robert
> >
> > On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía <[email protected]> wrote:
> >>
> >> I had a question today from one of our users about Beam’s Sample
> >> transform (a Combine with an internal top-like function to produce a
> >> uniform sample of size n of a PCollection). They also wanted to obtain
> >> the rest of the PCollection as an output (the non-sampled elements).
> >>
> >> My suggestion was to use the sample (since it was small) as a side
> >> input and then reprocess the collection to filter its elements;
> >> however, I wonder if this is the ‘best’ solution.
> >>
> >> I was also wondering, if Combine is essentially GbK + ParDo, why we
> >> don’t have a Combine function with multiple outputs (maybe an evolution
> >> of CombineWithContext). I know this sounds weird and I have probably not
> >> thought much about issues or the performance of the translation, but I
> >> wanted to see what others thought. Does this make sense? Do you see
> >> some pros/cons, or have other ideas?
> >>
> >> Thanks,
> >> Ismaël
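The side-input approach discussed in this thread, including the caveat about duplicate elements, can be sketched in plain Python as follows. This is not Beam code: `sample_and_rest` is a hypothetical helper, `rng.sample` stands in for the Sample step, the set of ids stands in for the small side input, and `enumerate` stands in for whatever mechanism would assign unique ids in a real pipeline (an assumption, since the thread does not say how to uniquify).

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_and_rest(
    elements: Sequence[T], n: int, rng: random.Random
) -> Tuple[List[T], List[T]]:
    """Return (sample, rest), where sample is a uniform sample of size n."""
    # Pair each element with a unique id so equal values stay distinguishable.
    tagged = list(enumerate(elements))

    # Stand-in for the Sample step: a uniform sample of the tagged elements.
    sampled = rng.sample(tagged, min(n, len(tagged)))

    # Stand-in for the side input: small, because the sample is small.
    sampled_ids = {i for i, _ in sampled}

    # Second pass over the full collection, filtering by id rather than value,
    # so that only the "right" copies of duplicated values are separated out.
    sample = [x for i, x in tagged if i in sampled_ids]
    rest = [x for i, x in tagged if i not in sampled_ids]
    return sample, rest


data = ["a", "a", "b", "a", "c"]
sample, rest = sample_and_rest(data, 2, random.Random(1))
```

Filtering by value instead of by id would drop every copy of "a" from the rest whenever a single "a" was sampled, which is the duplicate problem mentioned above.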
