Perhaps something based on stateful DoFn so there is a simple decision
point at which each element is either sampled or not so it can be output to
one PCollection or the other. Without doing a little research, I don't
recall if this is doable in the way you need.

Kenn

On Wed, Dec 23, 2020 at 3:12 PM Ismaël Mejía <[email protected]> wrote:

> Thanks for the answer Robert. Producing a combiner with two lists as
> outputs was one idea I was considering too but I was afraid of
> OutOfMemory issues. I had not thought much about the consequences on
> combining state, thanks for pointing that. For the particular sampling
> use case it might be not an issue, or am I missing something?
>
> I am still curious if for Sampling there could be another approach to
> achieve the same goal of producing the same result (uniform sample +
> the rest) but without the issues of combining.
>
> On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw <[email protected]>
> wrote:
> >
> > There are two ways to emit multiple outputs: either to multiple distinct
> PCollections (e.g. withOutputTags) or multiple (including 0) outputs to a
> single PCollection (the difference between Map and FlatMap). In full
> generality, one can always have a CombineFn that outputs lists (say <tag,
> result>*) followed by a DoFn that emits to multiple places based on this
> result.
> >
> > One other cons of emitting multiple values from a CombineFn is that they
> are used in other contexts as well, e.g. combining state, and trying to
> make sense of a multi-outputting CombineFn in that context is trickier.
> >
> > Note that for Sample in particular, it works as a CombineFn because we
> throw most of the data away. If we kept most of the data, it likely
> wouldn't fit into one machine to do the final sampling. The idea of using a
> side input to filter after the fact should work well (unless there's
> duplicate elements, in which case you'd have to uniquify them somehow to
> filter out only the "right" copies).
> >
> > - Robert
> >
> >
> >
> > On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía <[email protected]> wrote:
> >>
> >> I had a question today from one of our users about Beam’s Sample
> >> transform (a Combine with an internal top-like function to produce a
> >> uniform sample of size n of a PCollection). They wanted to obtain also
> >> the rest of the PCollection as an output (the non sampled elements).
> >>
> >> My suggestion was to use the sample (since it was little) as a side
> >> input and then reprocess the collection to filter its elements,
> >> however I wonder if this is the ‘best’ solution.
> >>
> >> I was thinking also if Combine is essentially GbK + ParDo why we don’t
> >> have a Combine function with multiple outputs (maybe an evolution of
> >> CombineWithContext). I know this sounds weird and I have probably not
> >> thought much about issues or the performance of the translation but I
> >> wanted to see what others thought, does this make sense, do you see
> >> some pros/cons or other ideas.
> >>
> >> Thanks,
> >> Ismaël
>

Reply via email to