Hi all,

Regarding leveraging the ParDo part of Combine (Combine <=> GBK + ParDo) to have multiple outputs, please note that most of the time the runners translate Combine to a native Combine of the target engine and not to a GBK + ParDo.

Regarding using the stateful DoFn, I agree with Kenn, with the small exception that stateful DoFn is not supported in streaming mode with the Spark runner.

But I guess, Ismaël, that the use case is batch mode.

Best

Etienne

On 05/01/2021 15:00, Kenneth Knowles wrote:
Perhaps something based on a stateful DoFn, so that there is a simple decision point at which each element is either sampled or not and can then be output to one PCollection or the other. Without doing a little research, I don't recall if this is doable in the way you need.

Kenn
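
One possible reading of the "decision point" idea above is selection sampling: keep two counters (elements seen, elements already sampled) and decide for each element on arrival whether it goes to the sample or to the rest. The sketch below is plain Python, not Beam API, and it is not necessarily what Kenn had in mind. It assumes the total element count is known up front (for example from an earlier counting pass); the function name `split_sample` and its parameters are hypothetical.

```python
import random


def split_sample(elements, total, n, rng=None):
    """Route each element to 'sample' or 'rest' in a single pass.

    Selection sampling: an element is kept with probability
    (items still needed) / (items still remaining).
    ASSUMPTION: `total`, the number of elements, is known in advance.
    """
    rng = rng or random.Random()
    sample, rest = [], []
    seen = 0  # counter 1: how many elements have been examined so far
    for element in elements:
        needed = n - len(sample)   # counter 2 is len(sample)
        remaining = total - seen
        # The decision point: each element goes to exactly one output.
        if rng.random() * remaining < needed:
            sample.append(element)
        else:
            rest.append(element)
        seen += 1
    return sample, rest


if __name__ == "__main__":
    sampled, others = split_sample(range(100), total=100, n=10)
    print(len(sampled), len(others))
```

In a stateful DoFn the two counters would live in per-key state, which means the decision is made sequentially per key; whether that is acceptable for a large PCollection is exactly the kind of thing that would need the research mentioned above.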

On Wed, Dec 23, 2020 at 3:12 PM Ismaël Mejía <ieme...@gmail.com> wrote:

    Thanks for the answer Robert. Producing a combiner with two lists as
    outputs was one idea I was considering too, but I was afraid of
    OutOfMemory issues. I had not thought much about the consequences on
    combining state, thanks for pointing that out. For the particular
    sampling use case it might not be an issue, or am I missing something?

    I am still curious if for Sampling there could be another approach to
    achieve the same goal of producing the same result (uniform sample +
    the rest) but without the issues of combining.

    On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw
    <rober...@google.com> wrote:
    >
    > There are two ways to emit multiple outputs: either to multiple
    > distinct PCollections (e.g. withOutputTags) or multiple (including
    > 0) outputs to a single PCollection (the difference between Map and
    > FlatMap). In full generality, one can always have a CombineFn that
    > outputs lists (say <tag, result>*) followed by a DoFn that emits
    > to multiple places based on this result.
    >
    > One other con of emitting multiple values from a CombineFn is
    > that CombineFns are used in other contexts as well, e.g. combining
    > state, and trying to make sense of a multi-outputting CombineFn in
    > that context is trickier.
    >
    > Note that for Sample in particular, it works as a CombineFn
    > because we throw most of the data away. If we kept most of the
    > data, it likely wouldn't fit into one machine to do the final
    > sampling. The idea of using a side input to filter after the fact
    > should work well (unless there are duplicate elements, in which case
    > you'd have to uniquify them somehow to filter out only the "right"
    > copies).
    >
    > - Robert
    >
    >
    >
    > On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía <ieme...@gmail.com> wrote:
    >>
    >> I had a question today from one of our users about Beam’s Sample
    >> transform (a Combine with an internal top-like function to produce a
    >> uniform sample of size n of a PCollection). They also wanted to
    >> obtain the rest of the PCollection as an output (the non-sampled
    >> elements).
    >>
    >> My suggestion was to use the sample (since it was small) as a side
    >> input and then reprocess the collection to filter its elements;
    >> however, I wonder if this is the ‘best’ solution.
    >>
    >> I was also wondering: if Combine is essentially GbK + ParDo, why
    >> don’t we have a Combine function with multiple outputs (maybe an
    >> evolution of CombineWithContext)? I know this sounds weird and I
    >> have probably not thought enough about the issues or the performance
    >> of the translation, but I wanted to see what others think. Does this
    >> make sense? Do you see some pros/cons or have other ideas?
    >>
    >> Thanks,
    >> Ismaël
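
The side-input suggestion from the original message, combined with Robert's point about duplicates, can be sketched in a few lines. This is plain Python that only shows the logic, not Beam API: `random.sample` stands in for the Sample transform, the set of sampled ids stands in for the side input, and `enumerate` stands in for whatever unique-id scheme a distributed pipeline would need (an assumption, since a global index is not readily available there). The function name `sample_and_rest` is hypothetical.

```python
import random


def sample_and_rest(elements, n, rng=None):
    """Return (uniform sample of size n, all remaining elements)."""
    rng = rng or random.Random()
    # Step 1: tag every element with a unique id so that equal values
    # can still be told apart when filtering.
    tagged = list(enumerate(elements))
    # Step 2: take a uniform sample (stand-in for the Sample transform).
    sampled = rng.sample(tagged, min(n, len(tagged)))
    # Step 3: the small set of sampled ids plays the role of the side input.
    sampled_ids = {uid for uid, _ in sampled}
    # Step 4: reprocess the full collection and keep what was not sampled.
    rest = [value for uid, value in tagged if uid not in sampled_ids]
    return [value for _, value in sampled], rest


if __name__ == "__main__":
    picked, remaining = sample_and_rest(["a", "a", "b", "b", "b", "c"], 2)
    print(picked, remaining)
```

Filtering on the id instead of the value is what removes only the "right" copies: with values alone, sampling one "a" would drop both copies of "a" from the rest.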
