Re: Combine with multiple outputs case Sample and the rest

Etienne Chauchot Fri, 15 Jan 2021 01:34:54 -0800

Hi all,

Regarding leveraging the ParDo part of Combine (Combine <=> GBK + ParDo) to have multiple outputs, please note that most of the time Combine is translated by the runners into a native (destination-tech) Combine and not into a GBK + ParDo.

Regarding using the stateful DoFn, I agree with Kenn, with the small exception that stateful DoFn is not supported in streaming mode with the Spark runner.


But I guess, Ismaël, that the use case is batch mode.

Best

Etienne

On 05/01/2021 15:00, Kenneth Knowles wrote:

Perhaps something based on stateful DoFn, so there is a simple decision point at which each element is either sampled or not, so it can be output to one PCollection or the other. Without doing a little research, I don't recall if this is doable in the way you need.
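
One way to picture such a per-element decision point is selection sampling (Knuth's Algorithm S), sketched below in plain Python rather than with the Beam API. It assumes the total element count is already known (for example from an earlier counting pass), which is an assumption not stated in this thread, and the function name is hypothetical. Since Beam state is scoped per key and window, a stateful DoFn version would presumably need all elements on a single key.

```python
import random


def split_sample_and_rest(elements, sample_size, total_count, rng=None):
    """Route each element to 'sampled' or 'rest' with one final decision each.

    Assumes total_count is the exact number of elements and
    sample_size <= total_count (both are assumptions of this sketch).
    """
    rng = rng or random.Random()
    sampled, rest = [], []
    seen = 0  # running state: how many elements have been processed so far
    for element in elements:
        still_needed = sample_size - len(sampled)
        still_remaining = total_count - seen
        # Select with probability still_needed / still_remaining.
        if rng.random() * still_remaining < still_needed:
            sampled.append(element)
        else:
            rest.append(element)
        seen += 1
    return sampled, rest
```

The only state carried between elements is two counters, so memory stays constant regardless of how many elements end up in the rest.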


Kenn

On Wed, Dec 23, 2020 at 3:12 PM Ismaël Mejía wrote:


    Thanks for the answer, Robert. Producing a combiner with two lists as
    outputs was one idea I was considering too, but I was afraid of
    OutOfMemory issues. I had not thought much about the consequences for
    combining state; thanks for pointing that out. For the particular sampling
    use case it might not be an issue, or am I missing something?

    I am still curious whether, for sampling, there could be another approach
    that produces the same result (uniform sample + the rest) but without
    the issues of combining.

    On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw wrote:
    >
    > There are two ways to emit multiple outputs: either to multiple
    > distinct PCollections (e.g. withOutputTags) or multiple (including
    > 0) outputs to a single PCollection (the difference between Map and
    > FlatMap). In full generality, one can always have a CombineFn that
    > outputs lists (say <tag, result>*) followed by a DoFn that emits
    > to multiple places based on this result.
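
The "CombineFn that outputs a tagged list, followed by a routing DoFn" shape can be sketched in plain Python as follows. This is not Beam code: the two functions are hypothetical stand-ins for the combiner and the multi-output DoFn, and the tag names "sample" and "rest" are chosen only for illustration.

```python
import random
from collections import defaultdict


def combine_to_tagged_list(values, sample_size, rng=None):
    """Stand-in for a CombineFn whose single output is a list of (tag, value)."""
    rng = rng or random.Random()
    values = list(values)  # the whole input is held in memory here
    count = min(sample_size, len(values))
    chosen = set(rng.sample(range(len(values)), count))
    return [("sample" if i in chosen else "rest", v)
            for i, v in enumerate(values)]


def route_by_tag(tagged):
    """Stand-in for the DoFn that emits each value to the output named by its tag."""
    outputs = defaultdict(list)
    for tag, value in tagged:
        outputs[tag].append(value)
    return outputs
```

Note that the combined result contains every input element, so this sketch only makes sense when the whole collection fits in memory on one worker.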
    >
    > One other con of emitting multiple values from a CombineFn is
    > that CombineFns are used in other contexts as well, e.g. combining
    > state, and trying to make sense of a multi-outputting CombineFn in
    > that context is trickier.
    >
    > Note that for Sample in particular, it works as a CombineFn
    > because we throw most of the data away. If we kept most of the
    > data, it likely wouldn't fit into one machine to do the final
    > sampling. The idea of using a side input to filter after the fact
    > should work well (unless there are duplicate elements, in which case
    > you'd have to uniquify them somehow to filter out only the "right"
    > copies).
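
A minimal plain-Python sketch of the "filter after the fact" idea with duplicates handled: each element is paired with a unique id before sampling, and the set of sampled ids plays the role of the side input. Using `enumerate` for the ids is an assumption that only works in a single process; a distributed pipeline would need some other id scheme. The function name is hypothetical.

```python
import random


def sample_and_rest_via_filter(elements, sample_size, rng=None):
    """Return (sample, rest), treating equal elements as distinct copies."""
    rng = rng or random.Random()
    # Pair each element with a unique id so duplicates can be told apart.
    keyed = list(enumerate(elements))
    sample = rng.sample(keyed, min(sample_size, len(keyed)))
    # The small set of sampled ids stands in for the side input.
    sampled_ids = {uid for uid, _ in sample}
    # Second pass over the collection: keep everything not in the sample.
    rest = [value for uid, value in keyed if uid not in sampled_ids]
    return [value for _, value in sample], rest
```

Filtering on ids rather than values means that sampling one copy of a duplicated element removes exactly that copy from the rest, not all of them.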
    >
    > - Robert
    >
    >
    >
    > On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía wrote:
    >>
    >> I had a question today from one of our users about Beam’s Sample
    >> transform (a Combine with an internal top-like function to produce a
    >> uniform sample of size n of a PCollection). They also wanted to obtain
    >> the rest of the PCollection as an output (the non-sampled elements).
    >>
    >> My suggestion was to use the sample (since it was small) as a side
    >> input and then reprocess the collection to filter its elements;
    >> however, I wonder if this is the ‘best’ solution.
    >>
    >> I was also wondering: if Combine is essentially GBK + ParDo, why don’t
    >> we have a Combine function with multiple outputs (maybe an evolution of
    >> CombineWithContext)? I know this sounds weird, and I have probably not
    >> thought much about the issues or the performance of the translation, but
    >> I wanted to see what others thought. Does this make sense? Do you see
    >> any pros/cons, or have other ideas?
    >>
    >> Thanks,
    >> Ismaël
