Hi all,
Regarding leveraging the ParDo part of Combine (Combine <=> GBK + ParDo)
to have multiple outputs, please note that most of the time runners
translate Combine into a native Combine of the target engine, not
into GBK + ParDo.
Regarding using a stateful DoFn, I agree with Kenn, with one small
caveat: stateful DoFn is not supported in streaming mode with the
Spark runner.
But I guess, Ismaël, that the use case is batch mode.
Best
Etienne
On 05/01/2021 15:00, Kenneth Knowles wrote:
Perhaps something based on stateful DoFn, so that there is a simple
decision point at which each element is either sampled or not and can
then be output to one PCollection or the other. Without doing a little
research, I don't recall if this is doable in the way you need.
Kenn
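To make the "one decision per element" idea concrete, here is a minimal sketch in plain Python, not Beam API. It uses selection sampling (Knuth's Algorithm S), which is a swapped-in technique and not something proposed in the thread. It assumes the total element count is known before the pass starts (for example from a prior count), which the thread does not state. The names `split_sample`, `total` and `n` are hypothetical. The counter `seen` and the size of `sample` play the role of the values a stateful DoFn would keep in state, and the two lists stand in for the two output PCollections.

```python
import random


def split_sample(elements, total, n, rng=None):
    """Route each element to 'sample' or 'rest' with one decision per element.

    Assumption: `total` is the exact number of elements and is known up front.
    """
    rng = rng or random.Random()
    sample, rest = [], []
    seen = 0  # would live in state in a stateful DoFn
    for x in elements:
        remaining = total - seen      # elements not yet decided, including x
        needed = n - len(sample)      # sample slots still to fill
        # Select x with probability needed / remaining (selection sampling).
        if rng.random() * remaining < needed:
            sample.append(x)
        else:
            rest.append(x)
        seen += 1
    return sample, rest


sample, rest = split_sample(range(100), total=100, n=10, rng=random.Random(0))
```

Since state in a stateful DoFn is scoped per key and window, a real pipeline would have to put all elements under a single key for this to work, which limits parallelism.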
On Wed, Dec 23, 2020 at 3:12 PM Ismaël Mejía <ieme...@gmail.com> wrote:
Thanks for the answer, Robert. Producing a combiner with two lists as
outputs was one idea I was considering too, but I was afraid of
OutOfMemory issues. I had not thought much about the consequences on
combining state, thanks for pointing that out. For the particular
sampling use case it might not be an issue, or am I missing something?
I am still curious whether, for sampling, there could be another
approach that produces the same result (uniform sample + the rest)
without the issues of combining.
On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw <rober...@google.com> wrote:
>
> There are two ways to emit multiple outputs: either to multiple
> distinct PCollections (e.g. withOutputTags) or multiple (including
> 0) outputs to a single PCollection (the difference between Map and
> FlatMap). In full generality, one can always have a CombineFn that
> outputs lists (say <tag, result>*) followed by a DoFn that emits
> to multiple places based on this result.
>
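The "CombineFn that outputs lists, followed by a DoFn" pattern can be sketched in plain Python as below. This is only a stand-in, not Beam API: `route_by_tag` and the `combined` value are hypothetical, and the dictionary keys stand in for separate output PCollections.

```python
from collections import defaultdict


def route_by_tag(tagged_results):
    """Stand-in for the DoFn: send each (tag, result) pair to its own output."""
    outputs = defaultdict(list)
    for tag, result in tagged_results:
        outputs[tag].append(result)
    return dict(outputs)


# Hypothetical output of a CombineFn that returns one list of <tag, result> pairs.
combined = [("sample", 7), ("rest", 1), ("rest", 4), ("sample", 2)]
outputs = route_by_tag(combined)
```

Note that the whole tagged list is a single combined value here, so it has to fit on one worker, which is where the memory concern comes from once most of the data is kept.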
> Another downside of emitting multiple values from a CombineFn is
> that CombineFns are used in other contexts as well, e.g. combining
> state, and trying to make sense of a multi-outputting CombineFn in
> that context is trickier.
>
> Note that for Sample in particular, it works as a CombineFn
> because we throw most of the data away. If we kept most of the
> data, it likely wouldn't fit into one machine to do the final
> sampling. The idea of using a side input to filter after the fact
> should work well (unless there are duplicate elements, in which case
> you'd have to uniquify them somehow to filter out only the "right"
> copies).
>
> - Robert
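The side-input filter with uniquified elements might look like the following plain-Python sketch. It is not Beam API: `sample_and_rest` is a hypothetical name, `enumerate` is a local stand-in for whatever step assigns unique ids in a distributed pipeline, `rng.sample` stands in for the Sample transform, and the `sampled_ids` set stands in for the side input.

```python
import random


def sample_and_rest(elements, n, rng=None):
    """Return (sample, rest), keeping duplicates apart via unique ids."""
    rng = rng or random.Random()
    # Pair every element with a unique id so duplicates stay distinguishable.
    indexed = list(enumerate(elements))
    # Stand-in for Sample: a uniform sample of up to n (id, value) pairs.
    sample = rng.sample(indexed, min(n, len(indexed)))
    # Stand-in for the side input: the small set of sampled ids.
    sampled_ids = {i for i, _ in sample}
    # Second pass over the full collection: keep what was not sampled.
    rest = [value for i, value in indexed if i not in sampled_ids]
    return [value for _, value in sample], rest


sample, rest = sample_and_rest(["a", "a", "b", "a", "c"], n=2, rng=random.Random(1))
```

Filtering on ids and not on values is what removes only the "right" copies: with three "a" elements, a sampled "a" removes exactly one of them from the rest.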
>
>
>
> On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>> I had a question today from one of our users about Beam’s Sample
>> transform (a Combine with an internal top-like function that produces
>> a uniform sample of size n of a PCollection). They also wanted to
>> obtain the rest of the PCollection as an output (the non-sampled
>> elements).
>>
>> My suggestion was to use the sample (since it was small) as a side
>> input and then reprocess the collection to filter its elements;
>> however, I wonder if this is the ‘best’ solution.
>>
>> I was also wondering: if Combine is essentially GbK + ParDo, why
>> don’t we have a Combine function with multiple outputs (maybe an
>> evolution of CombineWithContext)? I know this sounds weird, and I
>> have probably not thought enough about the issues or the performance
>> of the translation, but I wanted to see what others think: does this
>> make sense, and do you see any pros/cons or other ideas?
>>
>> Thanks,
>> Ismaël