Thanks for the answer Robert. Producing a combiner with two lists as
outputs was one idea I was considering too but I was afraid of
OutOfMemory issues. I had not thought much about the consequences on
combining state, thanks for pointing that. For the particular sampling
use case it might be not an issue, or am I missing something?

I am still curious if for Sampling there could be another approach to
achieve the same goal of producing the same result (uniform sample +
the rest) but without the issues of combining.

On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw <[email protected]> wrote:
>
> There are two ways to emit multiple outputs: either to multiple distinct 
> PCollections (e.g. withOutputTags) or multiple (including 0) outputs to a 
> single PCollection (the difference between Map and FlatMap). In full 
> generality, one can always have a CombineFn that outputs lists (say <tag, 
> result>*) followed by a DoFn that emits to multiple places based on this 
> result.
>
> One other cons of emitting multiple values from a CombineFn is that they are 
> used in other contexts as well, e.g. combining state, and trying to make 
> sense of a multi-outputting CombineFn in that context is trickier.
>
> Note that for Sample in particular, it works as a CombineFn because we throw 
> most of the data away. If we kept most of the data, it likely wouldn't fit 
> into one machine to do the final sampling. The idea of using a side input to 
> filter after the fact should work well (unless there's duplicate elements, in 
> which case you'd have to uniquify them somehow to filter out only the "right" 
> copies).
>
> - Robert
>
>
>
> On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía <[email protected]> wrote:
>>
>> I had a question today from one of our users about Beam’s Sample
>> transform (a Combine with an internal top-like function to produce a
>> uniform sample of size n of a PCollection). They wanted to obtain also
>> the rest of the PCollection as an output (the non sampled elements).
>>
>> My suggestion was to use the sample (since it was little) as a side
>> input and then reprocess the collection to filter its elements,
>> however I wonder if this is the ‘best’ solution.
>>
>> I was thinking also if Combine is essentially GbK + ParDo why we don’t
>> have a Combine function with multiple outputs (maybe an evolution of
>> CombineWithContext). I know this sounds weird and I have probably not
>> thought much about issues or the performance of the translation but I
>> wanted to see what others thought, does this make sense, do you see
>> some pros/cons or other ideas.
>>
>> Thanks,
>> Ismaël

Reply via email to