On Thu, Feb 5, 2026 at 1:44 PM Joey Tran <[email protected]> wrote:

> Thanks for such quick feedback!
>
> On Thu, Feb 5, 2026 at 3:27 PM Danny McCormick via dev <
>> [email protected]> wrote:
>>
>>> Would you mind opening up the doc for comments?
>>>
>>> At a high level, I'm skeptical of the pattern; it seems to me like it
>>> moves the burden of choosing the correct behavior from authors to consumers
>>> in non-obvious ways which range from harmless to potentially causing silent
>>> data loss. I think if a user wants to drop a PCollection, that should
>>> always be an active choice since the risk of data loss is much greater than
>>> the EoU benefit of extra code.
>>>
>>> I think perhaps I chose my motivating examples poorly, but it was at
> least helpful in clarifying two distinct patterns:
>   - Filters/Samplers/Deduplicators
>   - Transforms that may run into issues with certain inputs
>

One use case that comes to mind is running some computation with a (safely
ignorable) side output that carries statistics about what was
computed/encountered.
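
To make the shape of that pattern concrete, here's a plain-Python sketch
(not Beam API; the function name and stats keys are made up for
illustration): a computation returns its "main" result plus a statistics
side output that callers are free to ignore.

```python
# Illustrative sketch only: a dedup step whose "side output" is a dict of
# statistics about what was encountered. Callers that only care about the
# deduplicated records can ignore the second return value entirely.
def dedupe_with_stats(records):
    seen, unique = set(), []
    for r in records:
        if r not in seen:
            seen.add(r)
            unique.append(r)
    # Safely ignorable side output: counts, not data.
    stats = {"input_count": len(records), "output_count": len(unique)}
    return unique, stats

unique, stats = dedupe_with_stats([1, 2, 2, 3])
# unique == [1, 2, 3]; stats == {"input_count": 4, "output_count": 3}
```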


>
>
>> I'd argue that a better pattern than having a single transform which
>>> handles this is to either have a *Filter *or a *Partition* transform
>>> which a user can use as needed. These are different transforms because they
>>> have different purposes/core functionalities.
>>>
>>> This can become unwieldy for a large library of filtering / sampling /
> data processing transforms. At Schrodinger, for example, we have perhaps a
> dozen transforms, some of which...
>   - are samplers where most consumers will just be interested in the
> "sample", while other consumers may be interested in both the sample and
> remaining
>   - are data processing transforms with a concept of processed outputs and
> "dropped for well-understood reason"
>

I prefer transforms that can be used in multiple ways over distinct
transforms that do the same thing.

FWIW, our existing DoFn already behaves somewhat like this: by default one
consumes only the "main" output, but the caller can ask for the side outputs
if desired. This proposal extends that a bit further in that

1. Composite operations are supported (though I guess technically nothing
is stopping someone from manually constructing a DoFnTuple), and
2. Side outputs are always returned and available for consumption, without
inconveniencing use of the main output, rather than having to explicitly
switch modes at application time.

This seems like a strict win to me.
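
To illustrate the semantics being described (names here are hypothetical,
not the design doc's actual API): a result object that chains as the "main"
output transparently while keeping side outputs reachable by attribute, so
no mode switch is needed at application time.

```python
# Hedged sketch of the proposed semantics in plain Python. "ForkedResult"
# and its attribute-based side-output access are invented for illustration.
class ForkedResult:
    def __init__(self, main, **side_outputs):
        self._main = main
        self._side = side_outputs

    def __iter__(self):
        # Chaining consumes the main output transparently.
        return iter(self._main)

    def __getattr__(self, name):
        # Side outputs stay available on demand, without having to
        # explicitly switch modes when the transform is applied.
        try:
            return self._side[name]
        except KeyError:
            raise AttributeError(name) from None

result = ForkedResult(["rec1", "rec2"], stats={"dropped": 0})
# list(result) yields the main output; result.stats is the side output.
```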

We'd likely need to double the size of our library in order to have both
> Filter and Partition versions of these transforms.
>
>
>> > A parser that routes malformed input to a dead-letter output
>>> > A validator that routes violations separately
>>> > An enrichment that routes lookup failures aside
>>>
>>> These are the ones I'm really worried about. In all of these cases, we
>>> are silently dropping error output in a way that might be non-obvious to a
>>> user. As a user, if I use a parser that returns a single output, I would
>>> assume that any parse failures would lead to exceptions.
>>>
>>> I agree that it'd be an antipattern for these types of transforms to
> silently capture and drop these erroneous records, but there is nothing
> preventing the author of a parser/validator/enrichment transform from doing
> this today even without ForkedPCollections. With ForkedPCollections, I
> think we can and still should discourage transform authors from silently
> handling errors without some active user configuration (e.g. by requiring
> a keyword arg like `failed_pcoll_tag="failed"` to enable any error
> capturing at all), e.g.
> ```
> parsed = pcoll | ParseData()
> # parsed.failed --> should not exist, ParseData should not automatically
> do this
>
> parsed = pcoll | ParseData(failed_pcoll_tag="failed")
> # parsed.failed --> exists now but only with user intent
> ```
>

+1. Errors should fail the pipeline unless one explicitly requests they be
passed somewhere else.


>
>> With all that said, I am aligned with the goal of making pipelines like
>>> this easier to chain. Maybe an in between option would be adding a DoFn
>>> utility like:
>>>
>>> ```
>>> pcoll | Partition(...).keep_tag('main') | ChainedParDo()
>>> ```
>>>
>>> Where `keep_tag` forces an expansion where all tags other than main are
>>> dropped. What do you think?
>>>
>>
One can already write

    pcoll | Partition(...)['main'] | ChainedParDo()

This would help, but this solution would be limited to ParDos. If you have a
> composite transform like a sophisticated `CompositeFilter` or
> `CompositeSampler`, then you wouldn't be able to use `.keep_tag`.
>
> Best,
> Joey
>
> Thanks,
>>> Danny
>>>
>>> On Thu, Feb 5, 2026 at 3:04 PM Joey Tran <[email protected]>
>>> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> My team and I have been running into an awkward pattern with the python
>>>> and YAML SDK when we have transforms that have one "main" output that we
>>>> want to be able to ergonomically chain, and other "side" outputs that are
>>>> useful in some situations. I put together a brief design proposal for a new
>>>> PCollection type to make this easier - would appreciate any feedback or
>>>> thoughts. Open to different names as well.
>>>>
>>>> ForkedPCollection Design Doc
>>>> <https://docs.google.com/document/d/10kx8hVrF8JfdeIS6X1vjiADk0IRMTVup9aX3u-GthOo/edit?tab=t.0>
>>>>
>>>> Thanks!
>>>> Joey
>>>>
>>>> --
>>>>
>>>> Joey Tran | Staff Developer | AutoDesigner TL
>>>>
>>>> *he/him*
>>>>
>>>> [image: Schrödinger, Inc.] <https://schrodinger.com/>
>>>>
>>>
