Re: [DISCUSS] Proposal: ForkedPCollections in Python SDK

Danny McCormick via dev Thu, 12 Feb 2026 14:52:34 -0800

My preference would be enabling `pcoll | Partition(...)['main'] |
ChainedParDo()`, but I think I'm currently the only one with significant
objections. I tried to leave time for someone to join my dissent :)


Given that, I'm ok with proceeding with roughly the original proposal
(factoring in the conversation in the doc); my only request would be that we
document the transform in a way that clearly discourages putting
error/exception outputs in the secondary PCollection, and makes it clear
that this is intended primarily for transforms whose main PCollection is
all that most use cases need.
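
To make the chained spelling concrete, here is a minimal toy sketch in plain
Python (not Beam code). `Transform`, `_SelectOutput`, `Map`, and this
simplified `Partition` are hypothetical stand-ins that only mimic the `|` and
`[...]` syntax on lists; Beam's real `Partition` has a different signature,
and `__getitem__` on PTransform is only a suggestion in this thread, not an
existing API.

```python
class Transform:
    """Toy stand-in for a PTransform, applied as `data | transform`."""

    def expand(self, data):
        raise NotImplementedError

    def __ror__(self, data):
        # Called for `data | transform` when `data` does not define `|`.
        return self.expand(data)

    def __getitem__(self, tag):
        # `transform['main']` wraps the transform so only one named
        # output is passed downstream, which keeps the chain unbroken.
        return _SelectOutput(self, tag)


class _SelectOutput(Transform):
    """Applies the wrapped transform and keeps a single tagged output."""

    def __init__(self, inner, tag):
        self.inner = inner
        self.tag = tag

    def expand(self, data):
        return self.inner.expand(data)[self.tag]


class Partition(Transform):
    """Hypothetical two-way split: 'main' if the predicate holds, else 'rest'."""

    def __init__(self, predicate):
        self.predicate = predicate

    def expand(self, data):
        outputs = {"main": [], "rest": []}
        for item in data:
            outputs["main" if self.predicate(item) else "rest"].append(item)
        return outputs


class Map(Transform):
    """Applies a function to every element."""

    def __init__(self, fn):
        self.fn = fn

    def expand(self, data):
        return [self.fn(item) for item in data]


# Chained usage: select the 'main' output inline and keep piping.
result = [1, 2, 3, 4] | Partition(lambda x: x % 2 == 0)["main"] | Map(lambda x: x * 10)
```

The point of the sketch is only that the consumer picks the output it wants
at the call site, so the transform author does not need to ship a separate
"main output only" variant.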

Thanks,
Danny

On Tue, Feb 10, 2026 at 4:42 PM Joey Tran <[email protected]> wrote:

> Just want to bump this. In what direction should we go here?
>
> On Fri, Feb 6, 2026 at 5:49 PM Joey Tran <[email protected]>
> wrote:
>
>>
>>
>> On Fri, Feb 6, 2026 at 5:43 PM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> On Fri, Feb 6, 2026 at 2:36 PM Joey Tran <[email protected]>
>>> wrote:
>>> >
>>> > On Fri, Feb 6, 2026 at 4:43 PM Danny McCormick <
>>> [email protected]> wrote:
>>> >>
>>> >> On Fri, Feb 6, 2026 at 4:22 PM Joey Tran <[email protected]>
>>> wrote:
>>> >>>
>>> >>> FWIW, much of the value of this proposal to me is the better
>>> readability from not having to consider multiple versions of transforms and
>>> not having to break up chains to extract main outputs. I appreciate though
>>> that we'd be making a trade-off of readability of the "sad path" for
>>> readability of the "happy path"
>>> >>
>>> >>
>>> >> Yeah, that makes sense; what do you think of the other alternative
>>> mentioned as an option for optimizing for both kinds of readability?
>>> Specifically, allowing:
>>> >>
>>> >>    pcoll | Partition(...)['main'] | ChainedParDo()
>>> >>
>>> >> I guess the downside there is education (all pipeline authors need to
>>> know this is an option as opposed to only one expert transform author), but
>>> I'm curious if it is sufficient for your context.
>>> >
>>> > Is the suggestion here to implement `__getitem__` on PTransform/ParDo
>>> so a particular pcollection can be specified? This would definitely be an
>>> improvement from the current state. I think one further improvement would
>>> be if we could specify the pcollection by attribute rather than by
>>> key/string, so `Partition(...).main` instead, but that risks pcollection
>>> name and ptransform method collisions.
>>> >
>>> > I'm still partial toward the other suggestions, particularly towards
>>> implementing `PTransform.with_outputs`, but this is probably sufficient for
>>> my context.
>>>
>>> I'll admit that I'm actually not a fan of with_outputs(...). It's not
>>> very DRY--I'd rather the consumer decide what it wants to consume by
>>> consuming it than have to also (redundantly) specify it on the
>>> producer. I think it dates back to trying to copy Java where the
>>> return type needs to be a typed PValue. Were I to do it again, I would
>>> have such transforms return a dict or named tuple (if all outputs are
>>> meaningful) or an "augmented" PCollection (as has been proposed here)
>>> when they are auxiliary (and preferably leave the decision up to the
>>> DoFn implementor, not the caller).
>>>
>>> - Robert
>>>
>>
>> Ha, yeah I also don't find it the most intuitively named / parametrized.
>> I usually need to look at its documentation each time I need to use it.
>> Standardization is nice though.
>>
>
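
For reference, the "augmented" PCollection idea from Robert's quoted reply
could look roughly like the toy sketch below. It is plain Python, not Beam
code: `AugmentedCollection` and `split_by_parity` are hypothetical names, and
a list subclass stands in for a PCollection. The secondary output here is
deliberately not an error output, in line with the documentation request
above.

```python
class AugmentedCollection(list):
    """Toy 'augmented' collection: it is the main output, and it also
    carries auxiliary outputs that callers may ignore."""

    def __init__(self, main, **secondary):
        super().__init__(main)
        # Auxiliary outputs, keyed by name.
        self.secondary = secondary


def split_by_parity(values):
    """Hypothetical producer: evens are the main output, odds are auxiliary."""
    evens = [v for v in values if v % 2 == 0]
    odds = [v for v in values if v % 2 != 0]
    return AugmentedCollection(evens, odd=odds)


out = split_by_parity([1, 2, 3, 4])

# Happy path: use the result directly as the main output.
doubled = [v * 2 for v in out]

# Only consumers that care about the auxiliary output reach for it.
odds = out.secondary["odd"]
```

Here the producer decides that the odd values are auxiliary, and the caller
never has to declare outputs up front.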
