I've updated the doc to call out the DLQ pattern. Thanks everyone for the feedback.
On Thu, Feb 5, 2026 at 6:07 PM Robert Bradshaw via dev <[email protected]> wrote: > On Thu, Feb 5, 2026 at 1:44 PM Joey Tran <[email protected]> > wrote: > >> Thanks for such quick feedback! >> >> On Thu, Feb 5, 2026 at 3:27 PM Danny McCormick via dev < >>> [email protected]> wrote: >>> >>>> Would you mind opening up the doc for comments? >>>> >>>> At a high level, I'm skeptical of the pattern; it seems to me like it >>>> moves the burden of choosing the correct behavior from authors to consumers >>>> in non-obvious ways which range from harmless to potentially causing silent >>>> data loss. I think if a user wants to drop a PCollection, that should >>>> always be an active choice since the risk of data loss is much greater than >>>> the EoU benefit of extra code. >>>> >>>> I think perhaps I poorly chose a few motivating examples, but it was at >> least helpful in clarifying two distinct patterns. >> - Filters/Samplers/Deduplicators >> - Transforms that may run into issues with certain inputs >> > > One usecase that comes to mind is running some computation with an > (safely ignorable) side output that has statistics about what was > computed/encountered. > > >> >> >>> I'd argue that a better pattern than having a single transform which >>>> handles this is to either have a *Filter *or a *Partition* transform >>>> which a user can use as needed. These are different transforms because they >>>> have different purposes/core functionalities. >>>> >>>> This can become unwieldy for a large library of filtering / sampling / >> data processing transforms. At Schrodinger for example, we may have maybe a >> dozen transforms some of which... >> - are samplers where most consumers will just be interested in the >> "sample", while other consumers may be interested in both the sample and >> remaining >> - are data processing transforms with a concept of processed outputs >> and "dropped for well-understood reason" >> > > I prefer transforms that can be used multiple ways than distinct > transforms that do the same thing. > > FWIW, our existing DoFn has somewhat this behavior: by default one > consumes only the "main" output, but the caller can ask for the side > outputs if desired. This extends it a bit further in that > > 1. Composite operations are supported (though I guess technically nothing > is stopping someone from manually constructing a DoFnTuple), and > 2. Side outputs are always returned and available for consumption, without > inconveniencing use of the main output, rather than having to explicitly > switch modes at application time. > > This seems like a strict win to me. > > We'd likely need to double the size of our library in order to have both >> Filter and Partition versions of these transforms. >> >> >>> > A parser that routes malformed input to a dead-letter output >>>> > A validator that routes violations separately >>>> > An enrichment that routes lookup failures aside >>>> >>>> These are the ones I'm really worried about. In all of these cases, we >>>> are silently dropping error output in a way that might be non-obvious to a >>>> user. As a user, if I use a parser that returns a single output, I would >>>> assume that any parse failures would lead to exceptions. >>>> >>>> I agree that it'd be an antipattern for these types of transforms to >> silently capture and drop these erroneous records, but there is nothing >> preventing an author of parser/validator/enrichment transform from doing >> this today even without ForkedPCollections. With ForkedPCollections, I >> think we can and still should discourage transform authors from silently >> handling errors without some active user configuration (e.g. by requiring >> as a keyword arg `error_handling_pcoll_name= "failed" to enable any error >> capturing at all). e.g. >> ``` >> parsed = pcoll | ParseData() >> # parsed.failed --> should not exist, ParseData should not automatically >> do this >> >> parsed = pcoll | ParseData(failed_pcoll_tag="failed") >> # parsed.failed --> exists now but only with user intent >> ``` >> > > +1. Errors should fail the pipeline unless one explicitly requests they be > passed somewhere else. > > >> >> >> >>> With all that said, I am aligned with the goal of making pipelines like >>>> this easier to chain. Maybe an in between option would be adding a DoFn >>>> utility like: >>>> >>>> ``` >>>> pcoll | Partition(...).keep_tag('main') | ChainedParDo() >>>> ``` >>>> >>>> Where `keep_tag` forces an expansion where all tags other than main are >>>> dropped. What do you think? >>>> >>> > One can already write > > pcoll | Partition(...)['main'] | ChainedParDo() > > This would help but this solution would be limited to ParDos. If you have >> a composite transform like a sophisticated `CompositeFilter` or >> `CompositeSampler`, then you wouldn't be able to use `.keep_tag`. >> >> Best, >> Joey >> >> >> >> >> >> >> >> >> Thanks, >>>> Danny >>>> >>>> On Thu, Feb 5, 2026 at 3:04 PM Joey Tran <[email protected]> >>>> wrote: >>>> >>>>> Hey everyone, >>>>> >>>>> My team and I have been running into an awkward pattern with the >>>>> python and YAML SDK when we have transforms that have one "main" output >>>>> that we want to be able to ergonomically chain, and other "side" outputs >>>>> that are useful in some situations. I put together a brief design proposal >>>>> for a new PCollection type to make this easier - would appreciate any >>>>> feedback or thoughts. Open to different names as well. >>>>> >>>>> ForkedPCollection Design Doc >>>>> <https://docs.google.com/document/d/10kx8hVrF8JfdeIS6X1vjiADk0IRMTVup9aX3u-GthOo/edit?tab=t.0> >>>>> >>>>> Thanks! >>>>> Joey >>>>> >>>>> -- >>>>> >>>>> Joey Tran | Staff Developer | AutoDesigner TL >>>>> >>>>> *he/him* >>>>> >>>>> [image: Schrödinger, Inc.] <https://schrodinger.com/> >>>>> >>>>
