Thanks for such quick feedback!
> On Thu, Feb 5, 2026 at 3:27 PM Danny McCormick via dev <[email protected]> wrote:
>
>> Would you mind opening up the doc for comments?
>>
>> At a high level, I'm skeptical of the pattern; it seems to me like it
>> moves the burden of choosing the correct behavior from authors to consumers
>> in non-obvious ways which range from harmless to potentially causing silent
>> data loss. I think if a user wants to drop a PCollection, that should
>> always be an active choice since the risk of data loss is much greater than
>> the EoU benefit of extra code.

I think perhaps I poorly chose a few motivating examples, but it was at least helpful in clarifying two distinct patterns:

- Filters/Samplers/Deduplicators
- Transforms that may run into issues with certain inputs

>> I'd argue that a better pattern than having a single transform which
>> handles this is to either have a *Filter* or a *Partition* transform
>> which a user can use as needed. These are different transforms because they
>> have different purposes/core functionalities.

This can become unwieldy for a large library of filtering / sampling / data processing transforms. At Schrodinger, for example, we have maybe a dozen transforms, some of which:

- are samplers where most consumers will just be interested in the "sample", while other consumers may be interested in both the sample and the remainder
- are data processing transforms with a concept of processed outputs and "dropped for well-understood reason" outputs

We'd likely need to double the size of our library in order to have both Filter and Partition versions of these transforms.

>> > A parser that routes malformed input to a dead-letter output
>> > A validator that routes violations separately
>> > An enrichment that routes lookup failures aside
>>
>> These are the ones I'm really worried about. In all of these cases, we
>> are silently dropping error output in a way that might be non-obvious to a
>> user.
>> As a user, if I use a parser that returns a single output, I would
>> assume that any parse failures would lead to exceptions.

I agree that it'd be an antipattern for these types of transforms to silently capture and drop these erroneous records, but there is nothing preventing the author of a parser/validator/enrichment transform from doing this today, even without ForkedPCollections. With ForkedPCollections, I think we can and still should discourage transform authors from silently handling errors without some active user configuration (e.g. by requiring a keyword arg such as `error_handling_pcoll_name="failed"` to enable any error capturing at all).

e.g.
```
parsed = pcoll | ParseData()
# parsed.failed --> should not exist, ParseData should not automatically do this

parsed = pcoll | ParseData(failed_pcoll_tag="failed")
# parsed.failed --> exists now but only with user intent
```

>> With all that said, I am aligned with the goal of making pipelines like
>> this easier to chain. Maybe an in between option would be adding a DoFn
>> utility like:
>>
>> ```
>> pcoll | Partition(...).keep_tag('main') | ChainedParDo()
>> ```
>>
>> Where `keep_tag` forces an expansion where all tags other than main are
>> dropped. What do you think?

This would help, but this solution would be limited to ParDos. If you have a composite transform like a sophisticated `CompositeFilter` or `CompositeSampler`, then you wouldn't be able to use `.keep_tag`.

Best,
Joey

>> Thanks,
>> Danny
>>
>> On Thu, Feb 5, 2026 at 3:04 PM Joey Tran <[email protected]>
>> wrote:
>>
>>> Hey everyone,
>>>
>>> My team and I have been running into an awkward pattern with the python
>>> and YAML SDK when we have transforms that have one "main" output that we
>>> want to be able to ergonomically chain, and other "side" outputs that are
>>> useful in some situations. I put together a brief design proposal for a new
>>> PCollection type to make this easier - would appreciate any feedback or
>>> thoughts.
>>> Open to different names as well.
>>>
>>> ForkedPCollection Design Doc
>>> <https://docs.google.com/document/d/10kx8hVrF8JfdeIS6X1vjiADk0IRMTVup9aX3u-GthOo/edit?tab=t.0>
>>>
>>> Thanks!
>>> Joey
>>>
>>> --
>>>
>>> Joey Tran | Staff Developer | AutoDesigner TL
>>>
>>> *he/him*
>>>
>>> [image: Schrödinger, Inc.] <https://schrodinger.com/>
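
P.S. To make the opt-in error output from the `ParseData` example above a bit more concrete, here is a rough sketch in plain Python rather than Beam. `ForkedOutput` and `parse_data` are hypothetical stand-ins invented for illustration (they are not Beam APIs and not the design doc's implementation); only the `failed_pcoll_tag` keyword is borrowed from the example above. The point it tries to show: the result chains like the main output, and a side output only exists when the caller asks for it.

```python
class ForkedOutput(list):
    """Toy stand-in for a ForkedPCollection (hypothetical, not a Beam class).

    Behaves like the main output; opted-in side outputs hang off it
    as attributes.
    """

    def __init__(self, main, sides=None):
        super().__init__(main)
        for tag, values in (sides or {}).items():
            setattr(self, tag, list(values))


def parse_data(records, failed_pcoll_tag=None):
    """Parse strings to ints (hypothetical stand-in for ParseData).

    Failures raise unless the caller opts in to capturing them by
    naming a tag for the failed records.
    """
    parsed, failed = [], []
    for record in records:
        try:
            parsed.append(int(record))
        except ValueError:
            if failed_pcoll_tag is None:
                raise  # default: never swallow bad input silently
            failed.append(record)
    sides = {failed_pcoll_tag: failed} if failed_pcoll_tag is not None else {}
    return ForkedOutput(parsed, sides)


# Opt-in: the main output still chains like a plain list,
# and the failures are reachable under the tag the caller chose.
out = parse_data(["1", "x", "3"], failed_pcoll_tag="failed")
doubled = [n * 2 for n in out]  # [2, 6]
captured = out.failed           # ["x"]
```

Without `failed_pcoll_tag`, `parse_data(["x"])` raises `ValueError` and the result has no `failed` attribute at all, which is the "only with user intent" behavior argued for above.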
