Sounds good. Thanks for clarifying. - Cham
On Thu, May 7, 2026, 8:26 AM Yi Hu <[email protected]> wrote: > Hi Cham, > > I hope the original GSoC proposal [1] could answer your question > > > While Splittable DoFn has been introduced as a Beam primitive transform > handling IO sources, UnboundedSource arguably remains an easier API for > users to author their own IOs. > > In the Java SDK, UnboundedSource/UnboundedReader has been > (re)implemented as a wrapper of Splittable DoFn, we can follow the Java > implementation and add it to Python. > > The goal is to make it easier for beginners to author production-ready > (not necessarily very optimized) IOs easier. SDF is versatile but requires > users to implement a couple of methods > working closely, and remarkably self-checkpointing logic; requiring > implementing TruncateOffsetTracker to be able to drain, etc. > While UnboundedSource has a much gradual learning curve. Users can author > a fairly production ready IO by implement only start, advance, (and split > can just return null if not-splittable), > finalizeCheckpoint (relay to the wrapper to handle self-checkpointing > logic). > > I would consider UnboundedSource as a wrapper or utility class that > complements SDF, as Java currently does (if ignoring runner specific > override). > > Thanks, > > Yi > > [1] https://issues.apache.org/jira/browse/GSOC-315 > > On Wed, May 6, 2026 at 9:03 PM Chamikara Jayalath via dev < > [email protected]> wrote: > >> Thanks for the great write up. >> >> I have a high level question. Why would Python SDK need an >> UnboundedSource / UnboundedReader API ? >> >> Java SDK has these since they predated the Splittable DoFn API. We added >> the Unbounded wrapper (as you noted) to make old sources implemented in >> this API available to runners that only support Splittable DoFn. Python >> SDK, on the other hand, only supports portable runners, where Splittable >> DoFn support is available and any new sources can be implemented using >> Splittable DoFn API: >> https://beam.apache.org/documentation/io/developing-io-overview/ >> >> +1 for adding support for the Watch transform. >> >> Thanks, >> Cham >> >> On Mon, May 4, 2026 at 5:24 AM Elia Liu <[email protected]> wrote: >> >>> I'd also like to share two documents with the community for feedback: >>> >>> • Design doc (also serving as my GSoC proposal): >>> >>> https://drive.google.com/file/d/1M_cg0iQh0D5AYyPRRoo88qcs2LKIcENW/view?usp=sharing >>> >>> • Proof of concept: >>> https://github.com/Eliaaazzz/beam-native-streaming-poc >>> >>> The design doc and PoC cover the proposed SDF architecture, watermark >>> and checkpoint semantics, and the test suite. >>> >>> Any feedback, suggestions, or concerns are very welcome, especially >>> around watermark propagation, checkpoint encoding, and runner >>> compatibility. Happy to iterate based on community feedback! >>> >>> On Fri, May 1, 2026 at 12:05 PM Elia LIU <[email protected]> wrote: >>> > >>> > Hi all, >>> > >>> > I'm Elia, a final-year Computer Science student at the University of >>> > Melbourne. I'm writing to introduce myself as one of this year's GSoC >>> > contributors under Apache Beam. Over the next few months I'll be >>> > working with Yi Hu on Native Streaming Transforms for the Python SDK, >>> > and I'm really looking forward to it. >>> > >>> > I've been contributing to Beam in smaller ways over the past several >>> > months, and I've come to deeply appreciate how thoughtful and >>> > welcoming this community is. I'm grateful for the chance to take on a >>> > more sustained piece of work and to engage more closely with all of >>> > you. >>> > >>> > === Project Summary === >>> > >>> > Beam's Python SDK is increasingly the first choice for ML and >>> > data-intensive pipelines, making robust native streaming APIs more >>> > important than ever. However, two essential streaming primitives, >>> > UnboundedSource (https://github.com/apache/beam/issues/19137) and >>> > Watch (https://github.com/apache/beam/issues/21521), remain >>> > unavailable in Python despite being long established in the Java SDK. >>> > Today, Python developers who need these capabilities either pull in >>> > cross-language transforms that add Java dependencies and gRPC >>> > overhead, or wrestle with low-level RestrictionTracker internals. >>> > >>> > This project aims to address the gap by porting both primitives >>> > natively to Python on top of Beam's existing SDF framework. The >>> > UnboundedSource wrapper adapts the legacy reader API into a Splittable >>> > DoFn that handles checkpointing, watermark progression, deduplication, >>> > and splitting. The Watch transform adds periodic polling with >>> > composable termination conditions and stable dedup behavior. >>> > >>> > Porting both primitives natively abstracts that complexity behind >>> > clean and Pythonic interfaces. For ML and data-intensive pipelines, >>> > this means streaming ingestion from sources like Kafka and Pub/Sub, >>> > continuous polling for new training data or model artifacts, and >>> > event-driven feature pipelines can all be expressed natively in >>> > Python, without cross-language overhead or low-level SDF code. >>> > >>> > === Planned Deliverables === >>> > >>> > * D1: A Python UnboundedSource API with its SDF-based wrapper >>> > * D2: A native Watch transform with PollFn and TerminationConditions >>> > * D3: A test suite spanning DirectRunner and Dataflow >>> > * D4: Supporting documentation including docstrings, programming-guide >>> > updates, and migration notes >>> > >>> > I'd genuinely value any feedback, suggestions, or historical context >>> > from those of you who've worked on related areas. I'm sure there are >>> > nuances and edge cases I haven't fully thought through yet, and I'd be >>> > very glad to hear from anyone willing to share thoughts. >>> > >>> > Many thanks to Yi Hu for the mentorship, and to the wider community >>> > for the warm reception so far. Looking forward to working alongside >>> > you all over the coming months. >>> > >>> > Best regards, >>> > Elia >>> >>
