Hi Cham, I hope the original GSoC proposal [1] could answer your question
> While Splittable DoFn has been introduced as a Beam primitive transform handling IO sources, UnboundedSource arguably remains an easier API for users to author their own IOs. > In the Java SDK, UnboundedSource/UnboundedReader has been (re)implemented as a wrapper of Splittable DoFn, we can follow the Java implementation and add it to Python. The goal is to make it easier for beginners to author production-ready (not necessarily very optimized) IOs easier. SDF is versatile but requires users to implement a couple of methods working closely, and remarkably self-checkpointing logic; requiring implementing TruncateOffsetTracker to be able to drain, etc. While UnboundedSource has a much gradual learning curve. Users can author a fairly production ready IO by implement only start, advance, (and split can just return null if not-splittable), finalizeCheckpoint (relay to the wrapper to handle self-checkpointing logic). I would consider UnboundedSource as a wrapper or utility class that complements SDF, as Java currently does (if ignoring runner specific override). Thanks, Yi [1] https://issues.apache.org/jira/browse/GSOC-315 On Wed, May 6, 2026 at 9:03 PM Chamikara Jayalath via dev < [email protected]> wrote: > Thanks for the great write up. > > I have a high level question. Why would Python SDK need an UnboundedSource > / UnboundedReader API ? > > Java SDK has these since they predated the Splittable DoFn API. We added > the Unbounded wrapper (as you noted) to make old sources implemented in > this API available to runners that only support Splittable DoFn. Python > SDK, on the other hand, only supports portable runners, where Splittable > DoFn support is available and any new sources can be implemented using > Splittable DoFn API: > https://beam.apache.org/documentation/io/developing-io-overview/ > > +1 for adding support for the Watch transform. > > Thanks, > Cham > > On Mon, May 4, 2026 at 5:24 AM Elia Liu <[email protected]> wrote: > >> I'd also like to share two documents with the community for feedback: >> >> • Design doc (also serving as my GSoC proposal): >> >> https://drive.google.com/file/d/1M_cg0iQh0D5AYyPRRoo88qcs2LKIcENW/view?usp=sharing >> >> • Proof of concept: >> https://github.com/Eliaaazzz/beam-native-streaming-poc >> >> The design doc and PoC cover the proposed SDF architecture, watermark >> and checkpoint semantics, and the test suite. >> >> Any feedback, suggestions, or concerns are very welcome, especially >> around watermark propagation, checkpoint encoding, and runner >> compatibility. Happy to iterate based on community feedback! >> >> On Fri, May 1, 2026 at 12:05 PM Elia LIU <[email protected]> wrote: >> > >> > Hi all, >> > >> > I'm Elia, a final-year Computer Science student at the University of >> > Melbourne. I'm writing to introduce myself as one of this year's GSoC >> > contributors under Apache Beam. Over the next few months I'll be >> > working with Yi Hu on Native Streaming Transforms for the Python SDK, >> > and I'm really looking forward to it. >> > >> > I've been contributing to Beam in smaller ways over the past several >> > months, and I've come to deeply appreciate how thoughtful and >> > welcoming this community is. I'm grateful for the chance to take on a >> > more sustained piece of work and to engage more closely with all of >> > you. >> > >> > === Project Summary === >> > >> > Beam's Python SDK is increasingly the first choice for ML and >> > data-intensive pipelines, making robust native streaming APIs more >> > important than ever. However, two essential streaming primitives, >> > UnboundedSource (https://github.com/apache/beam/issues/19137) and >> > Watch (https://github.com/apache/beam/issues/21521), remain >> > unavailable in Python despite being long established in the Java SDK. >> > Today, Python developers who need these capabilities either pull in >> > cross-language transforms that add Java dependencies and gRPC >> > overhead, or wrestle with low-level RestrictionTracker internals. >> > >> > This project aims to address the gap by porting both primitives >> > natively to Python on top of Beam's existing SDF framework. The >> > UnboundedSource wrapper adapts the legacy reader API into a Splittable >> > DoFn that handles checkpointing, watermark progression, deduplication, >> > and splitting. The Watch transform adds periodic polling with >> > composable termination conditions and stable dedup behavior. >> > >> > Porting both primitives natively abstracts that complexity behind >> > clean and Pythonic interfaces. For ML and data-intensive pipelines, >> > this means streaming ingestion from sources like Kafka and Pub/Sub, >> > continuous polling for new training data or model artifacts, and >> > event-driven feature pipelines can all be expressed natively in >> > Python, without cross-language overhead or low-level SDF code. >> > >> > === Planned Deliverables === >> > >> > * D1: A Python UnboundedSource API with its SDF-based wrapper >> > * D2: A native Watch transform with PollFn and TerminationConditions >> > * D3: A test suite spanning DirectRunner and Dataflow >> > * D4: Supporting documentation including docstrings, programming-guide >> > updates, and migration notes >> > >> > I'd genuinely value any feedback, suggestions, or historical context >> > from those of you who've worked on related areas. I'm sure there are >> > nuances and edge cases I haven't fully thought through yet, and I'd be >> > very glad to hear from anyone willing to share thoughts. >> > >> > Many thanks to Yi Hu for the mentorship, and to the wider community >> > for the warm reception so far. Looking forward to working alongside >> > you all over the coming months. >> > >> > Best regards, >> > Elia >> >
