Re: [GSoC 2026] Introduction from Elia: Native Streaming Transforms for the Python SDK

Yi Hu via dev Thu, 07 May 2026 08:27:37 -0700

Hi Cham,

I hope the original GSoC proposal [1] answers your question:


> While Splittable DoFn has been introduced as a Beam primitive transform
handling IO sources, UnboundedSource arguably remains an easier API for
users to author their own IOs.
> In the Java SDK, UnboundedSource/UnboundedReader has been (re)implemented
as a wrapper of Splittable DoFn, we can follow the Java implementation and
add it to Python.

The goal is to make it easier for beginners to author production-ready (not
necessarily highly optimized) IOs. SDF is versatile, but it requires users
to implement several methods that work closely together, most notably the
self-checkpointing logic, and to implement TruncateOffsetTracker to be able
to drain, etc.
UnboundedSource, by contrast, has a much more gradual learning curve. Users
can author a fairly production-ready IO by implementing only start, advance,
split (which can just return null if the source is not splittable), and
finalizeCheckpoint, relying on the wrapper to handle the self-checkpointing
logic.
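
As a rough illustration of how small that surface could be, here is a
minimal sketch in plain Python. Every class, method, and parameter name
below is hypothetical and only mirrors the Java method names mentioned
above; none of it is an existing Beam Python API, and read_bundle is just a
stand-in for what an SDF-based wrapper might do on the author's behalf.

```python
# Hypothetical sketch: none of these classes exist in the Beam Python SDK.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CounterCheckpoint:
    """Resume position for the reader (assumed shape of a checkpoint mark)."""
    next_value: int
    finalized: bool = False

    def finalize_checkpoint(self) -> None:
        # A real source would ack messages or commit offsets here.
        self.finalized = True


class CounterReader:
    """Emits integers from the resume position up to, but excluding, limit."""

    def __init__(self, limit: int, checkpoint: Optional[CounterCheckpoint]):
        self._limit = limit
        self._next = checkpoint.next_value if checkpoint else 0
        self._current: Optional[int] = None

    def start(self) -> bool:
        # Position the reader on the first record, if one is available.
        return self.advance()

    def advance(self) -> bool:
        if self._next >= self._limit:
            return False  # nothing available right now
        self._current = self._next
        self._next += 1
        return True

    def get_current(self) -> Optional[int]:
        return self._current

    def get_checkpoint_mark(self) -> CounterCheckpoint:
        return CounterCheckpoint(next_value=self._next)


class CounterSource:
    def __init__(self, limit: int):
        self.limit = limit

    def split(self, desired_num_splits: int) -> None:
        return None  # this toy source is not splittable

    def create_reader(self, checkpoint: Optional[CounterCheckpoint]) -> CounterReader:
        return CounterReader(self.limit, checkpoint)


def read_bundle(
    source: CounterSource,
    checkpoint: Optional[CounterCheckpoint],
    max_records: int,
) -> Tuple[List[int], CounterCheckpoint]:
    """Stand-in for the wrapper: read up to max_records (>= 1), then checkpoint."""
    reader = source.create_reader(checkpoint)
    out: List[int] = []
    available = reader.start()
    while available:
        out.append(reader.get_current())
        if len(out) >= max_records:
            break  # stop before consuming a record we would not emit
        available = reader.advance()
    return out, reader.get_checkpoint_mark()
```

The IO author only writes the reader and its checkpoint mark; deciding when
to stop, checkpoint, and resume stays in the wrapper-like loop.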

I would consider UnboundedSource a wrapper or utility class that
complements SDF, as it currently is in Java (ignoring runner-specific
overrides).

Thanks,

Yi

[1] https://issues.apache.org/jira/browse/GSOC-315

On Wed, May 6, 2026 at 9:03 PM Chamikara Jayalath via dev <
[email protected]> wrote:

> Thanks for the great write up.
>
> I have a high-level question. Why would the Python SDK need an
> UnboundedSource / UnboundedReader API?
>
> Java SDK has these since they predated the Splittable DoFn API. We added
> the Unbounded wrapper (as you noted) to make old sources implemented in
> this API available to runners that only support Splittable DoFn. Python
> SDK, on the other hand, only supports portable runners, where Splittable
> DoFn support is available and any new sources can be implemented using
> Splittable DoFn API:
> https://beam.apache.org/documentation/io/developing-io-overview/
>
> +1 for adding support for the Watch transform.
>
> Thanks,
> Cham
>
> On Mon, May 4, 2026 at 5:24 AM Elia Liu <[email protected]> wrote:
>
>> I'd also like to share two documents with the community for feedback:
>>
>> • Design doc (also serving as my GSoC proposal):
>>
>> https://drive.google.com/file/d/1M_cg0iQh0D5AYyPRRoo88qcs2LKIcENW/view?usp=sharing
>>
>> • Proof of concept:
>> https://github.com/Eliaaazzz/beam-native-streaming-poc
>>
>> The design doc and PoC cover the proposed SDF architecture, watermark
>> and checkpoint semantics, and the test suite.
>>
>> Any feedback, suggestions, or concerns are very welcome, especially
>> around watermark propagation, checkpoint encoding, and runner
>> compatibility. Happy to iterate based on community feedback!
>>
>> On Fri, May 1, 2026 at 12:05 PM Elia LIU <[email protected]> wrote:
>> >
>> > Hi all,
>> >
>> > I'm Elia, a final-year Computer Science student at the University of
>> > Melbourne. I'm writing to introduce myself as one of this year's GSoC
>> > contributors under Apache Beam. Over the next few months I'll be
>> > working with Yi Hu on Native Streaming Transforms for the Python SDK,
>> > and I'm really looking forward to it.
>> >
>> > I've been contributing to Beam in smaller ways over the past several
>> > months, and I've come to deeply appreciate how thoughtful and
>> > welcoming this community is. I'm grateful for the chance to take on a
>> > more sustained piece of work and to engage more closely with all of
>> > you.
>> >
>> > === Project Summary ===
>> >
>> > Beam's Python SDK is increasingly the first choice for ML and
>> > data-intensive pipelines, making robust native streaming APIs more
>> > important than ever. However, two essential streaming primitives,
>> > UnboundedSource (https://github.com/apache/beam/issues/19137) and
>> > Watch (https://github.com/apache/beam/issues/21521), remain
>> > unavailable in Python despite being long established in the Java SDK.
>> > Today, Python developers who need these capabilities either pull in
>> > cross-language transforms that add Java dependencies and gRPC
>> > overhead, or wrestle with low-level RestrictionTracker internals.
>> >
>> > This project aims to address the gap by porting both primitives
>> > natively to Python on top of Beam's existing SDF framework. The
>> > UnboundedSource wrapper adapts the legacy reader API into a Splittable
>> > DoFn that handles checkpointing, watermark progression, deduplication,
>> > and splitting. The Watch transform adds periodic polling with
>> > composable termination conditions and stable dedup behavior.
>> >
>> > Porting both primitives natively abstracts that complexity behind
>> > clean and Pythonic interfaces. For ML and data-intensive pipelines,
>> > this means streaming ingestion from sources like Kafka and Pub/Sub,
>> > continuous polling for new training data or model artifacts, and
>> > event-driven feature pipelines can all be expressed natively in
>> > Python, without cross-language overhead or low-level SDF code.
>> >
>> > === Planned Deliverables ===
>> >
>> > * D1: A Python UnboundedSource API with its SDF-based wrapper
>> > * D2: A native Watch transform with PollFn and TerminationConditions
>> > * D3: A test suite spanning DirectRunner and Dataflow
>> > * D4: Supporting documentation including docstrings, programming-guide
>> > updates, and migration notes
>> >
>> > I'd genuinely value any feedback, suggestions, or historical context
>> > from those of you who've worked on related areas. I'm sure there are
>> > nuances and edge cases I haven't fully thought through yet, and I'd be
>> > very glad to hear from anyone willing to share thoughts.
>> >
>> > Many thanks to Yi Hu for the mentorship, and to the wider community
>> > for the warm reception so far. Looking forward to working alongside
>> > you all over the coming months.
>> >
>> > Best regards,
>> > Elia
>>
>
