Re: [GSoC 2026] Introduction from Elia: Native Streaming Transforms for the Python SDK

Chamikara Jayalath via dev Thu, 07 May 2026 11:05:06 -0700

Sounds good. Thanks for clarifying.

- Cham



On Thu, May 7, 2026, 8:26 AM Yi Hu <[email protected]> wrote:

> Hi Cham,
>
> I hope the original GSoC proposal [1] answers your question:
>
> > While Splittable DoFn has been introduced as a Beam primitive transform
> handling IO sources, UnboundedSource arguably remains an easier API for
> users to author their own IOs.
> > In the Java SDK, UnboundedSource/UnboundedReader has been
> (re)implemented as a wrapper of Splittable DoFn, we can follow the Java
> implementation and add it to Python.
>
> The goal is to make it easier for beginners to author production-ready
> (not necessarily highly optimized) IOs. SDF is versatile, but it requires
> users to implement several methods that work closely together, notably
> the self-checkpointing logic, and to implement TruncateOffsetTracker to
> be able to drain, etc.
> UnboundedSource, by contrast, has a much more gradual learning curve.
> Users can author a fairly production-ready IO by implementing only start,
> advance, split (which can just return null if the source is not
> splittable), and finalizeCheckpoint, relying on the wrapper to handle the
> self-checkpointing logic.
>
> I would consider UnboundedSource a wrapper or utility class that
> complements SDF, as it currently does in Java (ignoring runner-specific
> overrides).
>
> Thanks,
>
> Yi
>
> [1] https://issues.apache.org/jira/browse/GSOC-315
>
> On Wed, May 6, 2026 at 9:03 PM Chamikara Jayalath via dev <
> [email protected]> wrote:
>
>> Thanks for the great write up.
>>
>> I have a high-level question. Why would the Python SDK need an
>> UnboundedSource / UnboundedReader API?
>>
>> The Java SDK has these because they predate the Splittable DoFn API. We
>> added the Unbounded wrapper (as you noted) to make old sources implemented
>> in this API available to runners that only support Splittable DoFn. The
>> Python SDK, on the other hand, only supports portable runners, where
>> Splittable DoFn support is available and any new sources can be
>> implemented using the Splittable DoFn API:
>> https://beam.apache.org/documentation/io/developing-io-overview/
>>
>> +1 for adding support for the Watch transform.
>>
>> Thanks,
>> Cham
>>
>> On Mon, May 4, 2026 at 5:24 AM Elia Liu <[email protected]> wrote:
>>
>>> I'd also like to share two documents with the community for feedback:
>>>
>>> • Design doc (also serving as my GSoC proposal):
>>>
>>> https://drive.google.com/file/d/1M_cg0iQh0D5AYyPRRoo88qcs2LKIcENW/view?usp=sharing
>>>
>>> • Proof of concept:
>>> https://github.com/Eliaaazzz/beam-native-streaming-poc
>>>
>>> The design doc and PoC cover the proposed SDF architecture, watermark
>>> and checkpoint semantics, and the test suite.
>>>
>>> Any feedback, suggestions, or concerns are very welcome, especially
>>> around watermark propagation, checkpoint encoding, and runner
>>> compatibility. Happy to iterate based on community feedback!
>>>
>>> On Fri, May 1, 2026 at 12:05 PM Elia LIU <[email protected]> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I'm Elia, a final-year Computer Science student at the University of
>>> > Melbourne. I'm writing to introduce myself as one of this year's GSoC
>>> > contributors under Apache Beam. Over the next few months I'll be
>>> > working with Yi Hu on Native Streaming Transforms for the Python SDK,
>>> > and I'm really looking forward to it.
>>> >
>>> > I've been contributing to Beam in smaller ways over the past several
>>> > months, and I've come to deeply appreciate how thoughtful and
>>> > welcoming this community is. I'm grateful for the chance to take on a
>>> > more sustained piece of work and to engage more closely with all of
>>> > you.
>>> >
>>> > === Project Summary ===
>>> >
>>> > Beam's Python SDK is increasingly the first choice for ML and
>>> > data-intensive pipelines, making robust native streaming APIs more
>>> > important than ever. However, two essential streaming primitives,
>>> > UnboundedSource (https://github.com/apache/beam/issues/19137) and
>>> > Watch (https://github.com/apache/beam/issues/21521), remain
>>> > unavailable in Python despite being long established in the Java SDK.
>>> > Today, Python developers who need these capabilities either pull in
>>> > cross-language transforms that add Java dependencies and gRPC
>>> > overhead, or wrestle with low-level RestrictionTracker internals.
>>> >
>>> > This project aims to address the gap by porting both primitives
>>> > natively to Python on top of Beam's existing SDF framework. The
>>> > UnboundedSource wrapper adapts the legacy reader API into a Splittable
>>> > DoFn that handles checkpointing, watermark progression, deduplication,
>>> > and splitting. The Watch transform adds periodic polling with
>>> > composable termination conditions and stable dedup behavior.
>>> >
>>> > Porting both primitives natively abstracts that complexity behind
>>> > clean and Pythonic interfaces. For ML and data-intensive pipelines,
>>> > this means streaming ingestion from sources like Kafka and Pub/Sub,
>>> > continuous polling for new training data or model artifacts, and
>>> > event-driven feature pipelines can all be expressed natively in
>>> > Python, without cross-language overhead or low-level SDF code.
>>> >
>>> > === Planned Deliverables ===
>>> >
>>> > * D1: A Python UnboundedSource API with its SDF-based wrapper
>>> > * D2: A native Watch transform with PollFn and TerminationConditions
>>> > * D3: A test suite spanning DirectRunner and Dataflow
>>> > * D4: Supporting documentation including docstrings, programming-guide
>>> > updates, and migration notes
>>> >
>>> > I'd genuinely value any feedback, suggestions, or historical context
>>> > from those of you who've worked on related areas. I'm sure there are
>>> > nuances and edge cases I haven't fully thought through yet, and I'd be
>>> > very glad to hear from anyone willing to share thoughts.
>>> >
>>> > Many thanks to Yi Hu for the mentorship, and to the wider community
>>> > for the warm reception so far. Looking forward to working alongside
>>> > you all over the coming months.
>>> >
>>> > Best regards,
>>> > Elia
>>>
>>
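
To make the reader-style API discussed in Yi's reply more concrete, here is a minimal sketch of what "implement only start, advance, and finalizeCheckpoint" might look like in Python. It is not taken from the proposal or from Beam: CounterCheckpoint, CounterReader, and read_bundle are hypothetical names, and read_bundle is only a toy stand-in for the wrapper that would own the self-checkpointing logic.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CounterCheckpoint:
    """Hypothetical checkpoint mark: the next value still to be emitted."""
    next_value: int
    finalized: bool = False

    def finalize_checkpoint(self) -> None:
        # A real source would acknowledge or commit upstream here.
        self.finalized = True


class CounterReader:
    """Hypothetical reader that emits 0, 1, 2, ... below `limit`."""

    def __init__(self, limit: int,
                 checkpoint: Optional[CounterCheckpoint] = None):
        self._limit = limit
        # Resume from the checkpoint if one is given, else start at 0.
        self._next = checkpoint.next_value if checkpoint else 0
        self._current: Optional[int] = None

    def start(self) -> bool:
        # Position on the first record; False means nothing is available.
        return self.advance()

    def advance(self) -> bool:
        if self._next >= self._limit:
            return False
        self._current = self._next
        self._next += 1
        return True

    def get_current(self) -> Optional[int]:
        return self._current

    def get_checkpoint_mark(self) -> CounterCheckpoint:
        return CounterCheckpoint(next_value=self._next)


def read_bundle(reader: CounterReader,
                max_records: int) -> Tuple[List[int], CounterCheckpoint]:
    """Toy wrapper loop: read up to max_records, then take a checkpoint."""
    out: List[int] = []
    available = reader.start()
    while available:
        out.append(reader.get_current())
        if len(out) >= max_records:
            break
        available = reader.advance()
    return out, reader.get_checkpoint_mark()
```

In this sketch the source author writes only the reader and the checkpoint mark. Deciding when to stop, checkpoint, and resume lives in read_bundle, which is the part a real SDF-based wrapper would take over.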

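The Watch behaviour described in the project summary (periodic polling, a termination condition, and deduplication) can be illustrated in plain Python as well. This is only a sketch under stated assumptions: watch, poll_fn, and should_stop are hypothetical names rather than the proposed PollFn or TerminationCondition API, and the delay between polls that a real transform would need is left out.

```python
from typing import Callable, Hashable, Iterable, Iterator


def watch(poll_fn: Callable[[], Iterable[Hashable]],
          should_stop: Callable[[int, int], bool],
          max_rounds: int = 1000) -> Iterator[Hashable]:
    """Yield each distinct polled output exactly once.

    should_stop(round_index, new_outputs_this_round) is a hypothetical
    termination condition evaluated after every poll round.
    """
    seen = set()
    for round_index in range(max_rounds):
        new_count = 0
        for item in poll_fn():
            if item not in seen:  # dedup across poll rounds
                seen.add(item)
                new_count += 1
                yield item
        if should_stop(round_index, new_count):
            return
```

For example, a should_stop of `lambda round_index, new_count: new_count == 0` ends the watch after the first poll that returns nothing new, while outputs repeated across earlier polls are emitted only once.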