Hi all,

I'm Elia, a final-year Computer Science student at the University of
Melbourne. I'm writing to introduce myself as one of this year's GSoC
contributors under Apache Beam. Over the next few months I'll be
working with Yi Hu on Native Streaming Transforms for the Python SDK,
and I'm really looking forward to it.

I've been contributing to Beam in smaller ways over the past several
months, and I've come to deeply appreciate how thoughtful and
welcoming this community is. I'm grateful for the chance to take on a
more sustained piece of work and to engage more closely with all of
you.

=== Project Summary ===

Beam's Python SDK is increasingly the first choice for ML and
data-intensive pipelines, making robust native streaming APIs more
important than ever. However, two essential streaming primitives,
UnboundedSource (https://github.com/apache/beam/issues/19137) and
Watch (https://github.com/apache/beam/issues/21521), remain
unavailable in Python despite being long established in the Java SDK.
Today, Python developers who need these capabilities either pull in
cross-language transforms that add Java dependencies and gRPC
overhead, or wrestle with low-level RestrictionTracker internals.

This project aims to close that gap by porting both primitives
natively to Python on top of Beam's existing Splittable DoFn (SDF)
framework. The UnboundedSource wrapper adapts the legacy reader API
into an SDF that handles checkpointing, watermark progression,
deduplication,
and splitting. The Watch transform adds periodic polling with
composable termination conditions and stable dedup behavior.
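
To make the Watch semantics a little more concrete, here is a rough
plain-Python sketch of the polling loop. It is not Beam code and not
the proposed API: the names watch, after_total_of, after_idle_polls,
and either_of are hypothetical, and a real implementation would live
inside an SDF with checkpointed state rather than an in-memory set.

```python
import time
from typing import Callable, Hashable, Iterable, Iterator

# All names below are hypothetical illustrations, not Beam APIs.
PollFn = Callable[[], Iterable[Hashable]]
# A termination condition sees (elapsed_seconds, consecutive_idle_polls).
Termination = Callable[[float, int], bool]


def after_total_of(seconds: float) -> Termination:
    """Stop once the watch has been running for `seconds`."""
    return lambda elapsed, idle_polls: elapsed >= seconds


def after_idle_polls(n: int) -> Termination:
    """Stop after `n` consecutive polls that produced nothing new."""
    return lambda elapsed, idle_polls: idle_polls >= n


def either_of(*conditions: Termination) -> Termination:
    """Compose conditions: stop as soon as any one of them fires."""
    return lambda elapsed, idle_polls: any(
        c(elapsed, idle_polls) for c in conditions)


def watch(poll_fn: PollFn,
          should_stop: Termination,
          interval: float = 1.0,
          clock: Callable[[], float] = time.monotonic,
          sleep: Callable[[float], None] = time.sleep) -> Iterator[Hashable]:
    """Poll repeatedly, yielding each distinct item exactly once."""
    seen = set()  # in-memory dedup state; an assumption for this sketch
    start = clock()
    idle_polls = 0
    while not should_stop(clock() - start, idle_polls):
        emitted = False
        for item in poll_fn():
            if item not in seen:
                seen.add(item)
                emitted = True
                yield item
        idle_polls = 0 if emitted else idle_polls + 1
        sleep(interval)


# Usage: each poll returns a full snapshot; only unseen items are emitted.
snapshots = iter([["a"], ["a", "b"], ["a", "b"], ["a", "b"], ["c"]])
stop = either_of(after_total_of(60), after_idle_polls(2))
result = list(watch(lambda: next(snapshots), stop,
                    interval=0, sleep=lambda s: None))
print(result)  # ['a', 'b']
```

The loop stops after two idle polls, so the final snapshot is never
read. The dedup set only adds items as they are emitted, which is the
"stable dedup" idea in miniature.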

Both ports hide the SDF machinery behind clean, Pythonic interfaces.
For ML and data-intensive pipelines, this means that streaming
ingestion from sources like Kafka and Pub/Sub, continuous polling for
new training data or model artifacts, and event-driven feature
pipelines can all be expressed natively in Python, without
cross-language overhead or low-level SDF code.

=== Planned Deliverables ===

* D1: A Python UnboundedSource API with its SDF-based wrapper
* D2: A native Watch transform with PollFn and TerminationConditions
* D3: A test suite spanning DirectRunner and Dataflow
* D4: Supporting documentation including docstrings, programming-guide
  updates, and migration notes

I'd genuinely value any feedback, suggestions, or historical context
from those of you who've worked on related areas. I'm sure there are
nuances and edge cases I haven't fully thought through yet, and I'd be
very glad to hear from anyone willing to share thoughts.

Many thanks to Yi Hu for the mentorship, and to the wider community
for the warm reception so far. Looking forward to working alongside
you all over the coming months.

Best regards,
Elia
