Hi all, I'm Elia, a final-year Computer Science student at the University of Melbourne. I'm writing to introduce myself as one of this year's GSoC contributors under Apache Beam. Over the next few months I'll be working with Yi Hu on Native Streaming Transforms for the Python SDK, and I'm really looking forward to it.
I've been contributing to Beam in smaller ways over the past several months, and I've come to deeply appreciate how thoughtful and welcoming this community is. I'm grateful for the chance to take on a more sustained piece of work and to engage more closely with all of you. === Project Summary === Beam's Python SDK is increasingly the first choice for ML and data-intensive pipelines, making robust native streaming APIs more important than ever. However, two essential streaming primitives, UnboundedSource (https://github.com/apache/beam/issues/19137) and Watch (https://github.com/apache/beam/issues/21521), remain unavailable in Python despite being long established in the Java SDK. Today, Python developers who need these capabilities either pull in cross-language transforms that add Java dependencies and gRPC overhead, or wrestle with low-level RestrictionTracker internals. This project aims to address the gap by porting both primitives natively to Python on top of Beam's existing SDF framework. The UnboundedSource wrapper adapts the legacy reader API into a Splittable DoFn that handles checkpointing, watermark progression, deduplication, and splitting. The Watch transform adds periodic polling with composable termination conditions and stable dedup behavior. Porting both primitives natively abstracts that complexity behind clean and Pythonic interfaces. For ML and data-intensive pipelines, this means streaming ingestion from sources like Kafka and Pub/Sub, continuous polling for new training data or model artifacts, and event-driven feature pipelines can all be expressed natively in Python, without cross-language overhead or low-level SDF code. === Planned Deliverables === * D1: A Python UnboundedSource API with its SDF-based wrapper * D2: A native Watch transform with PollFn and TerminationConditions * D3: A test suite spanning DirectRunner and Dataflow * D4: Supporting documentation including docstrings, programming-guide updates, and migration notes I'd genuinely value any feedback, suggestions, or historical context from those of you who've worked on related areas. I'm sure there are nuances and edge cases I haven't fully thought through yet, and I'd be very glad to hear from anyone willing to share thoughts. Many thanks to Yi Hu for the mentorship, and to the wider community for the warm reception so far. Looking forward to working alongside you all over the coming months. Best regards, Elia
