shaheeramjad opened a new pull request, #37747: URL: https://github.com/apache/beam/pull/37747
This PR introduces the foundational API for unbounded sources in the Apache Beam Python SDK, enabling support for continuous data streams. This is a **30-40% implementation** focusing on core abstractions, with SDF integration and full Read transform support planned for future work. Addresses #19137 ## Changes ### Added to `iobase.py` #### `CheckpointMark` Abstract base class representing a position/state in an unbounded stream - Supports checkpoint-based recovery for fault tolerance - Implementations can acknowledge processed data to external systems #### `UnboundedReader` Abstract reader protocol for continuous data sources - `start()` / `advance()` - Iterator-like interface for fetching records - `get_current()` / `get_current_timestamp()` - Access to data and event time - `get_watermark()` - Tracks progress with low watermark on future timestamps - `get_checkpoint_mark()` - Creates resumable checkpoints - `close()` - Resource cleanup #### `UnboundedSource` Main abstraction for unbounded data sources - `reader(checkpoint)` - Creates reader, optionally resuming from checkpoint - `split(desired_num_splits)` - Enables parallel reading across multiple workers - `is_bounded()` - Returns `False` to distinguish from bounded sources ### Testing Added `UnboundedSourceTest` class with 3 test cases: - `test_checkpoint_mark_finalize()` - Validates CheckpointMark lifecycle - `test_unbounded_source_basic_interface()` - Verifies reader operations - `test_unbounded_source_split_default()` - Confirms default split behavior ## Scope ### Included in this PR (30-40%) - Core API classes with clear contracts and documentation - Basic test coverage for API interface - `CHANGES.md` entry ### Future Work (60-70%) - SDF (Splittable DoFn) wrapper classes for runner integration - Integration with `beam.io.Read()` transform - Restriction tracker and provider for advanced splitting - Full pipeline integration tests - Real-world source examples (Kafka, Pub/Sub, etc.) ## Motivation This foundational work establishes the Python API contract for unbounded sources, aligning with Apache Beam's Java SDK design. It provides a clear path for implementing custom unbounded sources like message queues, database change streams, and sensor feeds. ## Checklist - [x] Mentioned the appropriate issue in description - [x] Updated `CHANGES.md` with noteworthy changes - [ ] Added examples (deferred to future PR when Read transform integration is complete) ## Notes This PR lays the groundwork for [Issue #19137](https://github.com/apache/beam/issues/19137). Subsequent PRs will add: 1. SDF-based implementation wrapper 2. Read transform integration 3. Example unbounded source implementations 4. Comprehensive integration tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
