[PR] Add foundational API for Unbounded Sources in Python SDK [beam]

via GitHub Tue, 03 Mar 2026 04:19:34 -0800


shaheeramjad opened a new pull request, #37747:
URL: https://github.com/apache/beam/pull/37747


   This PR introduces the foundational API for unbounded sources in the Apache 
Beam Python SDK, laying the groundwork for reading continuous data streams. It 
is a partial implementation (roughly 30-40% of the planned work) that covers 
only the core abstractions; SDF integration and full Read transform support are 
planned for future work.
   
   Addresses #19137
   
   ## Changes
   
   ### Added to `iobase.py`
   
   #### `CheckpointMark`
   Abstract base class representing a position/state in an unbounded stream
   - Supports checkpoint-based recovery for fault tolerance
   - Implementations can acknowledge processed data to external systems
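
To make the checkpoint contract concrete, here is a minimal sketch. It does not import the classes added by this PR: the `CheckpointMark` below is a local stand-in, and the `finalize()` method name is an assumption inferred from the test name `test_checkpoint_mark_finalize()`. `OffsetCheckpointMark` and `ack_log` are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod


class CheckpointMark(ABC):
    """Local stand-in for the CheckpointMark described above (assumed shape)."""

    @abstractmethod
    def finalize(self) -> None:
        """Called once the data read up to this mark is durably committed."""


class OffsetCheckpointMark(CheckpointMark):
    """Hypothetical mark for a queue addressed by integer offsets."""

    def __init__(self, offset: int, ack_log: list):
        self.offset = offset
        self._ack_log = ack_log  # stands in for an external system
        self.finalized = False

    def finalize(self) -> None:
        # Acknowledge only once, so a retried finalize is harmless.
        if not self.finalized:
            self._ack_log.append(self.offset)
            self.finalized = True


acks = []
mark = OffsetCheckpointMark(offset=41, ack_log=acks)
mark.finalize()
mark.finalize()  # a second call does not acknowledge again
```

Making `finalize()` idempotent is a design choice of this sketch, not something the PR description states; it is a common precaution because a runner may retry finalization.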
   
   #### `UnboundedReader`
   Abstract reader protocol for continuous data sources
   - `start()` / `advance()` - Iterator-like interface for fetching records
   - `get_current()` / `get_current_timestamp()` - Access to data and event time
   - `get_watermark()` - Reports progress as a lower bound on the timestamps of future records
   - `get_checkpoint_mark()` - Creates resumable checkpoints
   - `close()` - Resource cleanup
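
The reader protocol above could be exercised roughly as follows. This is a sketch under stated assumptions, not the code from this PR: `UnboundedReader` is redefined locally with the method names listed above, the boolean return values of `start()` and `advance()` are assumed by analogy with the Java SDK, and `ListReader` is a hypothetical in-memory reader whose checkpoint is a plain integer index.

```python
from abc import ABC, abstractmethod


class UnboundedReader(ABC):
    """Local stand-in mirroring the method names listed above."""

    @abstractmethod
    def start(self) -> bool: ...

    @abstractmethod
    def advance(self) -> bool: ...

    @abstractmethod
    def get_current(self): ...

    @abstractmethod
    def get_current_timestamp(self): ...

    @abstractmethod
    def get_watermark(self): ...

    @abstractmethod
    def get_checkpoint_mark(self): ...

    @abstractmethod
    def close(self) -> None: ...


class ListReader(UnboundedReader):
    """Hypothetical reader over (timestamp, value) pairs sorted by timestamp."""

    def __init__(self, records, start_index=0):
        self._records = records
        self._next = start_index
        self._current = None
        self.closed = False

    def start(self) -> bool:
        return self.advance()

    def advance(self) -> bool:
        if self._next >= len(self._records):
            return False  # nothing available right now; more may arrive later
        self._current = self._records[self._next]
        self._next += 1
        return True

    def get_current(self):
        return self._current[1]

    def get_current_timestamp(self):
        return self._current[0]

    def get_watermark(self):
        # Records are sorted, so no future record is older than the last one read.
        if self._current is None:
            return float("-inf")
        return self._current[0]

    def get_checkpoint_mark(self):
        return self._next  # index of the next unread record

    def close(self) -> None:
        self.closed = True


reader = ListReader([(10, "a"), (20, "b"), (30, "c")])
seen = []
available = reader.start()
while available:
    seen.append((reader.get_current_timestamp(), reader.get_current()))
    available = reader.advance()
watermark = reader.get_watermark()
checkpoint = reader.get_checkpoint_mark()
reader.close()
```

For an unbounded reader, a `False` return is read here as "no record available yet" rather than "end of input", which is why a real caller would poll again instead of stopping.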
   
   #### `UnboundedSource`
   Main abstraction for unbounded data sources
   - `reader(checkpoint)` - Creates reader, optionally resuming from checkpoint
   - `split(desired_num_splits)` - Enables parallel reading across multiple 
workers
   - `is_bounded()` - Returns `False` to distinguish from bounded sources
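
A sketch of how a custom source might implement these three methods is shown below. Everything here is illustrative: `UnboundedSource` is a local stand-in, the default `split()` returning `[self]` is an assumption (the PR description does not say what the default does), and `CounterSource` / `CounterReader` are hypothetical classes that emit integers, using the next integer as the checkpoint.

```python
from abc import ABC, abstractmethod


class UnboundedSource(ABC):
    """Local stand-in mirroring the methods listed above."""

    @abstractmethod
    def reader(self, checkpoint=None): ...

    def split(self, desired_num_splits):
        # Assumed default for this sketch: the source is not split.
        return [self]

    def is_bounded(self):
        return False


class CounterReader:
    """Hypothetical reader that emits start, start + step, ... without end."""

    def __init__(self, start, step):
        self._next = start
        self._step = step
        self._current = None

    def start(self):
        return self.advance()

    def advance(self):
        self._current = self._next
        self._next += self._step
        return True

    def get_current(self):
        return self._current

    def get_checkpoint_mark(self):
        return self._next  # the first value not yet emitted


class CounterSource(UnboundedSource):
    def __init__(self, start=0, step=1):
        self._start = start
        self._step = step

    def reader(self, checkpoint=None):
        # Resume from the checkpoint when one is supplied.
        start = self._start if checkpoint is None else checkpoint
        return CounterReader(start, self._step)

    def split(self, desired_num_splits):
        # Shard i emits every n-th value, offset by i steps.
        n = max(1, desired_num_splits)
        return [
            CounterSource(self._start + i * self._step, self._step * n)
            for i in range(n)
        ]


source = CounterSource()
reader = source.reader()
reader.start()
reader.advance()
checkpoint = reader.get_checkpoint_mark()

resumed = source.reader(checkpoint)
resumed.start()
first_after_resume = resumed.get_current()

shard_firsts = []
for shard in source.split(3):
    shard_reader = shard.reader()
    shard_reader.start()
    shard_firsts.append(shard_reader.get_current())
```

The resumed reader starts at the first value the original reader had not yet emitted, and the three shards together cover every integer exactly once.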
   
   ### Testing
   
   Added `UnboundedSourceTest` class with 3 test cases:
   - `test_checkpoint_mark_finalize()` - Validates CheckpointMark lifecycle
   - `test_unbounded_source_basic_interface()` - Verifies reader operations
   - `test_unbounded_source_split_default()` - Confirms default split behavior
   
   ## Scope
   
   ### Included in this PR (30-40%)
   - Core API classes with clear contracts and documentation
   - Basic test coverage for API interface
   - `CHANGES.md` entry
   
   ### Future Work (60-70%)
   - SDF (Splittable DoFn) wrapper classes for runner integration
   - Integration with `beam.io.Read()` transform
   - Restriction tracker and provider for advanced splitting
   - Full pipeline integration tests
   - Real-world source examples (Kafka, Pub/Sub, etc.)
   
   ## Motivation
   
   This foundational work establishes the Python API contract for unbounded 
sources, aligning with Apache Beam's Java SDK design. It provides a clear path 
for implementing custom unbounded sources like message queues, database change 
streams, and sensor feeds.
   
   ## Checklist
   
   - [x] Mentioned the appropriate issue in description
   - [x] Updated `CHANGES.md` with noteworthy changes
   - [ ] Added examples (deferred to future PR when Read transform integration 
is complete)
   
   ## Notes
   
   This PR lays the groundwork for 
[Issue #19137](https://github.com/apache/beam/issues/19137). Subsequent PRs will add:
   1. SDF-based implementation wrapper
   2. Read transform integration
   3. Example unbounded source implementations
   4. Comprehensive integration tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

