Jitesh Soni created SPARK-54305:
-----------------------------------
Summary: Add admission control support to Python DataSource
streaming API
Key: SPARK-54305
URL: https://issues.apache.org/jira/browse/SPARK-54305
Project: Spark
Issue Type: Improvement
Components: PySpark, SQL, Structured Streaming
Affects Versions: 4.0.1, 4.0.0, 4.1.0, 4.2.0
Reporter: Jitesh Soni
Fix For: 4.1.0, 4.2.0, 4.0.0
h2. Problem
Python streaming data sources cannot control microbatch sizes because
DataSourceStreamReader.latestOffset() has no parameters to receive configured
limits. This forces Python sources to either process all available data
(unpredictable resource usage) or artificially limit offsets (risking data
loss).
Scala sources can implement SupportsAdmissionControl to properly control batch
sizes, but Python sources lack this capability.
h2. Solution
Extend the Python DataSource API to support admission control by:
1. Enhanced Python API: Update DataSourceStreamReader.latestOffset() to accept
optional start_offset and read_limit parameters
2. Scala Bridge: Modify PythonMicroBatchStream to implement
SupportsAdmissionControl
3. Serialization: Add ReadLimit serialization in PythonStreamingSourceRunner
4. Python Worker: Enhance python_streaming_source_runner.py to deserialize and
pass parameters
5. Monitoring: Add optional reportLatestOffset() method for observability
h2. Benefits
- Predictable performance with controlled batch sizes
- Rate limiting and backpressure support
- Feature parity with Scala DataSource capabilities
- Full backward compatibility (all parameters optional)
- Production-ready streaming sources
h2. Pull Request
PR: [https://github.com/apache/spark/pull/53001]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]