Jitesh Soni created SPARK-54305:
-----------------------------------

             Summary:   Add admission control support to Python DataSource 
streaming API
                 Key: SPARK-54305
                 URL: https://issues.apache.org/jira/browse/SPARK-54305
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL, Structured Streaming
    Affects Versions: 4.0.1, 4.0.0, 4.1.0, 4.2.0
            Reporter: Jitesh Soni
             Fix For: 4.1.0, 4.2.0, 4.0.0


h2. Problem

Python streaming data sources cannot control microbatch sizes because 
DataSourceStreamReader.latestOffset() has no parameters to receive configured 
limits. This forces Python sources to either process all available data 
(unpredictable resource usage) or artificially limit offsets (risking data 
loss).

Scala sources can implement SupportsAdmissionControl to properly control batch 
sizes, but Python sources lack this capability.
 
 h2.  Solution

Extend the Python DataSource API to support admission control by:

1. Enhanced Python API: Update DataSourceStreamReader.latestOffset() to accept 
optional start_offset and read_limit parameters
2. Scala Bridge: Modify PythonMicroBatchStream to implement 
SupportsAdmissionControl
3. Serialization: Add ReadLimit serialization in PythonStreamingSourceRunner
4. Python Worker: Enhance python_streaming_source_runner.py to deserialize and 
pass parameters
5. Monitoring: Add optional reportLatestOffset() method for observability
 
h2. Benefits

 - Predictable performance with controlled batch sizes
 - Rate limiting and backpressure support
 - Feature parity with Scala DataSource capabilities
 - Full backward compatibility (all parameters optional)
 - Production-ready streaming sources

 
h2. Pull Request

PR: [https://github.com/apache/spark/pull/53001]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to