[PR] [SPARK-56367][SS][PYTHON][DOCS] Fix latestOffset docstring, update tutorial signature, and add Trigger.AvailableNow documentation [spark]

via GitHub Fri, 12 Jun 2026 11:06:29 -0700


brijrajk opened a new pull request, #56473:
URL: https://github.com/apache/spark/pull/56473


   ### What changes were proposed in this pull request?
   
   Three documentation fixes in the PySpark streaming data source API:
   
   **1. Fix docstring bug in `DataSourceStreamReader.latestOffset()` (`datasource.py:759`)**
   
   `limit.maxRows` → `limit.max_rows`
   
   The `ReadMaxRows` dataclass uses the Python snake_case field name `max_rows`. Users copying the docstring example would get `AttributeError: 'ReadMaxRows' object has no attribute 'maxRows'` at runtime.
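
For illustration, a minimal sketch of the failure mode. The `ReadMaxRows` class below is a hypothetical stand-in dataclass, not the real PySpark class; it only assumes what this PR states, namely that the field is named `max_rows`.

```python
from dataclasses import dataclass


@dataclass
class ReadMaxRows:
    """Hypothetical stand-in; only the snake_case field name is taken from the PR."""

    max_rows: int


limit = ReadMaxRows(max_rows=100)

# The corrected docstring spelling reads the field as expected.
rows_allowed = limit.max_rows

# The old camelCase spelling raises AttributeError at runtime.
try:
    limit.maxRows
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
```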
   
   **2. Update outdated `latestOffset` signature in tutorial (`python_data_source.rst`)**
   
   `def latestOffset(self) -> dict:` → `def latestOffset(self, start: dict, limit: ReadLimit) -> dict:`
   
   The parameterless signature has been deprecated since SPARK-55304. The tutorial should guide new users toward the recommended signature, which supports admission control. The type annotation for `limit` was added per reviewer feedback on the prior PR (#55227).
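
A rough sketch of how a reader might use the two-argument signature for admission control. Everything here is an assumption made for illustration: the `ReadLimit`, `ReadAllAvailable` and `ReadMaxRows` classes are local stand-ins rather than PySpark imports, and `CounterStreamReader` with its `available_rows` counter and `{"offset": n}` layout is a hypothetical source, not the tutorial's example.

```python
from dataclasses import dataclass


class ReadLimit:
    """Stand-in base class for read limits (assumption, not the PySpark class)."""


@dataclass
class ReadAllAvailable(ReadLimit):
    """Stand-in for 'no cap on this micro-batch'."""


@dataclass
class ReadMaxRows(ReadLimit):
    """Stand-in for a row cap; the field name `max_rows` follows the PR."""

    max_rows: int


class CounterStreamReader:
    """Hypothetical source whose offsets are plain row counts."""

    def __init__(self, available_rows: int) -> None:
        self.available_rows = available_rows

    def initialOffset(self) -> dict:
        return {"offset": 0}

    def latestOffset(self, start: dict, limit: ReadLimit) -> dict:
        # Without a cap, read up to everything currently available.
        end = self.available_rows
        # With a row cap, advance at most `max_rows` past the start offset.
        if isinstance(limit, ReadMaxRows):
            end = min(end, start["offset"] + limit.max_rows)
        return {"offset": end}


reader = CounterStreamReader(available_rows=250)
capped = reader.latestOffset({"offset": 0}, ReadMaxRows(max_rows=100))
tail = reader.latestOffset({"offset": 200}, ReadMaxRows(max_rows=100))
uncapped = reader.latestOffset({"offset": 0}, ReadAllAvailable())
```

The parameterless form has no `start` or `limit` to consult, so a cap of this kind cannot be expressed with it.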
   
   **3. Add `Trigger.AvailableNow` documentation section (`python_data_source.rst`)**
   
   New section showing how to implement `SupportsTriggerAvailableNow` for finite batch processing: `prepareForTriggerAvailableNow()` captures the target offset at query start, and `latestOffset()` must not advance past that offset so that the query terminates.
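
The interaction between the two methods can be sketched roughly as below. This is not the text added to the tutorial; `AvailableNowCounterReader`, its `available_rows` counter and the `{"offset": n}` layout are hypothetical, and the class deliberately does not inherit from the real PySpark base classes so that the sketch stays self-contained.

```python
from typing import Optional


class AvailableNowCounterReader:
    """Hypothetical reader; a real one would also extend the PySpark reader classes."""

    def __init__(self) -> None:
        self.available_rows = 0  # grows as the source receives new data
        self._target: Optional[int] = None  # set once at query start

    def prepareForTriggerAvailableNow(self) -> None:
        # Snapshot how much data exists when the query starts.
        self._target = self.available_rows

    def latestOffset(self, start: dict, limit: object) -> dict:
        end = self.available_rows
        # Never report an offset beyond the snapshot, even if more data arrived.
        if self._target is not None:
            end = min(end, self._target)
        return {"offset": end}


reader = AvailableNowCounterReader()
reader.available_rows = 10
reader.prepareForTriggerAvailableNow()
reader.available_rows = 25  # data arriving after query start

first = reader.latestOffset({"offset": 0}, None)
second = reader.latestOffset(first, None)
```

Once the returned offset stops advancing past `start`, there is nothing left to read, which is what lets the query finish instead of chasing newly arriving data.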
   
   ### Why are the changes needed?
   
   - Fix 1: The docstring example fails at runtime; users copying it will get `AttributeError`.
   - Fix 2: The tutorial teaches the deprecated API instead of the recommended signature.
   - Fix 3: `Trigger.AvailableNow` support was undiscoverable because no tutorial guidance existed.
   
   These issues were originally identified during review of SPARK-55450 and 
tracked in SPARK-56367.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Documentation and docstring fixes only.
   
   ### How was this patch tested?
   
   No functional code change; documentation and docstrings only. Verified that:
   - the `ReadMaxRows` dataclass uses the `max_rows` field name
   - `SupportsTriggerAvailableNow` and `prepareForTriggerAvailableNow()` exist in `python/pyspark/sql/streaming/datasource.py`
   - `latestOffset(self, start, limit)` is the recommended signature per SPARK-55304
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude (Anthropic)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

