huanliwang-db commented on code in PR #54085:
URL: https://github.com/apache/spark/pull/54085#discussion_r2757504719
##########
python/pyspark/sql/datasource.py:
##########
@@ -714,9 +715,37 @@ def initialOffset(self) -> dict:
messageParameters={"feature": "initialOffset"},
)
- def latestOffset(self) -> dict:
+ def latestOffset(self, start: dict, limit: ReadLimit) -> dict:
"""
- Returns the most recent offset available.
+ Returns the most recent offset available given a read limit. The start offset can be used
+ to figure out how much new data should be read given the limit.
+
+ The `start` will be provided from the return value of :meth:`initialOffset()` for
+ the very first micro-batch, and the offset continues from the last micro-batch for the
+ following. The source can return the same offset as start offset if there is no data to
Review Comment:
maybe "and for subsequent micro-batches, the start offset is the ending
offset from the previous micro-batch." is better iirc?
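
For context, here is a minimal, self-contained sketch of how a source might satisfy the contract described in the proposed docstring. The class name, the dict-based `ReadLimit` shape (`{"maxRows": ...}`), and the offset layout are illustrative assumptions, not the actual PySpark API:

```python
# Hypothetical sketch of the latestOffset(start, limit) contract discussed
# above. CounterStreamReader and the {"maxRows": n} limit shape are
# assumptions for illustration only.
class CounterStreamReader:
    """Toy source emitting one row per integer up to `available`."""

    def __init__(self, available: int):
        self.available = available  # highest offset currently readable

    def initialOffset(self) -> dict:
        return {"offset": 0}

    def latestOffset(self, start: dict, limit: dict) -> dict:
        # For the first micro-batch, `start` is initialOffset(); for
        # subsequent micro-batches it is the ending offset of the
        # previous micro-batch (per the review comment).
        max_rows = limit.get("maxRows")
        end = self.available
        if max_rows is not None:
            end = min(end, start["offset"] + max_rows)
        # If there is no new data, return the start offset unchanged.
        return {"offset": max(end, start["offset"])}
```

Under this sketch, the engine would read the range `(start, latestOffset(start, limit)]` each micro-batch, and an unchanged return value signals an empty batch.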
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]