HeartSaVioR commented on code in PR #54085:
URL: https://github.com/apache/spark/pull/54085#discussion_r2758900088


##########
python/pyspark/sql/datasource.py:
##########
@@ -714,9 +715,37 @@ def initialOffset(self) -> dict:
             messageParameters={"feature": "initialOffset"},
         )
 
-    def latestOffset(self) -> dict:
+    def latestOffset(self, start: dict, limit: ReadLimit) -> dict:
         """
-        Returns the most recent offset available.
+        Returns the most recent offset available given a read limit. The start offset can be used
+        to figure out how much new data should be read given the limit.
+
+        The `start` will be provided from the return value of :meth:`initialOffset()` for
+        the very first micro-batch, and the offset continues from the last micro-batch for the
+        following. The source can return the same offset as start offset if there is no data to
Review Comment:
   I'm slightly not in favor of coupling the contract to the engine's behavior, since it could limit us later. (I found myself doing this, and I'm even OK with omitting it.) But the explanation from your comment is unlikely to change, so it makes sense to me. Thanks for the suggestion.
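
   For context, here is a minimal sketch of how a source might honor the `latestOffset(start, limit)` contract described in the docstring. The `ReadLimit` dataclass and its `max_rows` field are stand-ins for illustration, not the actual pyspark classes from this PR:

   ```python
   from dataclasses import dataclass


   @dataclass
   class ReadLimit:
       # Stand-in for the engine-provided read limit; the real class
       # in the PR may expose different fields.
       max_rows: int


   class CounterStreamReader:
       """Toy source that emits one row per integer offset."""

       def __init__(self, available: int):
           self.available = available  # highest offset that has data

       def initialOffset(self) -> dict:
           return {"offset": 0}

       def latestOffset(self, start: dict, limit: ReadLimit) -> dict:
           # Advance at most limit.max_rows past the start offset, but
           # never beyond what is actually available. Returning the same
           # offset as `start` signals "no new data to read".
           end = min(start["offset"] + limit.max_rows, self.available)
           return {"offset": max(end, start["offset"])}


   reader = CounterStreamReader(available=10)
   first = reader.initialOffset()
   batch1 = reader.latestOffset(first, ReadLimit(max_rows=3))
   # batch1 == {"offset": 3}; calling again from offset 10 returns {"offset": 10}
   ```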
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

