HeartSaVioR commented on code in PR #48124:
URL: https://github.com/apache/spark/pull/48124#discussion_r1857666519
##########
python/pyspark/sql/tests/pandas/test_pandas_transform_with_state.py:
##########
@@ -797,25 +777,22 @@ def check_results(batch_df, batch_id):
import datetime
if batch_id == 0:
- assert set(
- batch_df.sort("outputTimestamp").select("outputTimestamp",
"count").collect()
- ) == {
- Row(outputTimestamp=datetime.datetime(1970, 1, 1, 0, 0,
20), count=1),
- }
+ assert batch_df.isEmpty()
elif batch_id == 1:
+ # watermark is 25 - 5 = 20, no more event for eventTime=10
Review Comment:
No, it's not.
* The watermark for "late events" in batch ID "N" is the watermark for
"eviction" in batch "N - 1". (default: 0)
* The watermark for "eviction" in batch "N" is "calculated" based on the
input data in batch "N - 1".
That said, the watermark for "eviction" in batch 1 is, 15 (max event time
from batch 0) - 5 = 10. The watermark for "late events" in batch 0 is, 0.
With the same logic, the watermark for "eviction" in batch 2 is, 25 - 5 =
20. The watermark for "late events" in batch 2 is, the watermark for "eviction"
in batch 1, hence 10.
Please update the code comment with the correction. I see both comments in
batch 1 and 2 are incorrect.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]