HeartSaVioR commented on code in PR #48124:
URL: https://github.com/apache/spark/pull/48124#discussion_r1857666519


##########
python/pyspark/sql/tests/pandas/test_pandas_transform_with_state.py:
##########
@@ -797,25 +777,22 @@ def check_results(batch_df, batch_id):
             import datetime
 
             if batch_id == 0:
-                assert set(
-                    batch_df.sort("outputTimestamp").select("outputTimestamp", 
"count").collect()
-                ) == {
-                    Row(outputTimestamp=datetime.datetime(1970, 1, 1, 0, 0, 
20), count=1),
-                }
+                assert batch_df.isEmpty()
             elif batch_id == 1:
+                # watermark is 25 - 5 = 20, no more event for eventTime=10

Review Comment:
   No, it's not.
   
   * The watermark for "late events" in batch ID "N" is the watermark for 
"eviction" in batch "N - 1". (default: 0)
   * The watermark for "eviction" in batch "N" is "calculated" based on the 
input data in batch "N - 1".
   
   That said, the watermark for "eviction" in batch 1 is, 15 (max event time 
from batch 0) - 5 = 10. The watermark for "late events" in batch 0 is, 0.
   
   With the same logic, the watermark for "eviction" in batch 2 is, 25 - 5 = 
20. The watermark for "late events" in batch 2 is, the watermark for "eviction" 
in batch 1, hence 10.
   
   Please update the code comment with the correction. I see both comments in 
batch 1 and 2 are incorrect, though somehow the result is the same.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to