[GitHub] [spark] rangadi commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

via GitHub Fri, 07 Apr 2023 09:03:28 -0700


rangadi commented on code in PR #40561:
URL: https://github.com/apache/spark/pull/40561#discussion_r1160800555



##########
python/pyspark/sql/dataframe.py:
##########
@@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame":
             jdf = self._jdf.dropDuplicates(self._jseq(subset))
         return DataFrame(jdf, self.sparkSession)
 
+    def dropDuplicatesWithinWatermark(self, subset: Optional[List[str]] = None) -> "DataFrame":
+        """Return a new :class:`DataFrame` with duplicate rows removed,
+         optionally only considering certain columns, within watermark.
+
+        For a static batch :class:`DataFrame`, it just drops duplicate rows. For a streaming
+        :class:`DataFrame`, this will keep all data across triggers as intermediate state to drop
+        duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated
+        as long as the time distance of earliest and latest events are smaller than the delay
+        threshold of watermark." The watermark for the input :class:`DataFrame` must be set via
+        :func:`withWatermark`. Users are encouraged to set the delay threshold of watermark longer

Review Comment:
   dropDuplicates() does not guarantee the exact same output between batch and
streaming either. No stateful operation guarantees that in the presence of late
records. What is the difference here? It would be better to support batch in
the same manner as dropDuplicates().
   I don't think it is good UX for customers to get errors first and for us to
fix that later by relaxing the restriction.
   But I will leave the decision to you.
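
   For illustration only, here is a minimal pure-Python sketch of the semantic
quoted in the docstring above, next to a plain batch deduplication. It is not
Spark's implementation. The helper names (dedup_batch,
dedup_within_watermark), the per-event watermark update, and the sample
events are all assumptions made for this example; Spark advances the
watermark per micro-batch, not per event.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical event shape for this sketch: (key, event_time_in_seconds).
Event = Tuple[str, int]


def dedup_batch(events: List[Event]) -> List[Event]:
    """Batch-style deduplication: keep the first event seen for each key."""
    seen = set()
    out: List[Event] = []
    for key, ts in events:
        if key not in seen:
            seen.add(key)
            out.append((key, ts))
    return out


def dedup_within_watermark(events: List[Event], delay: int) -> List[Event]:
    """Streaming-style sketch: per-key state is kept until the watermark passes
    (first event time + delay), then evicted.

    Assumption: the watermark is recomputed after every event as
    max(event time seen so far) - delay.
    """
    expiry: Dict[str, int] = {}  # key -> time at which its state may be dropped
    max_ts: Optional[int] = None
    out: List[Event] = []
    for key, ts in events:
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - delay
        # Evict state whose expiry time is at or behind the watermark.
        for expired in [k for k, exp in expiry.items() if exp <= watermark]:
            del expiry[expired]
        if key not in expiry:
            # First event for this key (or its earlier state was evicted): emit it.
            expiry[key] = ts + delay
            out.append((key, ts))
    return out


events = [("a", 0), ("a", 5), ("b", 30), ("a", 31)]
batch_result = dedup_batch(events)
stream_result = dedup_within_watermark(events, delay=10)
```

   With these sample events, the batch version keeps one row for "a", while
the streaming sketch emits "a" a second time at t=31 because its state was
evicted once the watermark moved past t=10. This is the kind of batch versus
streaming difference the comment above refers to.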



