[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

via GitHub Wed, 05 Apr 2023 22:58:34 -0700


HeartSaVioR commented on code in PR #40561:
URL: https://github.com/apache/spark/pull/40561#discussion_r1159313795



##########
python/pyspark/sql/dataframe.py:
##########
@@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame":
             jdf = self._jdf.dropDuplicates(self._jseq(subset))
         return DataFrame(jdf, self.sparkSession)
 
+    def dropDuplicatesWithinWatermark(self, subset: Optional[List[str]] = None) -> "DataFrame":
+        """Return a new :class:`DataFrame` with duplicate rows removed,
+         optionally only considering certain columns, within watermark.
+
+        For a static batch :class:`DataFrame`, it just drops duplicate rows. For a streaming
+        :class:`DataFrame`, this will keep all data across triggers as intermediate state to drop
+        duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated
+        as long as the time distance of earliest and latest events are smaller than the delay
+        threshold of watermark." The watermark for the input :class:`DataFrame` must be set via
+        :func:`withWatermark`. Users are encouraged to set the delay threshold of watermark longer
+        than max timestamp differences among duplicated events. In addition, too late data older
+        than watermark will be dropped.
+
+         .. versionadded:: 3.5.0
+
+         Parameters
+         ----------
+         subset : List of column names, optional
+             List of columns to use for duplicate comparison (default All columns).
+
+         Returns
+         -------
+         :class:`DataFrame`
+             DataFrame without duplicates.
+
+         Examples
+         --------
+         >>> from pyspark.sql import Row
+         >>> df = spark.createDataFrame([
+         ...     Row(name='Alice', age=5, height=80),
+         ...     Row(name='Alice', age=5, height=80),
+         ...     Row(name='Alice', age=10, height=80)
+         ... ])
+
+         Deduplicate the same rows.
+
+         >>> df.dropDuplicatesWithinWatermark().show()

Review Comment:
   I can't find an existing example that puts code in the Examples section 
without having it evaluated as a test. It would be tricky if we had to run a 
streaming query from there and also validate its output.
   
   @HyukjinKwon Could you kindly give me some guidance on the PySpark API docs? 
Is the code in the Examples section still executed as tests? If so, is there a 
way to skip execution for several lines, or for a whole method? Thanks in advance!
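
   For reference, a minimal sketch of one way to skip individual lines, 
assuming the Examples section is run through Python's standard `doctest` 
module: the `# doctest: +SKIP` directive keeps an example in the rendered 
docs but never executes it. The names `dedup_names` and 
`start_streaming_query` are hypothetical and only serve the illustration; 
they are not part of this PR.

```python
import doctest


def dedup_names(names):
    """Return names with duplicates removed, keeping first occurrences.

    Examples
    --------
    >>> dedup_names(["Alice", "Alice", "Bob"])
    ['Alice', 'Bob']

    The next line is shown in the docs but not executed, thanks to the
    ``+SKIP`` directive (imagine it needs a running streaming query).

    >>> start_streaming_query()  # doctest: +SKIP
    """
    # dict preserves insertion order, so this keeps the first occurrence.
    return list(dict.fromkeys(names))


# Parse and run the docstring examples explicitly instead of relying on
# doctest.testmod(), so the sketch behaves the same wherever it is run.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    dedup_names.__doc__, {"dedup_names": dedup_names}, "dedup_names", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

# start_streaming_query is undefined, so a non-skipped call would fail
# with NameError; zero failures shows the line really was skipped.
print("failed examples:", results.failed)
```

   The directive applies per example line, so a per-method skip would mean 
tagging every executable line in that docstring.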



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


