HeartSaVioR commented on code in PR #40561:
URL: https://github.com/apache/spark/pull/40561#discussion_r1160970094
##########
python/pyspark/sql/dataframe.py:
##########
@@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame":
         jdf = self._jdf.dropDuplicates(self._jseq(subset))
         return DataFrame(jdf, self.sparkSession)
+    def dropDuplicatesWithinWatermark(self, subset: Optional[List[str]] = None) -> "DataFrame":
+        """Return a new :class:`DataFrame` with duplicate rows removed,
+         optionally only considering certain columns, within watermark.
+
+        For a static batch :class:`DataFrame`, it just drops duplicate rows. For a streaming
+        :class:`DataFrame`, this will keep all data across triggers as intermediate state to drop
+        duplicated rows. The state will be kept to guarantee the semantic, "Events are deduplicated
+        as long as the time distance of earliest and latest events are smaller than the delay
+        threshold of watermark." The watermark for the input :class:`DataFrame` must be set via
+        :func:`withWatermark`. Users are encouraged to set the delay threshold of watermark longer
Review Comment:
(If we want batch queries to have the same behavior, we would also have to
remove the "best effort" part from streaming. For example, we currently
deduplicate an event whenever matching state already exists, without strictly
checking that the events are within the delay threshold. We evict the state at
the end of processing, so we end up deduplicating slightly more events than the
guarantee requires. That may be the better behavior for streaming, but if we
want to guarantee the same result between batch and streaming, the behavior
must be deterministic.)
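
To illustrate the point, here is a toy pure-Python model of the behavior described above. It is only a sketch and not Spark's implementation: `dedup_microbatches`, the `(key, event_time)` tuple shape, the integer `delay`, and the simplified "evict right after each batch" rule are all assumptions made for illustration. It shows how the same events can produce different output depending on where the micro-batch boundaries fall.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (key, event_time_in_seconds); hypothetical shape


def dedup_microbatches(batches: List[List[Event]], delay: int) -> List[List[Event]]:
    """Toy model of best-effort deduplication (not Spark's actual code).

    An event is dropped whenever state exists for its key, without
    checking the time distance. State is evicted only after a batch
    has been fully processed.
    """
    state: Dict[str, int] = {}  # key -> expiration time (first event time + delay)
    max_event_time = 0
    output: List[List[Event]] = []
    for batch in batches:
        emitted: List[Event] = []
        for key, ts in batch:
            if key in state:
                continue  # dropped even if it is further away than `delay`
            state[key] = ts + delay
            emitted.append((key, ts))
        output.append(emitted)
        # End of processing: advance the watermark, then evict expired state.
        if batch:
            max_event_time = max(max_event_time, max(ts for _, ts in batch))
        watermark = max_event_time - delay
        state = {k: exp for k, exp in state.items() if exp > watermark}
    return output


events = [("a", 0), ("b", 30), ("a", 25)]

# All three events in one batch: ("a", 25) is dropped although it is
# 25 seconds after ("a", 0), which exceeds the 10-second delay.
one_batch = dedup_microbatches([events], delay=10)

# Same events split into two batches: the state for "a" is evicted after
# the first batch, so ("a", 25) is emitted in the second one.
two_batches = dedup_microbatches([events[:2], events[2:]], delay=10)

print(one_batch)
print(two_batches)
```

Under these assumptions the result depends on the batch boundaries, which is why matching batch and streaming output would require removing the best-effort part.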
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]