HeartSaVioR commented on code in PR #40561:
URL: https://github.com/apache/spark/pull/40561#discussion_r1159396995
##########
python/pyspark/sql/dataframe.py:
##########
@@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = None) -> "DataFrame":
         jdf = self._jdf.dropDuplicates(self._jseq(subset))
         return DataFrame(jdf, self.sparkSession)
+    def dropDuplicatesWithinWatermark(self, subset: Optional[List[str]] = None) -> "DataFrame":
Review Comment:
   Tying this back to the Scala API discussed above: it is important to decide which API we want this one to be consistent with. I see this API as a sibling of the existing `dropDuplicates` API, and I would like it to provide the same UX.
   For example, in a batch query, someone can simply append the suffix to the method name in an existing query (`dropDuplicates` -> `dropDuplicatesWithinWatermark`) and expect the query to run without any issue. That holds today regardless of the parameter type they use.
   In a streaming query, they may already have a query using `dropDuplicates` but have been suffering from the issue I mentioned in the section `Why are the changes needed?`. In most cases, we expect them to simply rename the method in the existing query, discard the checkpoint, and then the problem is fixed.
   I hope this is enough rationale for making the new API consistent with `dropDuplicates`.
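   To make the "same UX" point concrete, here is a minimal, hypothetical sketch (plain Python, no Spark dependency) of the kind of shared argument normalization that lets both `dropDuplicates` and a sibling `dropDuplicatesWithinWatermark` accept the same parameter shapes, so a caller can rename the method without touching its arguments. The helper name `_normalize_subset` and the accepted shapes are assumptions for illustration, not the actual PySpark internals.

   ```python
   from typing import List, Optional, Tuple, Union

   def _normalize_subset(
       subset: Optional[Union[str, List[str], Tuple[str, ...]]]
   ) -> Optional[List[str]]:
       """Hypothetical helper: coerce `subset` into the single shape the
       JVM-side call expects. If both APIs funnel through a helper like
       this, renaming dropDuplicates -> dropDuplicatesWithinWatermark
       needs no change to the arguments already being passed."""
       if subset is None:
           # No subset given: deduplicate on all columns.
           return None
       if isinstance(subset, str):
           # Accept a bare column name for convenience.
           return [subset]
       if not all(isinstance(c, str) for c in subset):
           raise TypeError("subset must contain only column names (str)")
       return list(subset)
   ```

   The point of routing both methods through one normalization path is exactly the sibling-API consistency argued for above: whatever argument style works for one method works, unchanged, for the other.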
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]