cloud-fan commented on a change in pull request #29461:
URL: https://github.com/apache/spark/pull/29461#discussion_r484873531
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2548,6 +2553,10 @@ class Dataset[T] private[sql](
* the state. In addition, too late data older than watermark will be dropped to avoid any
* possibility of duplicates.
*
+ * Note that for a streaming [[Dataset]], this method only returns distinct rows only once
+ * regardless of the output mode, which the behavior may not be same with using distinct in
+ * SQL statement against streaming [[Dataset]].
Review comment:
Ditto: there is no `dropDuplicates` in SQL. If we want to highlight the
differences between the SQL API and the Dataset API, let's update the doc of
`Dataset.distinct` instead.