xuanyuanking commented on a change in pull request #29461:
URL: https://github.com/apache/spark/pull/29461#discussion_r483414767



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2525,14 +2525,19 @@ class Dataset[T] private[sql](
 
   /**
    * Returns a new Dataset that contains only the unique rows from this Dataset.
-   * This is an alias for `distinct`.
+   * This is an alias for `distinct` on batch [[Dataset]]. For streaming [[Dataset]], it would show
+   * slightly different behavior. (see below)
    *
    * For a static batch [[Dataset]], it just drops duplicate rows. For a streaming [[Dataset]], it
    * will keep all data across triggers as intermediate state to drop duplicates rows. You can use
    * [[withWatermark]] to limit how late the duplicate data can be and system will accordingly limit
    * the state. In addition, too late data older than watermark will be dropped to avoid any
    * possibility of duplicates.
    *
+   * Note that for a streaming [[Dataset]], this method only returns distinct rows only once
+   * regardless of the output mode, which the behavior may not be same with using distinct in

Review comment:
       +1 for the second version. Actually, the original comment is also
confusing; we need to emphasize the distinct clause in SQL.

##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -861,6 +861,10 @@ isStreaming(df)
 </div>
 </div>
 
+You may want to check the logical plan of the query, as Spark converts the operation into another operation, which includes adding streaming aggregation. (e.g. count, distinct, union, etc.)

Review comment:
       I think the operation conversion is internal behavior, so asking the
end user to check it may not be clear enough. How about we just comment on the
behavior difference between SQL distinct and Dataset dropDuplicates? WDYT?



