[GitHub] [spark] HeartSaVioR commented on a change in pull request #29461: [SPARK-32456][SS][FOLLOWUP] Update doc to note about using SQL statement with streaming Dataset

GitBox Thu, 03 Sep 2020 23:49:17 -0700


HeartSaVioR commented on a change in pull request #29461:
URL: https://github.com/apache/spark/pull/29461#discussion_r483420929




##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -861,6 +861,10 @@ isStreaming(df)
 </div>
 </div>
 
+You may want to check the logical plan of the query, as Spark converts the 
operation into another operation, which includes adding streaming aggregation. 
(e.g. count, distinct, union, etc.)

Review comment:
The point is whether Spark injects a streaming aggregation whose state end
users then have to maintain, and that can be checked by looking at the logical
plan, right? I didn't mean that they need to find the distinct in the logical
plan or work out how Spark rewrites the operation. They just need to check for
stateful operations.

SQL distinct and Dataset dropDuplicates aren't the only difference. SQL union
and Dataset union are also different. The set of such cases can grow or shrink
as the Spark Catalyst rules change, so we can't guarantee that the doc stays
in sync.
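
As a rough illustration of "just check for stateful operations", here is a minimal sketch that scans plan text for operator names. The helper `find_stateful_operators` and the `STATEFUL_MARKERS` list are hypothetical, and the marker strings are an assumption about how Spark prints stateful operators; they can differ between Spark versions.

```python
# Assumed marker names: substrings that Spark's plan output is expected to
# contain for stateful streaming operators. This list is illustrative only;
# it is neither exhaustive nor guaranteed to be stable across versions.
STATEFUL_MARKERS = (
    "StateStoreSave",
    "StateStoreRestore",
    "StreamingDeduplicate",
    "StreamingSymmetricHashJoin",
    "FlatMapGroupsWithState",
)


def find_stateful_operators(plan_text, markers=STATEFUL_MARKERS):
    """Return the markers found in plan_text, in the order given by markers."""
    return [marker for marker in markers if marker in plan_text]
```

With a running session, `plan_text` could be the text printed by `explain()` for the streaming Dataset or query. An empty result only means that none of the assumed markers appeared, not that the query is guaranteed to be stateless.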




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

