HeartSaVioR commented on a change in pull request #24890: 
[SPARK-28074][DOC][SS] Document caveats on using multiple stateful operations 
in single query
URL: https://github.com/apache/spark/pull/24890#discussion_r295074607
 
 

 ##########
 File path: docs/structured-streaming-programming-guide.md
 ##########
 @@ -3146,6 +3146,17 @@ See [Input Sources](#input-sources) and [Output 
Sinks](#output-sinks) sections f
       - After `coalesce`, the number of (reduced) tasks will be kept unless 
another shuffle happens.
   - `spark.sql.streaming.stateStore.providerClass`: To read the previous state 
of the query properly, the class of state store provider should be unchanged.
   - `spark.sql.streaming.multipleWatermarkPolicy`: Modification of this would 
lead inconsistent watermark value when query contains multiple watermarks, 
hence the policy should be unchanged.
+- Structured Streaming uses `global watermark` which might impact on query 
having multiple stateful operations.
+  - Stateful operators: aggregation, deduplication, stream-stream join, 
(flat)mapGroupsWithState
+  - You should be able to answer below questions for your query to get correct 
outputs:
+    - How global watermark is calculated on your query?
+    - How global watermark is applied to each stateful operator?
+    - Is there any intermediate output being discarded as "late input" due to 
watermark?
+  - Fail to answer above questions might lead to incorrect outputs - e.g. 
intermediate outputs being discarded. 
 
 Review comment:
   https://issues.apache.org/jira/browse/SPARK-28094 is an example of the 
problem: I found the correctness issue while helping to resolve that case on 
the user mailing list.
   
   > If my answers are what then what do I do?
   
   The set of questions asks users to check for themselves whether they fully 
understand the details of watermark, because it can cause rows to be discarded 
"unexpectedly" between stateful operators, which is a correctness issue.
   
   So it's like an exam to prevent end users from "just doing it because Spark 
allows it" without fully understanding what they're doing. If they can't answer 
the questions for their query, they should find an alternative approach.
   
   > what do you mean fail to answer the question?
   
   Literally. If they are unsure about any small detail when answering these 
questions, they fail. I know that's hard on end users, but it's better than 
discovering incorrect outputs in a production environment.
   
   > can this be more direct? what's an example of a problem, cause and 
solution?
   
   I'm happy to do it, but I'm not sure how much detail I need to go into. If 
we want a detailed explanation with an example, it may be better to have a 
separate chapter. It might be long enough for a blog post. This is an advanced 
topic (the book "Streaming Systems" devotes a chapter to watermarks), so it may 
not be feasible to add the details to the guide page, but given its importance 
we should mention it. So it looks like a dilemma.
   
   To be honest, there's no "solution", only a workaround, since this is due to 
missing features: watermark propagation and per-operator watermarks for 
stateful operators. https://issues.apache.org/jira/browse/SPARK-26655 tries to 
address this, but not much effort has gone into it due to lack of interest.
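
   To make the "unexpected discard" concrete, here is a toy model in plain 
Python. It is not Spark code and is only a sketch under my own assumptions: 
`WINDOW`, `DELAY` and `run` are hypothetical names, the watermark is taken as 
the maximum event time seen so far minus the delay and is advanced only after 
each batch, and a row counts as late when its event time (the window end, for 
windowed rows) is at or behind the watermark. Empty batches and the 
`multipleWatermarkPolicy` setting are ignored.

```python
from collections import defaultdict

WINDOW = 10   # tumbling window length in seconds (assumed value)
DELAY = 5     # watermark delay in seconds (assumed value)


def run(batches):
    """Feed micro-batches of event times through two chained stateful operators.

    Returns (rows emitted by operator 1, rows accepted by operator 2).
    """
    watermark = 0               # single global watermark shared by both operators
    max_event_time = 0
    counts = defaultdict(int)   # state of operator 1: window end -> row count
    emitted, accepted = [], []

    for batch in batches:
        # Operator 1: windowed count. Inputs whose window end is at or
        # behind the watermark are treated as late and skipped.
        for t in batch:
            window_end = (t // WINDOW + 1) * WINDOW
            if window_end > watermark:
                counts[window_end] += 1

        # Operator 1, append-style output: emit only the windows that the
        # watermark has already passed, and drop them from the state.
        closed = sorted(end for end in counts if end <= watermark)
        out = [(end, counts.pop(end)) for end in closed]
        emitted.extend(out)

        # Operator 2: any downstream stateful operator. It checks the SAME
        # global watermark, so every row operator 1 just emitted looks late.
        accepted.extend(row for row in out if row[0] > watermark)

        # The watermark advances after the batch and never moves backwards.
        max_event_time = max([max_event_time] + batch)
        watermark = max(watermark, max_event_time - DELAY)

    return emitted, accepted


emitted, accepted = run([[1, 4, 8], [12, 17], [21, 26], [33]])
print("operator 1 emitted:", emitted)
print("operator 2 accepted:", accepted)
```

   In this model the first operator emits a window only once the watermark has 
passed the window end, which is exactly the condition under which the second 
operator treats that row as late. The second operator therefore accepts 
nothing, even though no input row arrived late.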

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
