HeartSaVioR commented on a change in pull request #24890: 
[SPARK-28074][DOC][SS] Document caveats on using multiple stateful operations 
in single query
URL: https://github.com/apache/spark/pull/24890#discussion_r295094018
 
 

 ##########
 File path: docs/structured-streaming-programming-guide.md
 ##########
 @@ -3146,6 +3146,17 @@ See [Input Sources](#input-sources) and [Output 
Sinks](#output-sinks) sections f
       - After `coalesce`, the number of (reduced) tasks will be kept unless 
another shuffle happens.
   - `spark.sql.streaming.stateStore.providerClass`: To read the previous state 
of the query properly, the class of state store provider should be unchanged.
   - `spark.sql.streaming.multipleWatermarkPolicy`: Modification of this would 
lead inconsistent watermark value when query contains multiple watermarks, 
hence the policy should be unchanged.
+- Structured Streaming uses `global watermark` which might impact on query 
having multiple stateful operations.
+  - Stateful operators: aggregation, deduplication, stream-stream join, 
(flat)mapGroupsWithState
+  - You should be able to answer below questions for your query to get correct 
outputs:
+    - How global watermark is calculated on your query?
+    - How global watermark is applied to each stateful operator?
+    - Is there any intermediate output being discarded as "late input" due to 
watermark?
+  - Fail to answer above questions might lead to incorrect outputs - e.g. 
intermediate outputs being discarded. 
 
 Review comment:
   > what type of incorrect usage may cause what problem
   
   Incorrect usages can't be collected out as some types - Spark allows 
arbitrary operations on state via map/flatMapGroupsWithState, so can't 
enumerate the overall usages and picks up incorrect usages. Users are open to 
encounter the situation whenever they're using multiple stateful operations in 
a query, according to their query.
   
   One symptom would be incorrect outputs as SPARK-26655 represents, but Spark 
doesn't provide any information to trace back (dig) the issue so I had to 
analyze the query via answering these questions for one of query in SPARK-26655.
   
   Maybe we shouldn't refer the case as "incorrect usage" even they've 
encountered, since the issue is coming from design and limitation on watermark 
in structured streaming. It's not their fault and they're not doing wrong: 
Spark just doesn't support it.
   
   As I commented below, maybe disallowing the cases (with guiding alternative) 
until we make it correct would be better approach, but I'd defer to 
committers/PMCs on how to handle this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to