HeartSaVioR commented on a change in pull request #24890:
[SPARK-28074][DOC][SS] Document caveats on using multiple stateful operations
in single query
URL: https://github.com/apache/spark/pull/24890#discussion_r295060594
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -3146,6 +3146,17 @@ See [Input Sources](#input-sources) and [Output
Sinks](#output-sinks) sections f
- After `coalesce`, the number of (reduced) tasks will be kept unless
another shuffle happens.
- `spark.sql.streaming.stateStore.providerClass`: To read the previous state
of the query properly, the class of state store provider should be unchanged.
- `spark.sql.streaming.multipleWatermarkPolicy`: Modification of this would
lead inconsistent watermark value when query contains multiple watermarks,
hence the policy should be unchanged.
+- Structured Streaming uses `global watermark` which might impact on query
having multiple stateful operations.
+ - Stateful operators: aggregation, deduplication, stream-stream join,
(flat)mapGroupsWithState
+ - You should be able to answer below questions for your query to get correct
outputs:
+ - How global watermark is calculated on your query?
+ - How global watermark is applied to each stateful operator?
+ - Is there any intermediate output being discarded as "late input" due to
watermark?
+ - Fail to answer above questions might lead to incorrect outputs - e.g.
intermediate outputs being discarded.
+ - One of "alternative" approach is breaking down your query into multiple
chained queries, each per stateful operation.
+ - Each query must guarantee "end-to-end" exactly once, otherwise
intermediate outputs can be duplicated which leads to incorrect outputs.
+- Only 'Append mode' can be "semantically" correct for a query having multiple
stateful operations.
+ - In 'Update mode', downstream stateful operator cannot distinguish whether
the input is new, or updated.
Review comment:
Honestly, I agree with @srowen about the screenshot: the change doesn't render
any UI element such as a table. It's just bullet points with proper
indentation, and if we add a screenshot to the PR description, I'd have to
update it every time I address review comments, which sounds redundant.
As for having too many bullet-pointed sentences, I agree that some of the
bullet points are unnecessary, so I'll merge them. Any feedback on nuance,
writing style, etc. is welcome, as I'm not a native English speaker.
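
For readers of the diff above, the "late input" caveat can be illustrated with a toy model. This is only a sketch under stated assumptions and is not Spark's implementation. It assumes a single global watermark that advances after each batch to (max event time seen - delay), an upstream windowed count that emits a window in Append mode only once the watermark passes the window end, and a downstream stateful operator that uses the window start as its event time and drops anything older than the same watermark. The names `run`, `WINDOW`, and `DELAY` are hypothetical.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window length in seconds (assumed value)
DELAY = 5    # watermark delay threshold in seconds (assumed value)


def run(batches):
    """Toy model of two chained stateful operators sharing one global watermark."""
    watermark = 0
    max_event_time = 0
    open_windows = defaultdict(int)  # window start -> count (state of operator 1)
    accepted, dropped = [], []

    for batch in batches:
        # Operator 1: windowed count. Rows older than the watermark are late.
        for t in batch:
            if t < watermark:
                continue
            open_windows[t - t % WINDOW] += 1
            max_event_time = max(max_event_time, t)

        # Append mode: a window is emitted only after the watermark passes its end.
        ready = sorted(s for s in open_windows if s + WINDOW <= watermark)
        emitted = [(s, open_windows.pop(s)) for s in ready]

        # Operator 2: applies the same global watermark to its input,
        # using the window start as the event time (an assumption of this model).
        for start, count in emitted:
            if start < watermark:
                dropped.append((start, count))
            else:
                accepted.append((start, count))

        # The watermark advances at the end of the batch and never moves backward.
        watermark = max(watermark, max_event_time - DELAY)

    return accepted, dropped


accepted, dropped = run([[1, 3, 8], [12, 17], [26], [40]])
print("accepted:", accepted)
print("dropped:", dropped)
```

In this model, a window is emitted only when its end is at or behind the watermark, so its start is always behind the watermark too, and the downstream operator discards every intermediate result. That is the kind of silent loss the three questions in the diff are meant to surface.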