HeartSaVioR commented on a change in pull request #24890:
[SPARK-28074][DOC][SS] Document caveats on using multiple stateful operations
in single query
URL: https://github.com/apache/spark/pull/24890#discussion_r295074607
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -3146,6 +3146,17 @@ See [Input Sources](#input-sources) and [Output
Sinks](#output-sinks) sections f
- After `coalesce`, the number of (reduced) tasks will be kept unless
another shuffle happens.
- `spark.sql.streaming.stateStore.providerClass`: To read the previous state
of the query properly, the class of state store provider should be unchanged.
- `spark.sql.streaming.multipleWatermarkPolicy`: Modification of this would
lead inconsistent watermark value when query contains multiple watermarks,
hence the policy should be unchanged.
+- Structured Streaming uses `global watermark` which might impact on query
having multiple stateful operations.
+ - Stateful operators: aggregation, deduplication, stream-stream join,
(flat)mapGroupsWithState
+ - You should be able to answer below questions for your query to get correct
outputs:
+ - How global watermark is calculated on your query?
+ - How global watermark is applied to each stateful operator?
+ - Is there any intermediate output being discarded as "late input" due to
watermark?
+ - Fail to answer above questions might lead to incorrect outputs - e.g.
intermediate outputs being discarded.
Review comment:
https://issues.apache.org/jira/browse/SPARK-28094 is an example of the problem:
I found the correctness issue while trying to help resolve it on the user
mailing list.
> If my answers are what then what do I do?
The questions are meant for end users to ask themselves whether they fully
understand the details of the watermark, because it can cause rows to be
"unexpectedly" discarded between stateful operators, which is a correctness
issue.
So it's like an exam to keep end users from "just doing it because Spark allows
it" without fully understanding what they're doing. If they can't answer the
questions for their query, they should find an alternative approach.
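As a rough illustration of the kind of discard I mean, here is a toy model in
plain Python. It is not Spark code and does not use any Spark API: the function
names, the window sizes, and the simplified late-input rule (drop a row once its
window end is at or behind the watermark) are all assumptions made for the
sketch. It chains two windowed counts that share one global watermark, which
only advances between micro-batches.

```python
from collections import defaultdict

INNER = 10  # window length of the first aggregation (assumed, in seconds)
OUTER = 20  # window length of the second aggregation (assumed, in seconds)


def window_end(ts, size):
    """End of the tumbling window of the given size that contains ts."""
    return ts - ts % size + size


def run_chained_query(batches, delay):
    """Toy model of two chained windowed counts sharing one global watermark."""
    watermark = 0
    max_event_time = 0
    open_windows = defaultdict(int)  # state of the first aggregation
    accepted, discarded = [], []     # fate of rows reaching the second one

    for batch in batches:
        for ts in batch:
            max_event_time = max(max_event_time, ts)
            if window_end(ts, INNER) <= watermark:
                continue  # late input for the first aggregation
            open_windows[window_end(ts, INNER) - INNER] += 1

        # First aggregation emits every window the watermark has closed.
        for start in sorted(open_windows):
            if start + INNER <= watermark:
                row = (start, open_windows.pop(start))
                # The second aggregation checks the same global watermark,
                # so an intermediate row can already be "late" on arrival.
                if window_end(start, OUTER) <= watermark:
                    discarded.append(row)
                else:
                    accepted.append(row)

        # The watermark moves only at the end of the micro-batch.
        watermark = max(watermark, max_event_time - delay)

    return accepted, discarded
```

With `run_chained_query([[1, 4, 8], [12, 15], [27], [31]], delay=5)`, the
inner window starting at 0 reaches the second aggregation, but the one starting
at 10 is emitted only after the watermark has passed 20, so the second
aggregation treats it as late and its count of 2 is lost from the outer window.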
> what do you mean fail to answer the question?
Literally. If they are unsure about any part of their answers to these
questions, they fail. I know that's hard on end users, but it's better than
discovering incorrect outputs in a production environment.
> can this be more direct? what's an example of a problem, cause and
solution?
I'm happy to do that, but I'm not sure how much detail to go into. If we want
a detailed explanation with an example, it may be better to give it its own
chapter; it could be long enough for a blog post. This is an advanced topic
(the book "Streaming Systems" devotes a chapter to watermarks), so it may not
be feasible to cover the details in the guide page, but it is important enough
that we should at least mention it. That's the dilemma.
To be honest, there's no "solution", only workarounds, because the problem
comes from missing features: watermark propagation and per-operator watermarks
for stateful operators. https://issues.apache.org/jira/browse/SPARK-26655 tries
to address this, but it hasn't seen much effort due to lack of interest.