HeartSaVioR commented on a change in pull request #24890: [SPARK-28074][SS] Log warn message on possible correctness issue for multiple stateful operations in single query URL: https://github.com/apache/spark/pull/24890#discussion_r329247315
########## File path: docs/structured-streaming-programming-guide.md ########## @@ -1647,6 +1647,27 @@ For example, sorting on the input stream is not supported, as it requires keepin track of all the data received in the stream. This is therefore fundamentally hard to execute efficiently. +### Limitation of global watermark + +In some circumstance, some stateful operations could emit rows older than current watermark (with allowed delay), Review comment: I think we can just remove "In some circumstance", as that's expected behavior as always. * Streaming aggregation in Append mode: only evicted rows are emitted by nature, as it should wait for inputs until watermark passes by, to ensure there's no more rows to aggregate with such key. * Outer Join (any mode): it basically emits matched rows, but it also emits evicted rows if the row haven't matched with other side of row. That's what outer join is, so it emits evicted rows conditionally, but mostly expected. * FlatMapGroupsWithState in Append mode: it strongly depends on the implementation of state function, but if state function respects the semantic of Append mode, it could high likely emit evicted rows to ensure there's no further input rows to affect emitted rows. In fact, emitting evicted rows are tied to the semantic of Append mode. > how can users know if this affects them? We'll log warning message during checking unsupported operations. We can easily change the behavior to block the query as unsupported as well, so it's up to the community (mostly committers/PMCs) decision. > is it what is described below? Yes. And with SPARK-24634 we can measure the number of late rows per stateful operator in runtime. (For now there's no information on late rows. Spark discards them with no metrics.) First stateful operator could have late rows, but following state operators shouldn't. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
