HeartSaVioR commented on a change in pull request #24890: [SPARK-28074][SS] Log 
warn message on possible correctness issue for multiple stateful operations in 
single query
URL: https://github.com/apache/spark/pull/24890#discussion_r329247315
 
 

 ##########
 File path: docs/structured-streaming-programming-guide.md
 ##########
 @@ -1647,6 +1647,27 @@ For example, sorting on the input stream is not 
supported, as it requires keepin
 track of all the data received in the stream. This is therefore fundamentally 
hard to execute 
 efficiently.
 
+### Limitation of global watermark
+
+In some circumstance, some stateful operations could emit rows older than 
current watermark (with allowed delay),
 
 Review comment:
   I think we can just remove "In some circumstance", as that's expected 
behavior as always.
   
   * Streaming aggregation in Append mode: only evicted rows are emitted by 
nature, as it should wait for inputs until watermark passes by, to ensure 
there's no more rows to aggregate with such key.
   * Outer Join (any mode): it basically emits matched rows, but it also emits 
evicted rows if the row haven't matched with other side of row. That's what 
outer join is, so it emits evicted rows conditionally, but mostly expected.
   * FlatMapGroupsWithState in Append mode: it strongly depends on the 
implementation of state function, but if state function respects the semantic 
of Append mode, it could high likely emit evicted rows to ensure there's no 
further input rows to affect emitted rows.
   
   In fact, emitting evicted rows are tied to the semantic of Append mode.
   
   > how can users know if this affects them? 
   
   We'll log warning message during checking unsupported operations. We can 
easily change the behavior to block the query as unsupported as well, so it's 
up to the community (mostly committers/PMCs) decision.
   
   > is it what is described below?
   
   Yes. And with SPARK-24634 we can measure the number of late rows per 
stateful operator in runtime. (For now there's no information on late rows. 
Spark discards them with no metrics.) First stateful operator could have late 
rows, but following state operators shouldn't.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to