Jungtaek Lim created SPARK-42376:
------------------------------------

             Summary: Introduce watermark propagation among operators
                 Key: SPARK-42376
                 URL: https://issues.apache.org/jira/browse/SPARK-42376
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.5.0
            Reporter: Jungtaek Lim


With introduction of SPARK-40925, we enabled workloads containing multiple 
stateful operators in a single streaming query.

The JIRA ticket clearly described out-of-scope, "Here we propose fixing the 
late record filtering in stateful operators to allow chaining of stateful 
operators {*}which do not produce delayed records (like time-interval join or 
potentially flatMapGroupsWithState){*}".

We identified production use case for stream-stream time-interval join followed 
by stateful operator (e.g. window aggregation), and propose to address such use 
case via this ticket.

The design will be described in the PR, but the sketched idea is introducing 
simulation of watermark propagation among operators. As of now, Spark considers 
all stateful operators to have same input watermark and output watermark, which 
introduced the limitation. With this ticket, we construct the logic to simulate 
watermark propagation so that each operator can have its own (input watermark, 
output watermark). Operators introducing delayed records will produce delayed 
output watermark, and downstream operator can take the delay into account as 
input watermark will be adjusted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to