Jungtaek Lim created SPARK-50144:
------------------------------------

             Summary: Address the limitation of metrics calculation with DSv1 
source in streaming query
                 Key: SPARK-50144
                 URL: https://issues.apache.org/jira/browse/SPARK-50144
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 4.0.0
            Reporter: Jungtaek Lim


In streaming query, we calculate the number of output rows per stream, via 
collecting the metric from the source nodes in the executed plan.

For DSv2 data sources, the source nodes in the executed plan are always 
MicroBatchScanExec, and these nodes contain the stream information.

But for DSv1 data sources, the logical node and the physical node representing 
the scan of the source are technically arbitrary (any logical node and any 
physical node), hence Spark makes an assumption that the leaf nodes for initial 
logical plan <=> logical plan for batch N <=> physical plan for batch N are the 
same so that we can associate these nodes. This is fragile and we have 
non-trivial number of reports of broken metric.

This ticket aims to address the limitation for DSv1 streaming source; the idea 
is to scope the logical/physical nodes to the "widely-used set" and pass the 
stream information into these nodes, so that we can use the same approach of 
calculating metrics with DSv2 to DSv1 streaming sources.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to