Jungtaek Lim created SPARK-50144:
------------------------------------
Summary: Address the limitation of metrics calculation with DSv1
source in streaming query
Key: SPARK-50144
URL: https://issues.apache.org/jira/browse/SPARK-50144
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Jungtaek Lim
In streaming query, we calculate the number of output rows per stream, via
collecting the metric from the source nodes in the executed plan.
For DSv2 data sources, the source nodes in the executed plan are always
MicroBatchScanExec, and these nodes contain the stream information.
But for DSv1 data sources, the logical node and the physical node representing
the scan of the source are technically arbitrary (any logical node and any
physical node), hence Spark makes an assumption that the leaf nodes for initial
logical plan <=> logical plan for batch N <=> physical plan for batch N are the
same so that we can associate these nodes. This is fragile and we have
non-trivial number of reports of broken metric.
This ticket aims to address the limitation for DSv1 streaming source; the idea
is to scope the logical/physical nodes to the "widely-used set" and pass the
stream information into these nodes, so that we can use the same approach of
calculating metrics with DSv2 to DSv1 streaming sources.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]