[
https://issues.apache.org/jira/browse/SPARK-50144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jungtaek Lim resolved SPARK-50144.
----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 48676
[https://github.com/apache/spark/pull/48676]
> Address the limitation of metrics calculation with DSv1 source in streaming
> query
> ---------------------------------------------------------------------------------
>
> Key: SPARK-50144
> URL: https://issues.apache.org/jira/browse/SPARK-50144
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 4.0.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In streaming query, we calculate the number of output rows per stream, via
> collecting the metric from the source nodes in the executed plan.
> For DSv2 data sources, the source nodes in the executed plan are always
> MicroBatchScanExec, and these nodes contain the stream information.
> But for DSv1 data sources, the logical node and the physical node
> representing the scan of the source are technically arbitrary (any logical
> node and any physical node), hence Spark makes an assumption that the leaf
> nodes for initial logical plan <=> logical plan for batch N <=> physical plan
> for batch N are the same so that we can associate these nodes. This is
> fragile and we have non-trivial number of reports of broken metric.
> This ticket aims to address the limitation for DSv1 streaming source; the
> idea is to scope the logical/physical nodes to the "widely-used set" and pass
> the stream information into these nodes, so that we can use the same approach
> of calculating metrics with DSv2 to DSv1 streaming sources.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]