[jira] [Resolved] (SPARK-50144) Address the limitation of metrics calculation with DSv1 source in streaming query

Jungtaek Lim (Jira) Wed, 30 Oct 2024 14:36:04 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-50144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jungtaek Lim resolved SPARK-50144.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 48676
[https://github.com/apache/spark/pull/48676]

> Address the limitation of metrics calculation with DSv1 source in streaming 
> query
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-50144
>                 URL: https://issues.apache.org/jira/browse/SPARK-50144
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 4.0.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> In a streaming query, we calculate the number of output rows per stream by 
> collecting the metric from the source nodes in the executed plan.
> For DSv2 data sources, the source nodes in the executed plan are always 
> MicroBatchScanExec, and these nodes contain the stream information.
> For DSv1 data sources, however, the logical node and the physical node 
> representing the scan of the source are technically arbitrary (any logical 
> node and any physical node). Spark therefore assumes that the leaf nodes 
> are the same across the initial logical plan, the logical plan for batch N, 
> and the physical plan for batch N, so that these nodes can be associated. 
> This assumption is fragile, and there have been a non-trivial number of 
> reports of broken metrics.
> This ticket aims to address the limitation for DSv1 streaming sources; the 
> idea is to scope the logical/physical nodes to the "widely-used set" and pass 
> the stream information into these nodes, so that the approach used to 
> calculate metrics for DSv2 sources can also be applied to DSv1 streaming 
> sources.
>  
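
The difference between the two association strategies described above can be illustrated with a toy model. This is only a sketch and not Spark's code: ScanNode, rows_by_position, rows_by_stream_info, and the stream names are hypothetical, and the extra leaf in the example is an assumed case in which the physical plan has one more leaf than the initial logical plan.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScanNode:
    """Toy stand-in for a physical leaf node (hypothetical, not a Spark class)."""
    name: str
    num_output_rows: int
    stream: Optional[str] = None  # stream information carried by the node, if any


def rows_by_position(initial_sources: List[str],
                     leaves: List[ScanNode]) -> Dict[str, int]:
    """Positional association: assumes the leaves match the sources one by one."""
    if len(initial_sources) != len(leaves):
        # The assumption does not hold, so no per-stream number can be derived.
        return {}
    return {src: leaf.num_output_rows
            for src, leaf in zip(initial_sources, leaves)}


def rows_by_stream_info(leaves: List[ScanNode]) -> Dict[str, int]:
    """Tagged association: each leaf reports for the stream it carries."""
    totals: Dict[str, int] = {}
    for leaf in leaves:
        if leaf.stream is not None:
            totals[leaf.stream] = totals.get(leaf.stream, 0) + leaf.num_output_rows
    return totals


sources = ["streamA", "streamB"]
# Assumed batch: the physical plan contains an extra leaf with no stream info.
leaves = [
    ScanNode("scan-1", 10, stream="streamA"),
    ScanNode("extra-leaf", 3),
    ScanNode("scan-2", 5, stream="streamB"),
]

positional = rows_by_position(sources, leaves)
tagged = rows_by_stream_info(leaves)
print(positional)
print(tagged)
```

In this toy case the positional approach yields no per-stream metrics because the leaf counts differ, while the tagged approach still attributes rows to each stream.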



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
