HeartSaVioR opened a new pull request, #48517: URL: https://github.com/apache/spark/pull/48517
### What changes were proposed in this pull request?

This PR proposes to provide default values for metrics defined via the observe API when the physical node (`CollectMetricsExec`) is lost from the executed plan. This includes the case where the logical node (`CollectMetrics`) is lost during optimization (which is the most common case).

### Why are the changes needed?

When a user defines metrics via the observe API, they expect the metrics to be retrievable via `Observation` (batch query) or the update event of `StreamingQueryListener`. But when the node (`CollectMetrics`) is lost for any reason (e.g. the subtree is pruned by `PruneFilters`), Spark behaves as if the metrics were never defined, instead of providing default values.

When the query runs successfully, users do not expect a metric bound to the query to be unavailable, so they may fail to guard their code against this case and run into issues. Arguably it is much better to provide default values: when the node is pruned out by the optimizer, the result is mostly logically equivalent to the node processing no input (barring a bug in the analyzer/optimizer/etc. that drops the node incorrectly), hence it is valid to simply report default values.

### Does this PR introduce _any_ user-facing change?

Yes, users can consistently query the metrics defined with the observe API. The metrics are available even under aggressive optimization that drops the `CollectMetrics(Exec)` node.

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
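The scenario this PR addresses can be sketched as below. This is an illustrative example, not code from the PR: the metric names and the `filter(lit(false))` trigger are assumptions chosen so that `PruneFilters` replaces the subtree (including `CollectMetrics`) with an empty relation.

```scala
import org.apache.spark.sql.{Observation, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val observation = Observation("my_metrics")

// A literal-false filter sits above the observe node, so PruneFilters
// can replace the whole subtree with an empty relation, dropping the
// CollectMetrics node along with it.
val df = Seq(1, 2, 3).toDF("value")
  .observe(observation, count(lit(1)).as("rows"), sum($"value").as("total"))
  .filter(lit(false))

df.collect()

// Before this change, the observed metrics could be missing entirely,
// as if observe() had never been called. With this change, default
// values are reported, consistent with the node processing no input.
println(observation.get)
```

With the fix, user code that reads `observation.get` (or `QueryProgressEvent.progress.observedMetrics` in streaming) no longer needs a special guard for the case where the optimizer dropped the node.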
