HeartSaVioR opened a new pull request, #48517:
URL: https://github.com/apache/spark/pull/48517

   ### What changes were proposed in this pull request?
   
   This PR proposes to provide default values for metrics defined via the observe 
API when the physical node (CollectMetricsExec) is missing from the executed plan. 
This includes the case where the logical node (CollectMetrics) is dropped during 
optimization, which is the most common case.
   
   ### Why are the changes needed?
   
   When a user defines metrics via the observe API, they expect the metrics to be 
retrievable via Observation (batch query) or via the update event of 
StreamingQueryListener.
   
   But when the node (CollectMetrics) is lost for any reason (e.g. the subtree is 
pruned by PruneFilters), Spark behaves as if the metrics were never defined, 
instead of providing default values.
   
   When the query runs successfully, users would not expect metrics bound to the 
query to be unavailable, so they tend not to guard their code for this case and 
run into issues. Arguably it is much better to provide default values: when the 
optimizer prunes the node, it is almost always logically equivalent to the node 
having processed no input (barring bugs in the analyzer/optimizer/etc. that drop 
the node incorrectly), so returning default values is valid.
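   The fallback idea can be sketched as follows. This is a plain-Python illustration of the proposed behavior, not Spark's actual implementation; the function name and the shape of the metric maps are hypothetical:

   ```python
   def resolve_observed_metrics(collected, defined):
       """Return a metrics row for every observation defined on the query.

       `collected` maps observation name -> metrics actually produced by
       CollectMetricsExec at runtime; `defined` maps observation name ->
       default metrics (e.g. count-style aggregates default to 0, other
       aggregates to None). Observations whose physical node was pruned
       from the executed plan fall back to the defaults, so callers never
       see a defined observation as simply missing.
       """
       return {name: collected.get(name, default)
               for name, default in defined.items()}

   # The observation was defined on the query, but the optimizer pruned
   # the subtree containing CollectMetrics, so nothing was collected.
   defined = {"event_stats": {"cnt": 0, "max_value": None}}
   collected = {}
   print(resolve_observed_metrics(collected, defined))
   # {'event_stats': {'cnt': 0, 'max_value': None}}
   ```

   With this behavior, user code that reads the metrics after a successful run always finds an entry for each defined observation, whether or not the node survived optimization.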
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, users can now consistently query metrics defined via the observe API. The 
metrics are available even under aggressive optimizations that drop the 
CollectMetrics(Exec) node.
   
   ### How was this patch tested?
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
