linhongliu-db opened a new pull request #35856: URL: https://github.com/apache/spark/pull/35856
### What changes were proposed in this pull request?

The Spark UI currently lacks troubleshooting abilities. For example:

* AQE plan changes are not available
* the plan description of a large plan is truncated

This is because the live UI depends on an in-memory KV store, and adding more information to that store risks stability issues. It is therefore better to add a disk-based store that can hold more information.

This PR includes:

* A disk-based KV store in AppStatusStore that allows adding information that does not fit in memory
* A separate listener that collects diagnostic data and saves it to the disk store
* A new REST API endpoint to expose the diagnostics data (AQE plan changes, untruncated plan)

### Why are the changes needed?

I made a [doc](https://docs.google.com/document/d/1tQMx278fRcelErv_qP1ovMh69W-HdDcgYzJ0L4EqDQ4/edit#heading=h.1n7jwkww4m9w) to summarize the observability issues in Spark. Among all the issues, the troubleshooting ability is the most urgent, because without it, it is hard to debug AQE-related issues. Once we solve the blockers, we can make a long-term plan to improve observability.

### Does this PR introduce _any_ user-facing change?

Yes, a new REST API that exposes more information about the application.
REST API endpoint: http://localhost:4040/api/v1/applications/local-1647312132944/diagnostics/0

Example:

```
$ ./bin/spark-shell --conf spark.appStatusStore.diskStore.dir=/tmp/diskstore
spark-shell> val df = sql(
  """SELECT t1.*, t2.c, t3.d
    | FROM (SELECT 1 as a, 'b' as b) t1
    | JOIN (SELECT 1 as a, 'c' as c) t2
    | ON t1.a = t2.a
    | JOIN (SELECT 1 as a, 'd' as d) t3
    | ON t2.a = t3.a
    |""".stripMargin)
df.show()
```

Output:

```json
{
  "id" : 0,
  "physicalPlan" : "<plan description string>",
  "submissionTime" : "2022-03-15T03:41:42.226GMT",
  "completionTime" : "2022-03-15T03:41:43.387GMT",
  "errorMessage" : "",
  "planChanges" : [ {
    "physicalPlan" : "<plan description string>",
    "updateTime" : "2022-03-15T03:41:42.268GMT"
  }, {
    "physicalPlan" : "<plan description string>",
    "updateTime" : "2022-03-15T03:41:43.262GMT"
  } ]
}
```

### How was this patch tested?

Manually tested.
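For readers who want to poke at the diagnostics endpoint described above, here is a minimal sketch of how it might be consumed from a Scala REPL. The helper names (`diagnosticsUrl`, `fetchDiagnostics`, `millisBetween`) are hypothetical and not part of this PR; the URL shape and the timestamp layout are simply taken from the example endpoint and sample output in this description.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.io.Source

// Hypothetical helper: builds a URL in the shape shown in the PR description,
// <ui-base>/api/v1/applications/<app-id>/diagnostics/<execution-id>.
def diagnosticsUrl(uiBase: String, appId: String, executionId: Long): String =
  s"$uiBase/api/v1/applications/$appId/diagnostics/$executionId"

// Hypothetical helper: returns the raw JSON body as a string.
// Assumes an application UI is reachable at the given URL.
def fetchDiagnostics(url: String): String = {
  val source = Source.fromURL(url, "UTF-8")
  try source.mkString finally source.close()
}

// Timestamp layout as it appears in the sample output,
// e.g. 2022-03-15T03:41:42.226GMT (the trailing GMT is treated as a literal).
val timestampFormat: DateTimeFormatter =
  DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'GMT'")

// Milliseconds elapsed between two timestamps in that layout.
def millisBetween(start: String, end: String): Long =
  Duration.between(
    LocalDateTime.parse(start, timestampFormat),
    LocalDateTime.parse(end, timestampFormat)
  ).toMillis
```

Applied to the sample output, `millisBetween` puts the two plan changes at 42 ms and 1036 ms after `submissionTime`, and the whole execution at 1161 ms. Calling `fetchDiagnostics(diagnosticsUrl("http://localhost:4040", "local-1647312132944", 0))` would return the JSON shown above only while that application is running.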
