linhongliu-db opened a new pull request #35856:
URL: https://github.com/apache/spark/pull/35856


   ### What changes were proposed in this pull request?
   In Spark, the UI lacks troubleshooting abilities. For example:
   * AQE plan changes are not available
   * The plan description of a large plan is truncated

   This is because the live UI depends on an in-memory KV store, so adding more
   information to the store always raises stability concerns. Therefore, it's better to
   add a disk-based store to save more information.
   
   This PR includes:
   * A disk-based KV store in AppStatusStore that allows adding information that does not fit in memory
   * A separate listener that collects diagnostic data and saves it to the disk store
   * A new REST API endpoint to expose the diagnostic data (AQE plan changes, untruncated plan)
   
   ### Why are the changes needed?
   I made a [doc](https://docs.google.com/document/d/1tQMx278fRcelErv_qP1ovMh69W-HdDcgYzJ0L4EqDQ4/edit#heading=h.1n7jwkww4m9w) to summarize the observability issues in Spark.
   Among all the issues, the troubleshooting ability is the most urgent, because without it, it's hard to
   debug AQE-related issues. Once we solve the blockers, we can make a long-term plan to improve
   observability.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, a new REST API that exposes more information about the application.
   REST API endpoint: http://localhost:4040/api/v1/applications/local-1647312132944/diagnostics/0
   Example:
   ```
   $ ./bin/spark-shell --conf spark.appStatusStore.diskStore.dir=/tmp/diskstore
   spark-shell>
   val df = sql(
     """SELECT t1.*, t2.c, t3.d
       |  FROM (SELECT 1 as a, 'b' as b) t1
       |  JOIN (SELECT 1 as a, 'c' as c) t2
       |  ON t1.a = t2.a
       |  JOIN (SELECT 1 as a, 'd' as d) t3
       |  ON t2.a = t3.a
       |""".stripMargin)
   df.show()
   ```
   Output:
   ```json
   {
     "id" : 0,
     "physicalPlan" : "<plan description string>",
     "submissionTime" : "2022-03-15T03:41:42.226GMT",
     "completionTime" : "2022-03-15T03:41:43.387GMT",
     "errorMessage" : "",
     "planChanges" : [ {
       "physicalPlan" : "<plan description string>",
       "updateTime" : "2022-03-15T03:41:42.268GMT"
     }, {
       "physicalPlan" : "<plan description string>",
       "updateTime" : "2022-03-15T03:41:43.262GMT"
     } ]
   }
   ```
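
   The endpoint above can also be queried from the same spark-shell session. The following is a minimal sketch, assuming the application is still running, the UI is reachable at the address shown in the example, and the path follows the `/api/v1/applications/<app-id>/diagnostics/<execution-id>` pattern of that example URL. `diagnosticsUrl` and `fetchDiagnostics` are hypothetical helper names used for illustration only; they are not part of this PR.

   ```scala
   import scala.io.Source

   // Hypothetical helper: builds the diagnostics URL from its parts.
   // The path layout is copied from the example endpoint in this description.
   def diagnosticsUrl(uiBase: String, appId: String, executionId: Long): String =
     s"${uiBase.stripSuffix("/")}/api/v1/applications/$appId/diagnostics/$executionId"

   // Hypothetical helper: returns the raw JSON body. Needs a live Spark UI to succeed.
   def fetchDiagnostics(url: String): String = {
     val src = Source.fromURL(url, "UTF-8")
     try src.mkString finally src.close()
   }

   val url = diagnosticsUrl("http://localhost:4040", "local-1647312132944", 0)
   // println(fetchDiagnostics(url))  // uncomment while the application is running
   ```

   In spark-shell, `sc.uiWebUrl` (an `Option[String]`) and `sc.applicationId` can supply the first two arguments instead of the hard-coded values.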
   
   ### How was this patch tested?
   Manually tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


