[ https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17257844#comment-17257844 ]

Ron Hu edited comment on SPARK-26399 at 1/3/21, 7:44 PM:
---------------------------------------------------------

[~angerszhuuu] found that the "executorSummary" field already exists in the stage 
REST API output.  In the existing stage json file, the "executorSummary" field 
contains a list of executor metrics for all executors used in a given stage.  
In addition to the detailed metrics for each executor, we also need the 
percentile distribution across the executors, because the percentile 
information tells us how severe a skew problem is.  For example, we compute the 
ratio of the maximum value to the median value and the ratio of the maximum 
value to the 75th-percentile value.  If the max-over-median ratio reaches 5, 
there is a skew issue; if the max-over-75th-percentile ratio reaches 5, there 
is a really bad skew issue.
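As a rough illustration only (this is a sketch, not code in Spark; the [min, 25th, median, 75th, max] quantile layout and the shuffleRead values are taken from the attached executorMetricsSummary.json), the two ratios can be computed like this:

{code:python}
# Sketch: compute skew ratios from a [min, p25, median, p75, max] quantile
# array, e.g. the "shuffleRead" entry of the proposed per-stage summary.
def skew_ratios(quantiles):
    _min, p25, median, p75, _max = quantiles
    max_over_median = _max / median if median else float("inf")
    max_over_p75 = _max / p75 if p75 else float("inf")
    return max_over_median, max_over_p75

# shuffleRead quantiles from the sample output attached to this issue
shuffle_read = [0.0, 2.50967876E8, 7.50516665E8, 7.51114124E8, 1.001617709E9]
m_over_med, m_over_p75 = skew_ratios(shuffle_read)
# Both ratios are about 1.3 here, well under the 5x thresholds, so this
# sample stage shows no shuffle-read skew.
print(m_over_med, m_over_p75)
{code}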

In the attached image file, you can see a sample of the "Summary Metrics for 
Executors" for a stage.  Its corresponding REST API output can look something 
like:

<<<< attach a json file here >>>>. 

Since the field name "executorSummary" already exists, we should change this 
REST API endpoint name.  We may change it to "executorMetricsSummary".  The new 
REST API can be:

http://<spark history server>:18080/api/v1/applications/<application_id>/<application_attempt>/stages/<stage_id>/<stage_attempt>/executorMetricsSummary
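For completeness, here is a minimal client-side sketch (Python; hypothetical, since the executorMetricsSummary endpoint is only proposed above and the angle-bracket placeholders must be replaced with a real history server host and application/stage ids) showing how the 5x skew check could be applied to every metric in the response:

{code:python}
# Sketch: query the *proposed* executorMetricsSummary endpoint and flag
# metrics whose max/median ratio reaches the 5x skew threshold.
import json
from urllib.request import urlopen

# Placeholders (<...>) must be replaced before this can actually run.
url = ("http://<spark history server>:18080/api/v1/applications/"
       "<application_id>/<application_attempt>/stages/"
       "<stage_id>/<stage_attempt>/executorMetricsSummary")

with urlopen(url) as resp:
    summary = json.load(resp)

for metric, quantiles in summary.items():
    if metric == "quantiles":
        continue
    _min, p25, median, p75, _max = quantiles
    if median and _max / median >= 5:
        print(f"possible skew in {metric}: max/median = {_max / median:.1f}")
{code}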


was (Author: ron8hu):
[~angerszhuuu] found that the "executorSummary" field already exists in the stage 
REST API output.  In the existing stage json file, the "executorSummary" field 
contains a list of executor metrics for all executors used in a given stage.  
In addition to the detailed metrics for each executor, we also need the 
percentile distribution across the executors, because the percentile 
information tells us how severe a skew problem is.  For example, we compute the 
ratio of the maximum value to the median value and the ratio of the maximum 
value to the 75th-percentile value.  If the max-over-median ratio reaches 5, 
there is a skew issue; if the max-over-75th-percentile ratio reaches 5, there 
is a really bad skew issue.

In the attached image file, you can see a sample of the "Summary Metrics for 
Executors" for a stage.  Its corresponding REST API output can look something 
like:

{
  "quantiles" : [ 0.0, 0.25, 0.5, 0.75, 1.0 ],
  "numTasks" : [ 1.0, 1.0, 3.0, 3.0, 4.0 ],
  "inputBytes" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "inputRecords" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "outputBytes" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "outputRecords" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "shuffleRead" : [ 0.0, 2.50967876E8, 7.50516665E8, 7.51114124E8, 1.001617709E9 ],
  "shuffleReadRecords" : [ 0.0, 740880.0, 2215608.0, 2217351.0, 2957194.0 ],
  "shuffleWrite" : [ 0.0, 2.3658701E8, 7.07482405E8, 7.08012783E8, 9.44322243E8 ],
  "shuffleWriteRecords" : [ 0.0, 726968.0, 2174281.0, 2176014.0, 2902184.0 ],
  "memoryBytesSpilled" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "diskBytesSpilled" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "peakJVMHeapMemory" : [ 2.09883992E8, 4.6213568E8, 7.5947948E8, 9.8473656E8, 9.8473656E8 ],
  "peakJVMOffHeapMemory" : [ 6.0829472E7, 6.1343616E7, 6.271752E7, 9.1926448E7, 9.1926448E7 ],
  "peakOnHeapExecutionMemory" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "peakOffHeapExecutionMemory" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "peakOnHeapStorageMemory" : [ 7023.0, 12537.0, 19560.0, 19560.0, 19560.0 ],
  "peakOffHeapStorageMemory" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "peakOnHeapUnifiedMemory" : [ 7023.0, 12537.0, 19560.0, 19560.0, 19560.0 ],
  "peakOffHeapUnifiedMemory" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "peakDirectPoolMemory" : [ 10742.0, 10865.0, 12781.0, 157182.0, 157182.0 ],
  "peakMappedPoolMemory" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "peakProcessTreeJVMVMemory" : [ 8.296026112E9, 9.678606336E9, 9.684373504E9, 9.691553792E9, 9.691553792E9 ],
  "peakProcessTreeJVMRSSMemory" : [ 5.26491648E8, 7.03639552E8, 9.64222976E8, 1.210867712E9, 1.210867712E9 ]
}

 

Since the field name "executorSummary" already exists, we should change this 
REST API endpoint name.  We may change it to "executorMetricsSummary".  The new 
REST API can be:

http://<spark history server>:18080/api/v1/applications/<application_id>/<application_attempt>/stages/<stage_id>/<stage_attempt>/executorMetricsSummary

> Add new stage-level REST APIs and parameters
> --------------------------------------------
>
>                 Key: SPARK-26399
>                 URL: https://issues.apache.org/jira/browse/SPARK-26399
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.1.0
>            Reporter: Edward Lu
>            Priority: Major
>         Attachments: executorMetricsSummary.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http://<spark history server>:18080/api/v1/applications/<application id>/<application attempt>/stages/<stage id>/<stage attempt>/executorSummary
> {code}
> Add parameters to the stages REST API to specify:
> *  filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
> * task metric quantiles, and adding the task summary if specified
> * executor metric quantiles, and adding the executor summary if specified


