[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260805#comment-17260805
 ] 

Ron Hu edited comment on SPARK-26399 at 1/7/21, 10:23 PM:
----------------------------------------------------------

Dr. Elephant ([https://github.com/linkedin/dr-elephant]) is a downstream open 
source product that consumes Spark monitoring information and advises Spark 
users on how to tune configuration parameters such as memory usage, number of 
cores, etc.  Because the initial description of this ticket is too brief to be 
clear, let me explain the use cases for Dr. Elephant here. 

REST API /applications/[app-id]/stages: This useful endpoint generates a JSON 
file containing all stages for a given application.  The current Spark version 
already provides it.
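
As a sketch of how a downstream product like Dr. Elephant consumes this endpoint (the host, port, and application id below are placeholders, not values from this ticket):

```python
import json
import urllib.request


def stages_url(base_url: str, app_id: str) -> str:
    """Build the URL of the existing stages endpoint.

    base_url and app_id are placeholders; substitute your history
    server address and a real application id.
    """
    return f"{base_url}/api/v1/applications/{app_id}/stages"


def fetch_stages(base_url: str, app_id: str) -> list:
    """Fetch and decode the JSON list of all stages for an application."""
    with urllib.request.urlopen(stages_url(base_url, app_id)) as resp:
        return json.load(resp)


# Example (assumes a history server at the default port 18080):
# stages = fetch_stages("http://historyserver:18080", "app-20210107-0001")
```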

To debug whether a skew issue exists, a downstream product also needs:
 - taskMetricsSummary: task metric information such as executorRunTime, 
inputMetrics, outputMetrics, shuffleReadMetrics, etc., all as quantile 
distributions (0.0, 0.25, 0.5, 0.75, 1.0) over all the tasks in a given stage.  
The same information shows up in the Web UI for a specified stage.

 - executorMetricsSummary: executor metric information such as number of tasks, 
input bytes, peak JVM memory, peak execution memory, etc., all as quantile 
distributions (0.0, 0.25, 0.5, 0.75, 1.0) over all the executors used in a 
given stage.  This information has been developed by [~angerszhuuu] in the PR 
he submitted.
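
A minimal sketch of what such a quantile summary computes: given the per-task values of one metric (executorRunTime is used below), the five quantiles (0.0, 0.25, 0.5, 0.75, 1.0) are the min, quartiles, and max of the distribution. The helper name and the linear-interpolation choice are illustrative assumptions, not the actual Spark implementation.

```python
def quantile_summary(values, quantiles=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return the requested quantiles of a list of per-task metric values.

    Uses linear interpolation between order statistics; the real Spark
    implementation may differ in interpolation details.
    """
    xs = sorted(values)
    out = []
    for q in quantiles:
        pos = q * (len(xs) - 1)       # fractional index into sorted values
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        frac = pos - lo
        out.append(xs[lo] * (1 - frac) + xs[hi] * frac)
    return out


# e.g. executorRunTime (ms) for the tasks of one stage, with one straggler:
run_times = [120, 130, 125, 118, 122, 900]
print(quantile_summary(run_times))
```

The wide gap between the 0.75 quantile and the max in the output is exactly the signal a skew analysis looks for.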

We can add the above information to the JSON file generated by 
/applications/[app-id]/stages. It may double the size of the stages endpoint's 
output, but that should be fine because the current stages JSON file is not 
that big.  Here is one sample JSON file for the stages endpoint: 
[^lispark230_restapi_ex2_stages_withSummaries.json]

An alternative approach is to add a new REST API such as 
"/applications/[app-id]/stages/withSummaries", but that would need a little 
more code for a new endpoint.
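
To show how a consumer would use either endpoint's summaries for skew detection, here is a rough heuristic comparing the max to the median of the per-task distribution. The 2x threshold is an arbitrary illustration, not a value used by Spark or Dr. Elephant.

```python
def looks_skewed(summary, ratio_threshold=2.0):
    """Flag a stage as skewed when the slowest task (quantile 1.0) runs
    far longer than the median task (quantile 0.5).

    `summary` is the five-element quantile list [min, p25, median, p75, max]
    as proposed for taskMetricsSummary; the 2.0 threshold is an arbitrary
    assumption for illustration.
    """
    median, maximum = summary[2], summary[4]
    return median > 0 and maximum / median > ratio_threshold


print(looks_skewed([118, 120.5, 123.5, 128.75, 900]))  # straggler present
print(looks_skewed([118, 120.5, 123.5, 128.75, 135]))  # balanced stage
```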



> Add new stage-level REST APIs and parameters
> --------------------------------------------
>
>                 Key: SPARK-26399
>                 URL: https://issues.apache.org/jira/browse/SPARK-26399
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.1.0
>            Reporter: Edward Lu
>            Priority: Major
>         Attachments: executorMetricsSummary.json, 
> lispark230_restapi_ex2_stages_withSummaries.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http://<spark history server>:18080/api/v1/applications/<application_id>/<application_attempt>/stages/<stage_id>/<stage_attempt>/executorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, and adding the task summary if specified
>  * executor metric quantiles, and adding the executor summary if specified



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
