[ https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277675#comment-17277675 ]

Attila Zsolt Piros commented on SPARK-33763:
--------------------------------------------

I am ready with the executor removals (1 and 4 from the above list), but I would 
like to discuss stage resubmissions (I think you meant stages rather than jobs) 
and fetch failures.

I thought about these two missing metrics (and checked the code too), and I have 
a suggestion: let's combine them into a single metric, stages resubmitted 
because of fetch failures.

Justification: the number of fetch failures depends heavily on the cluster size 
(and, even worse, on the data). When one executor goes down, all the other 
executors fetching from it will report a fetch failure. So this number is not 
that helpful on its own, as it depends on how many reducers are reading from 
that single mapper.
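To illustrate the point (this is a hypothetical sketch, not Spark code): a single 
executor loss produces one fetch-failure report per reducer reading from it, but 
only one stage resubmission, so the raw failure count inflates with the number of 
reducers while the resubmission count reflects the actual event.

```python
def simulate_executor_loss(num_reducers):
    """One mapper's executor dies: every reducer fetching from it reports
    a fetch failure, but the scheduler resubmits the stage only once."""
    fetch_failures = num_reducers   # each reducer reports one failure
    stages_resubmitted = 1          # one resubmission covers them all
    return fetch_failures, stages_resubmitted

# The same root cause yields wildly different fetch-failure counts:
for reducers in (10, 100, 1000):
    failures, resubmits = simulate_executor_loss(reducers)
    print(f"{reducers} reducers: {failures} fetch failures, "
          f"{resubmits} stage resubmission")
```

The combined metric proposed above would count the resubmission (always 1 per 
event here) rather than the reducer-dependent failure reports.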

[~holden] what do you think? 

> Add metrics for better tracking of dynamic allocation
> -----------------------------------------------------
>
>                 Key: SPARK-33763
>                 URL: https://issues.apache.org/jira/browse/SPARK-33763
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Holden Karau
>            Priority: Major
>
> We should add metrics to track the following:
> 1- Graceful decommissions & DA scheduled deletes
> 2- Jobs resubmitted
> 3- Fetch failures
> 4- Unexpected (e.g. non-Spark triggered) executor removals.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
