warrenzhu25 commented on PR #38441:
URL: https://github.com/apache/spark/pull/38441#issuecomment-1297682478

   > Could you explain how it can prevent your problem completely? That sounds still greedy and best-effort approach to mitigate your issue.
   > 
   > > What I mean is something like below, the memory added is constant regardless of cluster size. @dongjoon-hyun
   > > ```
   > > // limit the set size to 100 or 1000
   > > val executorsRemovedDueToDecom = new HashSet[String]
   > > 
   > > executorsPendingDecommission.remove(executorId)
   > > executorsRemovedDueToDecom.add(executorId)
   > > ```
   
   FetchFailed caused by a decommissioned executor falls into 2 categories:
   
   1. When the FetchFailed reaches DAGScheduler, the executor is still alive, or it is lost but the loss info hasn't reached TaskSchedulerImpl yet. This is already handled in [SPARK-40979](https://issues.apache.org/jira/browse/SPARK-40979).
   2. The FetchFailed is caused by the loss of the decommissioned executor, so the decom info has already been removed from TaskSchedulerImpl. Keeping such info for a short period is good enough. Even if we limit the size of the removed-executor set to 10K, that is at most about 10MB of memory. In practice, it's rare to have a cluster of over 10K executors, and the chance that all of them are decommissioned and lost at the same time is small.
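   To illustrate the bounded-memory idea from category 2, here is a minimal sketch (not the actual Spark patch; the class name, cap, and eviction policy are assumptions for illustration): a FIFO-bounded set that remembers recently decommissioned-and-removed executors, evicting the oldest entry once the cap is reached, so memory stays constant regardless of cluster size.
   
   ```scala
   import scala.collection.mutable
   
   // Hypothetical sketch: a size-capped set of executor IDs.
   // LinkedHashSet preserves insertion order, so `head` is the oldest entry.
   class BoundedExecutorSet(maxSize: Int) {
     private val executors = mutable.LinkedHashSet.empty[String]
   
     def add(executorId: String): Unit = {
       executors += executorId
       if (executors.size > maxSize) {
         // Evict the oldest entry so the set never exceeds maxSize.
         executors -= executors.head
       }
     }
   
     def contains(executorId: String): Boolean = executors.contains(executorId)
   
     def size: Int = executors.size
   }
   ```
   
   With a cap like 10K, a stale FetchFailed arriving shortly after the executor's removal can still be matched against this set, while worst-case memory stays fixed even on very large clusters.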


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
