yaooqinn commented on issue #25964: [SPARK-29287][Core] Add ExecutorConstructed 
message to tell driver which executor is ready for making offers
URL: https://github.com/apache/spark/pull/25964#issuecomment-548652319
 
 
   Hi, @jiangxb1987 
   In high-load clusters, which are in fact the normal production environment, 
we always increase various kinds of timeouts and may turn on blacklisting to 
achieve high stability, and we turn on dynamic allocation to increase resource 
utilization. 
   
   On YARN, for example:
   
   For an executor to register with the local shuffle server on start, we need 
to increase `spark.shuffle.registration.timeout` and 
`spark.shuffle.registration.maxAttempts`. The time cost here can reach 
maxAttempts * (timeout + 5s), which may also be the time spent on a task that 
is doomed to fail. We used to increase these settings to relieve excessive 
pressure on the NM for the sake of stability, but in fact this achieves the 
opposite goal.
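
   To make the arithmetic above concrete, here is a minimal sketch of the 
worst-case bound. The function name and the sample values are illustrative 
assumptions, not Spark code or recommended settings; only the formula 
maxAttempts * (timeout + 5s) comes from the paragraph above.

```python
def worst_case_registration_seconds(max_attempts: int,
                                    timeout_s: float,
                                    retry_sleep_s: float = 5.0) -> float:
    """Upper bound on time spent registering with the shuffle server.

    Implements maxAttempts * (timeout + 5s): every attempt is assumed to
    run into its full timeout and is followed by a fixed sleep.
    """
    return max_attempts * (timeout_s + retry_sleep_s)


# Hypothetical values chosen only to show how the bound grows.
modest = worst_case_registration_seconds(max_attempts=3, timeout_s=5)
raised = worst_case_registration_seconds(max_attempts=10, timeout_s=60)
print(f"modest settings: {modest:.0f}s, raised settings: {raised:.0f}s")
```

With the raised values, a task placed on an executor that never finishes 
registering could wait more than ten minutes before it fails.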
   
   The `HeartbeatReceiver` on the driver side uses 
`spark.storage.blockManagerSlaveTimeoutMs`, which defaults to 
`spark.network.timeout`, to expire dead hosts. Again for stability, we increase 
these timeouts, which also means the driver takes longer to fail the tasks on 
those hosts with `ExecutorLostFailure`.
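
   The fallback described above can be sketched roughly as follows. This is a 
simplified model, not Spark's implementation: the helper name is hypothetical, 
the configuration is a plain dict holding numbers (seconds for 
`spark.network.timeout`, milliseconds for 
`spark.storage.blockManagerSlaveTimeoutMs`), and the 120s fallback is an 
assumed value for illustration.

```python
ASSUMED_NETWORK_TIMEOUT_S = 120.0  # assumption for this sketch only


def executor_expiry_timeout_s(conf: dict) -> float:
    """Seconds of heartbeat silence before a host is treated as dead.

    The block manager timeout wins when it is set explicitly; otherwise
    the network timeout is used as its default.
    """
    network_timeout_s = conf.get("spark.network.timeout",
                                 ASSUMED_NETWORK_TIMEOUT_S)
    explicit_ms = conf.get("spark.storage.blockManagerSlaveTimeoutMs")
    if explicit_ms is not None:
        return explicit_ms / 1000.0
    return network_timeout_s


# Raising only the network timeout also delays expiry of dead hosts.
print(executor_expiry_timeout_s({"spark.network.timeout": 600.0}))
```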
   
   When we turn on blacklisting and dynamic allocation, things sometimes get 
worse. When Spark submits a stage containing thousands of tasks, only a few 
tasks may be left near the end of the stage and, because of dynamic 
allocation, only a few executors are left as well. There is a certain 
probability that these executors are all `not fully constructed` ones that 
nevertheless have tasks scheduled on them. When these tasks fail and 
blacklisting disables those few executors, the tasks have nowhere to go and 
the job fails entirely. In our real cases, I have seen such job failures when 
users run Spark jobs with dynamic allocation minExecutors = 1 and blacklisting 
on.
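
   The failure mode above can be illustrated with a toy model. This is not 
Spark's scheduler: the function, the executor names, and the one-failure 
blacklisting rule are all assumptions made for the sketch.

```python
def run_task(executors, ready, max_failures_per_executor=1):
    """Toy model of one task retried under per-executor blacklisting.

    executors: ids of the executors still allocated to the application.
    ready:     ids of executors that are fully constructed.
    Returns "SUCCESS" or "ABORTED".
    """
    failures = {e: 0 for e in executors}
    blacklisted = set()
    while True:
        candidates = [e for e in executors if e not in blacklisted]
        if not candidates:
            # Every remaining executor is blacklisted for this task.
            return "ABORTED"
        target = candidates[0]
        if target in ready:
            return "SUCCESS"
        # The executor was offered before it was ready, so the task fails.
        failures[target] += 1
        if failures[target] >= max_failures_per_executor:
            blacklisted.add(target)


# One half-constructed executor left (as with minExecutors = 1).
print(run_task(["exec-1"], ready=set()))
# A second, fully constructed executor gives the retry somewhere to go.
print(run_task(["exec-1", "exec-2"], ready={"exec-2"}))
```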
    
