yaooqinn commented on issue #25964: [SPARK-29287][Core] Add ExecutorConstructed message to tell driver which executor is ready for making offers
URL: https://github.com/apache/spark/pull/25964#issuecomment-548652319

Hi @jiangxb1987,

In high-load clusters, which are in fact our normal production environment, we increase various timeouts and may turn on blacklisting to achieve high stability, and we turn on dynamic allocation to increase resource utilization.

Take YARN as an example. For an executor to register with the local shuffle server on start, we need to increase `spark.shuffle.registration.timeout` and `spark.shuffle.registration.maxAttempts`. The time cost here can reach maxAttempts * (timeout + 5s), which may also be the time spent on a task that is doomed to fail. We used to increase these settings for stability, to relieve excessive pressure on the NM, but in fact doing so achieves the opposite.

For the `HeartbeatReceiver` on the driver side, `spark.storage.blockManagerSlaveTimeoutMs` (which defaults to the value of `spark.network.timeout`) is used to expire dead hosts. Again for stability, we increase these timeouts, which also makes the driver take longer to kill the affected tasks with `ExecutorLostFailure`.

With blacklisting and dynamic allocation turned on, things sometimes get worse. When Spark submits a stage containing thousands of tasks, only a few tasks may remain near the end of the stage, and because of dynamic allocation only a few executors remain as well. There is a certain probability that all of these executors are `not fully constructed` ones that nevertheless have tasks scheduled on them. When those tasks fail and blacklisting disables those few executors, the tasks have nowhere to go and the job fails entirely. In our real cases, I have seen such job failures when users run Spark jobs with dynamic allocation minExecutors = 1 and blacklisting on.
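
To make the registration bound above concrete, here is a minimal sketch of the arithmetic only. The function name `worst_case_registration_delay_s` and the sample values are illustrative assumptions, not Spark code, defaults, or recommended settings. It also assumes the timeout is given in milliseconds, and it takes the 5 s per-attempt wait directly from the formula in the comment.

```python
def worst_case_registration_delay_s(max_attempts: int,
                                    timeout_ms: int,
                                    retry_wait_s: float = 5.0) -> float:
    """Upper bound from the comment: maxAttempts * (timeout + 5s), in seconds.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    if max_attempts < 1 or timeout_ms < 0:
        raise ValueError("max_attempts must be >= 1 and timeout_ms >= 0")
    # Each attempt may block for the full timeout and then wait before retrying.
    return max_attempts * (timeout_ms / 1000.0 + retry_wait_s)


# Illustrative values only (assumed, not taken from any real cluster).
small = worst_case_registration_delay_s(max_attempts=3, timeout_ms=5000)
tuned = worst_case_registration_delay_s(max_attempts=10, timeout_ms=60000)

print(f"small settings: up to {small:.0f} s")   # 3 * (5 + 5)   = 30 s
print(f"raised settings: up to {tuned:.0f} s")  # 10 * (60 + 5) = 650 s
```

With the raised values in this sketch, an executor that can never register could hold a scheduled task for more than ten minutes before failing, which is the kind of delay described above.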
