yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1808450794
Hi @tgravescs, thank you for the detailed review.

> 20 is a lot of failures. What is the real issue causing this? ie why are these executors failing?

The failures fall into two kinds. The first kind affects both existing and new executors, e.g. exit code 143 (killed by the resource manager), OOM, etc.; for these it is fine to fail the app with or without this PR. The second kind affects only new executors, e.g. an external dependency file changed by expected or unexpected maintenance, or rejections from the resource manager. This PR mainly focuses on the second kind, to reduce the risk of an app being killed all of a sudden. In that case, 20 is a relatively small number, because the allocation requests and responses go back and forth very quickly.

> How long was the app running? Is it some cloud environment they are going away, is it really an issue with the application or its configuration?

The app I described in the PR description ran for 1.5 hours. It failed because it hit the max executor failures, while the root cause was that one of the shared UDF jars had been changed by a developer who turned out not to be the app owner. YARN failed to bring up new executors, so the 20 failures were collected within 10 seconds.

```
2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container container_e106_1694175944291_7158886_01_000027 on host: x.163.org (state: COMPLETE, exit status: -1000)
2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: container_e106_1694175944291_7158886_01_000027 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - Requesting driver to remove executor 26 for reason Container from a bad node: container_e106_1694175944291_7158886_01_000027 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container container_e106_1694175944291_7158886_01_000029 on host: x.163.org (state: COMPLETE, exit status: -1000)
2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: container_e106_1694175944291_7158886_01_000029 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 26 requested
2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove executor 26 from BlockManagerMaster.
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked to remove non-existent executor 26
2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container container_e106_1694175944291_7158886_01_000028 on host: x.163.org (state: COMPLETE, exit status: -1000)
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - Requesting driver to remove executor 28 for reason Container from a bad node: container_e106_1694175944291_7158886_01_000029 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: container_e106_1694175944291_7158886_01_000028 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 28 requested
2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove executor 28 from BlockManagerMaster.
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked to remove non-existent executor 28
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - Requesting driver to remove executor 27 for reason Container from a bad node: container_e106_1694175944291_7158886_01_000028 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container container_e106_1694175944291_7158886_01_000026 on host: x.163.org (state: COMPLETE, exit status: -1000)
2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: container_e106_1694175944291_7158886_01_000026 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 27 requested
2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove executor 27 from BlockManagerMaster.
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked to remove non-existent executor 27
2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container container_e106_1694175944291_7158886_01_000031 on host: x.163.org (state: COMPLETE, exit status: -1000)
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - Requesting driver to remove executor 25 for reason Container from a bad node: container_e106_1694175944291_7158886_01_000026 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: container_e106_1694175944291_7158886_01_000031 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 25 requested
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked to remove non-existent executor 25
2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove executor 25 from BlockManagerMaster.
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - Requesting driver to remove executor 30 for reason Container from a bad node: container_e106_1694175944291_7158886_01_000031 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container container_e106_1694175944291_7158886_01_000033 on host: x.jd.163.org (state: COMPLETE, exit status: -1000)
2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 30 requested
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked to remove non-existent executor 30
2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove executor 30 from BlockManagerMaster.
2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: container_e106_1694175944291_7158886_01_000033 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container container_e106_1694175944291_7158886_01_000030 on host: x.163.org (state: COMPLETE, exit status: -1000)
2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: container_e106_1694175944291_7158886_01_000030 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - Requesting driver to remove executor 32 for reason Container from a bad node: container_e106_1694175944291_7158886_01_000033 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container container_e106_1694175944291_7158886_01_000032 on host: x.163.org (state: COMPLETE, exit status: -1000)
2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - Requesting driver to remove executor 29 for reason Container from a bad node: container_e106_1694175944291_7158886_01_000030 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: container_e106_1694175944291_7158886_01_000032 on host: x.163.org. Exit status: -1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource x.jar changed on src filesystem (expected 1698924864275, was 1699273405453 .
```

We have a monitor for all our Spark apps on Kubernetes and YARN. The probability of an app failing with max executor failures is low relative to the total number of apps, but it still turns out to be a daily issue.

> how does Spark know it would have finished and those wouldn't have also failed? The point of the feature and the existing settings are that if you have had that many failures something is likely wrong and you need to fix it. it may have been that by letting this go longer it would have just wasted more time and resources if those other ones were also going to fail.

As I answered under the first question, Spark does know whether to finish or to fail, though possibly with some delay: both the failed executors and the live ones are still being counted. Considering the trade-off between delay and reliability, to be honest, I don't have a silver bullet that covers both. So `ratio > 0` is provided to eliminate the delay and fail the app directly.
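For reference, here is a minimal sketch of the existing YARN-side failure-tracking knobs that the discussion above is working against (assuming Spark on YARN; the ratio-based setting introduced by this PR is deliberately omitted because its exact name is not quoted in this comment, and in practice these values are passed at submit time via `--conf` or `spark-defaults.conf` rather than built programmatically):

```scala
import org.apache.spark.SparkConf

// Sketch only: the existing, documented knobs behind an absolute-count policy.
val conf = new SparkConf()
  .setAppName("executor-failure-tracking-sketch")
  // Hard cap on executor failures before the whole application is failed
  // (by default it scales with the requested number of executors).
  .set("spark.yarn.max.executor.failures", "20")
  // Only failures inside this window are counted, so a long-running app is not
  // killed by failures accumulated over its entire lifetime.
  .set("spark.yarn.executor.failuresValidityInterval", "1h")
```

With only such an absolute cap, a burst like the log above exhausts the budget within seconds, which is exactly the kind of sudden kill this PR tries to make more tolerable.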
