yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1808450794

   Hi @tgravescs, thank you for the detailed review.
   
   
   > 20 is a lot of failures. What is the real issue causing this? ie why are 
these executors failing? 
   
   The failures can be divided into two kinds. The first kind affects both existing and new executors, e.g. exit code 143 (killed by the resource manager), OOM, etc.; with or without this PR, it is acceptable for such failures to fail the app. The second kind affects only new executors, e.g. an external dependency file changed by expected or unexpected maintenance, or rejections from the resource manager; this is the case the PR mainly focuses on, to reduce the risk of an app being killed all of a sudden. In the second case, 20 is a relatively small number, because the allocation requests and responses go back and forth very quickly. 
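   To illustrate the point (a toy sketch only, not Spark's allocator code, and the names here are mine): both failure kinds are charged against the same fixed budget, but the second kind arrives in a tight allocation request/response loop, so a budget of 20 is spent within seconds.
   
   ```scala
   // Illustrative sketch only; not the YarnAllocator implementation.
   // Runtime failures (an executor dies after doing work) and allocation-time
   // failures (a container is rejected before it ever runs) both consume the
   // same budget, but the latter arrive back-to-back.
   final case class FailureBudget(max: Int, var used: Int = 0) {
     // Returns true once the app should be aborted.
     def charge(): Boolean = {
       used += 1
       used >= max
     }
   }
   
   object FailureBudgetDemo extends App {
     val budget = FailureBudget(max = 20)
     // A burst of localization rejections such as
     // "Resource x.jar changed on src filesystem".
     val aborted = (1 to 20).exists(_ => budget.charge())
     println(s"aborted=$aborted after ${budget.used} failures")
   }
   ```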
   
   > How long was the app running? Is it some cloud environment they are going 
away, is it really an issue with the application or its configuration?
   
   The app I described in the PR description ran for 1.5 hours. It failed because it hit the max executor failures, while the root cause was that one of the shared UDF jars had been changed by a developer, who turned out not to be the app owner. YARN failed to bring up new executors, so the 20 failures were collected within 10 seconds.
   
   ```
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_000027 on host: x.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_000027 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - 
Requesting driver to remove executor 26 for reason Container from a bad node: 
container_e106_1694175944291_7158886_01_000027 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_000029 on host: x.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_000029 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 26 
requested
   2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove 
executor 26 from BlockManagerMaster.
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked 
to remove non-existent executor 26
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_000028 on host: x.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - 
Requesting driver to remove executor 28 for reason Container from a bad node: 
container_e106_1694175944291_7158886_01_000029 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_000028 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 28 
requested
   2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove 
executor 28 from BlockManagerMaster.
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked 
to remove non-existent executor 28
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - 
Requesting driver to remove executor 27 for reason Container from a bad node: 
container_e106_1694175944291_7158886_01_000028 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_000026 on host: x.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_000026 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 27 
requested
   2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove 
executor 27 from BlockManagerMaster.
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked 
to remove non-existent executor 27
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_000031 on host: x.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - 
Requesting driver to remove executor 25 for reason Container from a bad node: 
container_e106_1694175944291_7158886_01_000026 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_000031 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 25 
requested
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked 
to remove non-existent executor 25
   2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove 
executor 25 from BlockManagerMaster.
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - 
Requesting driver to remove executor 30 for reason Container from a bad node: 
container_e106_1694175944291_7158886_01_000031 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_000033 on host: x.jd.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 30 
requested
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked 
to remove non-existent executor 30
   2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove 
executor 30 from BlockManagerMaster.
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_000033 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_000030 on host: x.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_000030 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - 
Requesting driver to remove executor 32 for reason Container from a bad node: 
container_e106_1694175944291_7158886_01_000033 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_000032 on host: x.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - 
Requesting driver to remove executor 29 for reason Container from a bad node: 
container_e106_1694175944291_7158886_01_000030 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_000032 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.316]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   ```
   
   We have a monitor for all our Spark apps on Kubernetes and YARN. The probability of an app failing with max executor failures is low relative to the total number of apps, but it turns out to be a daily issue. See:
   
   
![image](https://github.com/apache/spark/assets/8326978/73f5adc7-15bf-4f79-8d06-a19abedd219c)
   
   
![image](https://github.com/apache/spark/assets/8326978/9a2b0ebd-126e-441e-81dc-a7ab7ff67e76)
   
   > how does Spark know it would have finished and those wouldn't have also 
failed? The point of the feature and the existing settings are that if you have 
had that many failures something is likely wrong and you need to fix it. it may 
have been that by letting this go longer it would have just wasted more time 
and resources if those other ones were also going to fail.
   
   As answered in the first question, Spark does know whether to finish or fail, though possibly with a delay, since both the failed executors and the live ones are still being counted. Considering the trade-off between the delay and reliability, TBH, I haven't got a silver bullet that covers both. So, `ratio > 0` is provided to eliminate the delay and fail the app directly.
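   To make the `ratio > 0` remark a bit more concrete, here is a minimal sketch under my own assumptions (the real config name and exact formula are in the PR diff, not in this comment), comparing the share of failed executors among failed plus live ones against a configured ratio:
   
   ```scala
   // Hypothetical sketch only; names and semantics are assumptions, not the
   // precise behavior introduced by this PR.
   def failureRatioExceeded(failedExecutors: Int, liveExecutors: Int, ratio: Double): Boolean = {
     val total = failedExecutors + liveExecutors
     // ratio <= 0 keeps the check disabled; otherwise compare the failure share.
     ratio > 0 && total > 0 && failedExecutors.toDouble / total >= ratio
   }
   ```
   
   Under this reading, a ratio greater than zero lets the app fail directly once the failed share crosses the threshold, which is how I mean "eliminate the delay" above.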
   
   
   
   
   

