Prashant Sharma created SPARK-32371:
---------------------------------------
Summary: Autodetect persistently failing executor pods and fail
the application logging the cause.
Key: SPARK-32371
URL: https://issues.apache.org/jira/browse/SPARK-32371
Project: Spark
Issue Type: Improvement
Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Prashant Sharma
{code:java}
[root@kyok-test-1 ~]# kubectl get po -w
NAME READY STATUS RESTARTS AGE
spark-shell-a3962a736bf9e775-exec-36 1/1 Running 0 5s
spark-shell-a3962a736bf9e775-exec-37 1/1 Running 0 3s
spark-shell-a3962a736bf9e775-exec-36 0/1 Error 0 5s
spark-shell-a3962a736bf9e775-exec-38 0/1 Pending 0 1s
spark-shell-a3962a736bf9e775-exec-38 0/1 Pending 0 1s
spark-shell-a3962a736bf9e775-exec-38 0/1 ContainerCreating 0 1s
spark-shell-a3962a736bf9e775-exec-36 0/1 Terminating 0 6s
spark-shell-a3962a736bf9e775-exec-36 0/1 Terminating 0 6s
spark-shell-a3962a736bf9e775-exec-37 0/1 Error 0 5s
spark-shell-a3962a736bf9e775-exec-38 1/1 Running 0 2s
spark-shell-a3962a736bf9e775-exec-39 0/1 Pending 0 0s
spark-shell-a3962a736bf9e775-exec-39 0/1 Pending 0 0s
spark-shell-a3962a736bf9e775-exec-39 0/1 ContainerCreating 0 0s
spark-shell-a3962a736bf9e775-exec-37 0/1 Terminating 0 6s
spark-shell-a3962a736bf9e775-exec-37 0/1 Terminating 0 6s
spark-shell-a3962a736bf9e775-exec-38 0/1 Error 0 4s
spark-shell-a3962a736bf9e775-exec-39 1/1 Running 0 1s
spark-shell-a3962a736bf9e775-exec-40 0/1 Pending 0 0s
spark-shell-a3962a736bf9e775-exec-40 0/1 Pending 0 0s
spark-shell-a3962a736bf9e775-exec-40 0/1 ContainerCreating 0 0s
spark-shell-a3962a736bf9e775-exec-38 0/1 Terminating 0 5s
spark-shell-a3962a736bf9e775-exec-38 0/1 Terminating 0 5s
spark-shell-a3962a736bf9e775-exec-39 0/1 Error 0 3s
spark-shell-a3962a736bf9e775-exec-40 1/1 Running 0 1s
spark-shell-a3962a736bf9e775-exec-41 0/1 Pending 0 0s
spark-shell-a3962a736bf9e775-exec-41 0/1 Pending 0 0s
spark-shell-a3962a736bf9e775-exec-41 0/1 ContainerCreating 0 0s
spark-shell-a3962a736bf9e775-exec-39 0/1 Terminating 0 4s
spark-shell-a3962a736bf9e775-exec-39 0/1 Terminating 0 4s
spark-shell-a3962a736bf9e775-exec-41 1/1 Running 0 2s
spark-shell-a3962a736bf9e775-exec-40 0/1 Error 0 4s
spark-shell-a3962a736bf9e775-exec-42 0/1 Pending 0 0s
spark-shell-a3962a736bf9e775-exec-42 0/1 Pending 0 0s
spark-shell-a3962a736bf9e775-exec-42 0/1 ContainerCreating 0 0s
spark-shell-a3962a736bf9e775-exec-40 0/1 Terminating 0 4s
spark-shell-a3962a736bf9e775-exec-40 0/1 Terminating 0 4s
{code}
This results in a cascade of executor pods, each created and then terminated
within 3-4 seconds. It is difficult to read the logs of pods that are created
and deleted this quickly.
Thankfully, there is an option
{code:java}
spark.kubernetes.executor.deleteOnTermination false
{code}
to turn off the automatic deletion of executor pods, which gives us an
opportunity to diagnose the problem. However, pod retention is not enabled by
default (the option defaults to true), so one may have to guess what caused the
problem in the previous run and then rerun the application with exactly the
same setup to reproduce it.
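Since the option has to be set before the failing run, it is usually passed at
submission time. A minimal sketch that assembles a spark-submit command line
with the setting; the helper name build_submit_command, the master URL and the
application path are placeholders for illustration, not values from this report:

```python
def build_submit_command(master, app, conf):
    """Assemble a spark-submit argument list with one --conf per setting."""
    cmd = ["spark-submit", "--master", master]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# Hypothetical master URL and application path, for illustration only.
command = build_submit_command(
    master="k8s://https://kubernetes.example.com:6443",
    app="local:///opt/spark/app.py",
    conf={"spark.kubernetes.executor.deleteOnTermination": "false"},
)
print(" ".join(command))
```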
So it would be good if we could detect this situation, where pods fail as soon
as they start or fail on a particular task, capture the error that caused the
pod to terminate, relay it back to the driver, and log it.
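As a rough illustration of what "capture the error" could mean, the sketch
below pulls the exit code, reason and message of terminated containers out of
a Pod object as returned by kubectl get pod <name> -o json. It relies only on
the status.containerStatuses[].state.terminated fields of the Kubernetes Pod
API. The function name termination_causes and the sample pod are made up for
this example, and this is not how Spark itself handles it:

```python
import json


def termination_causes(pod):
    """Return one summary dict per terminated container in a parsed Pod object."""
    causes = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        # An exited container reports details under state.terminated; a
        # restarted one keeps the previous exit under lastState.terminated.
        terminated = (status.get("state", {}).get("terminated")
                      or status.get("lastState", {}).get("terminated"))
        if terminated:
            causes.append({
                "container": status.get("name"),
                "exit_code": terminated.get("exitCode"),
                "reason": terminated.get("reason"),
                "message": terminated.get("message"),
            })
    return causes


# Hand-written sample, not output captured from the failing cluster.
sample_pod = json.loads("""
{
  "metadata": {"name": "example-exec-1"},
  "status": {
    "phase": "Failed",
    "containerStatuses": [
      {"name": "executor",
       "state": {"terminated": {"exitCode": 1, "reason": "Error"}}}
    ]
  }
}
""")
for cause in termination_causes(sample_pod):
    print(cause)
```

The exit code and reason alone rarely explain a failure, so a real
implementation would probably also need the tail of the container log.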
Alternatively, if we can auto-detect this situation, we can also stop creating
more executor pods and fail the application with an appropriate error, while
retaining the last failed pod for the user's further investigation.
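One possible detection rule, sketched here purely for discussion, is to count
executors that die shortly after starting and give up once too many do so
within a time window. The class name ExecutorFailureTracker and all thresholds
below are assumptions chosen for the example, and the event times are only
loosely modelled on the pod listing in this report:

```python
from collections import deque


class ExecutorFailureTracker:
    """Flags a persistent failure when too many executors die too young."""

    def __init__(self, max_failures=5, window_seconds=60.0, min_lifetime_seconds=10.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.min_lifetime_seconds = min_lifetime_seconds
        self._failures = deque()  # (failure_time, pod_name), oldest first

    def record_failure(self, pod_name, started_at, failed_at):
        """Record one executor failure; return True if the app should fail."""
        # Only early deaths count towards the limit.
        if failed_at - started_at <= self.min_lifetime_seconds:
            self._failures.append((failed_at, pod_name))
        # Forget failures that fell out of the time window.
        while self._failures and failed_at - self._failures[0][0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures

    def last_failed_pod(self):
        """Name of the most recent early failure, the pod worth retaining."""
        return self._failures[-1][1] if self._failures else None


tracker = ExecutorFailureTracker(max_failures=5, window_seconds=60.0)
# (pod name, start time, failure time) in seconds; illustrative values.
events = [("exec-36", 0.0, 5.0), ("exec-37", 2.0, 7.0), ("exec-38", 6.0, 10.0),
          ("exec-39", 9.0, 12.0), ("exec-40", 11.0, 15.0)]
for name, started, failed in events:
    if tracker.record_failure(name, started, failed):
        print(f"stop allocating executors; keep {tracker.last_failed_pod()} for inspection")
```

Counting only short-lived executors keeps ordinary failures of long-running
executors from tripping the limit, at the cost of one more threshold to tune.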
How this can be achieved has not been evaluated yet, but the feature could be
useful as K8s becomes a preferred choice for deploying Spark.
Logging this issue for further investigation and work.