shaneknapp edited a comment on issue #23514: [SPARK-24902][K8s] Add PV integration tests URL: https://github.com/apache/spark/pull/23514#issuecomment-459518248 ok, so i've figured out *what's* happening, just not *why*... when the PV tests start, it creates the pods successfully, but it fails (across all workers) to launch the `spark-kubernetes-driver`. however, it successfully launches this container for the previous builds. compare a successful launch output to the failed, in https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/7648/artifact/resource-managers/kubernetes/integration-tests/target/integration-tests.log: successful: ``` 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: 2019-01-31 11:35:51 INFO LoggingPodStatusWatcherImpl:57 - State changed, new state: 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: pod name: spark-test-app-864b470da34848bfa3b7639afd063d1a 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: namespace: 629aed1fc5aa43459461481981ea34e1 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: labels: spark-app-locator -> 4433c6b598624c498096954aad041402, spark-app-selector -> spark-2bbc2c2903be49808430cd2ac97c75e4, spark-role -> driver, template-label-key -> driver-template-label-value 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: pod uid: 6b02e7b4-258f-11e9-8053-442018636347 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: creation time: 2019-01-31T19:35:51Z 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: service account name: default 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: volumes: spark-local-dir-1, pod-template-volume, spark-conf-volume, default-token-vtv9b 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: node name: minikube 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: start time: 2019-01-31T19:35:51Z 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: phase: Pending 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: container status: 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: container name: test-driver-container <---- LOOK, IT WORKED! 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: container image: docker.io/kubespark/spark:7d7f1fb1-b9cf-415c-a885-f8b79b525b83 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: container state: waiting 19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: pending reason: ContainerCreating 19/01/31 11:35:52.036 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: 2019-01-31 11:35:52 INFO Client:57 - Deployed Spark application spark-test-app into Kubernetes. 19/01/31 11:35:52.054 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: 2019-01-31 11:35:52 INFO ShutdownHookManager:57 - Shutdown hook called 19/01/31 11:35:52.055 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: 2019-01-31 11:35:52 INFO ShutdownHookManager:57 - Deleting directory /tmp/spark-50a3bf8d-6584-492b-95da-2fffe26906b6 19/01/31 11:35:52.191 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: Beginning watch of executors 19/01/31 11:36:00.253 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: Ending watch of executors 19/01/31 11:36:08.443 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: ===== FINISHED o.a.s.deploy.k8s.integrationtest.KubernetesSuite: 'Start pod creation from template' ===== ``` failed (pay attention to the timestamps): ``` 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: 2019-01-31 11:36:19 INFO LoggingPodStatusWatcherImpl:57 - State changed, new state: 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: pod name: spark-test-app-fd05666a014749b5beb04dc194e39b6f 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: namespace: 629aed1fc5aa43459461481981ea34e1 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: labels: spark-app-locator -> 139da53bf4a34d0295cc3c2c8fb58baa, spark-app-selector -> spark-bee7d4dac4ac4a2ebd54a59d0856e1c3, spark-role -> driver 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: pod uid: 7bb7504a-258f-11e9-8053-442018636347 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: creation time: 2019-01-31T19:36:19Z 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: service account name: default 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: volumes: spark-local-dir-1, data, spark-conf-volume, default-token-hhmbr 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: node name: N/A 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: start time: N/A 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: phase: Pending 19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: container status: N/A <---- LOOK IT DIDN'T WORK! 19/01/31 11:36:20.033 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: 2019-01-31 11:36:20 INFO Client:57 - Deployed Spark application spark-test-app into Kubernetes. 19/01/31 11:36:20.040 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: 2019-01-31 11:36:20 INFO ShutdownHookManager:57 - Shutdown hook called 19/01/31 11:36:20.041 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: 2019-01-31 11:36:20 INFO ShutdownHookManager:57 - Deleting directory /tmp/spark-264248f2-a858-4837-adea-e459df8a027f 19/01/31 11:36:20.231 OkHttp https://192.168.39.230:8443/... ERROR ExecWebSocketListener: Exec Failure: HTTP:400. Message:pod spark-test-app-fd05666a014749b5beb04dc194e39b6f does not have a host assigned java.net.ProtocolException: Expected HTTP 101 response but was '400 Bad Request' at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:219) at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:186) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` note how we don't see that the container is running when the pod is initialized. then, a few seconds later the http response changes from 400 to 500: ``` 19/01/31 11:36:20.767 OkHttp https://192.168.39.230:8443/... ERROR ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found ("spark-kubernetes-driver") java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error' ``` meanwhile, in the output from `kubectl --namespace=629aed1fc5aa43459461481981ea34e1 get pods --show-all`, i see the pod status for `spark-test-app-fd05666a014749b5beb04dc194e39b6f` change at the following timestamps: Thu Jan 31 11:36:19 PST 2019: Pending Thu Jan 31 11:36:20 PST 2019: ContainerCreating Thu Jan 31 11:36:21 PST 2019: Running Thu Jan 31 11:36:24 PST 2019: Error cross-checking the `integration-tests.log` output on jenkins for the above timestamps shows that once the pod is marked as `Running`, we switch from the 400 to 500 error. two minutes later, the test times out and fails. so basically, what i think is happening is that the PVTestSuite isn't actually launching the `spark-kubernetes-driver` container (duh), and for some reason the `driverPodChecker` call in `runDFSReadWriteAndVerifyCompletion` isn't catching the fact that the container isn't running. thoughts? i can't dive much deeper in to the scala debugging as the extent of my experience there is an online tutorial a couple of years back. ;)
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
