shaneknapp edited a comment on issue #23514: [SPARK-24902][K8s] Add PV 
integration tests
URL: https://github.com/apache/spark/pull/23514#issuecomment-459518248
 
 
   ok, so i've figured out *what's* happening, just not *why*...
   
   when the PV tests start, they create the pods successfully, but fail 
(across all workers) to launch the `spark-kubernetes-driver` container.  
however, this container launched successfully in the previous builds.
   
   compare the output of a successful launch to the failed one, in 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/7648/artifact/resource-managers/kubernetes/integration-tests/target/integration-tests.log:
   
   successful:
   ```
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils: 2019-01-31 11:35:51 INFO  LoggingPodStatusWatcherImpl:57 - State 
changed, new state: 
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       pod name: spark-test-app-864b470da34848bfa3b7639afd063d1a
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       namespace: 629aed1fc5aa43459461481981ea34e1
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       labels: spark-app-locator -> 
4433c6b598624c498096954aad041402, spark-app-selector -> 
spark-2bbc2c2903be49808430cd2ac97c75e4, spark-role -> driver, 
template-label-key -> driver-template-label-value
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       pod uid: 6b02e7b4-258f-11e9-8053-442018636347
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       creation time: 2019-01-31T19:35:51Z
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       service account name: default
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       volumes: spark-local-dir-1, pod-template-volume, 
spark-conf-volume, default-token-vtv9b
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       node name: minikube
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       start time: 2019-01-31T19:35:51Z
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       phase: Pending
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       container status: 
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:               container name: test-driver-container   <---- LOOK, 
IT WORKED!
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:               container image: 
docker.io/kubespark/spark:7d7f1fb1-b9cf-415c-a885-f8b79b525b83
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:               container state: waiting
   19/01/31 11:35:51.860 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:               pending reason: ContainerCreating
   19/01/31 11:35:52.036 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils: 2019-01-31 11:35:52 INFO  Client:57 - Deployed Spark application 
spark-test-app into Kubernetes.
   19/01/31 11:35:52.054 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils: 2019-01-31 11:35:52 INFO  ShutdownHookManager:57 - Shutdown hook 
called
   19/01/31 11:35:52.055 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils: 2019-01-31 11:35:52 INFO  ShutdownHookManager:57 - Deleting 
directory /tmp/spark-50a3bf8d-6584-492b-95da-2fffe26906b6
   19/01/31 11:35:52.191 ScalaTest-main-running-KubernetesSuite INFO 
KubernetesSuite: Beginning watch of executors
   19/01/31 11:36:00.253 ScalaTest-main-running-KubernetesSuite INFO 
KubernetesSuite: Ending watch of executors
   19/01/31 11:36:08.443 ScalaTest-main-running-KubernetesSuite INFO 
KubernetesSuite: 
   
   ===== FINISHED o.a.s.deploy.k8s.integrationtest.KubernetesSuite: 'Start pod 
creation from template' =====
   ```
   
   failed (pay attention to the timestamps):
   ```
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils: 2019-01-31 11:36:19 INFO  LoggingPodStatusWatcherImpl:57 - State 
changed, new state: 
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       pod name: spark-test-app-fd05666a014749b5beb04dc194e39b6f
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       namespace: 629aed1fc5aa43459461481981ea34e1
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       labels: spark-app-locator -> 
139da53bf4a34d0295cc3c2c8fb58baa, spark-app-selector -> 
spark-bee7d4dac4ac4a2ebd54a59d0856e1c3, spark-role -> driver
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       pod uid: 7bb7504a-258f-11e9-8053-442018636347
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       creation time: 2019-01-31T19:36:19Z
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       service account name: default
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       volumes: spark-local-dir-1, data, spark-conf-volume, 
default-token-hhmbr
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       node name: N/A
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       start time: N/A
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       phase: Pending
   19/01/31 11:36:19.896 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils:       container status: N/A   <----  LOOK IT DIDN'T WORK!
   19/01/31 11:36:20.033 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils: 2019-01-31 11:36:20 INFO  Client:57 - Deployed Spark application 
spark-test-app into Kubernetes.
   19/01/31 11:36:20.040 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils: 2019-01-31 11:36:20 INFO  ShutdownHookManager:57 - Shutdown hook 
called
   19/01/31 11:36:20.041 ScalaTest-main-running-KubernetesSuite INFO 
ProcessUtils: 2019-01-31 11:36:20 INFO  ShutdownHookManager:57 - Deleting 
directory /tmp/spark-264248f2-a858-4837-adea-e459df8a027f
   19/01/31 11:36:20.231 OkHttp https://192.168.39.230:8443/... ERROR 
ExecWebSocketListener: Exec Failure: HTTP:400. Message:pod 
spark-test-app-fd05666a014749b5beb04dc194e39b6f does not have a host assigned
   java.net.ProtocolException: Expected HTTP 101 response but was '400 Bad 
Request'
        at 
okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:219)
        at 
okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:186)
        at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153)
        at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   ```
   note how we never see a container status when the pod is 
initialized: it's reported as `N/A`.
   
   then, half a second later, the http response changes from 400 to 500:
   ```
   19/01/31 11:36:20.767 OkHttp https://192.168.39.230:8443/... ERROR 
ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
("spark-kubernetes-driver")
   java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
Server Error'
   ```
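   
   the two errors line up with the pod's lifecycle: before the pod is scheduled 
there's no `spec.nodeName`, so the apiserver rejects the exec with a 400 ("does 
not have a host assigned"); once it's scheduled but the container hasn't been 
created (or has already died), you get the 500 ("container not found").  a rough 
python sketch of that mapping (the field names follow the k8s pod status schema; 
the helper itself is hypothetical, just to illustrate the two failure modes):
   ```python
   # hypothetical helper: classify why an exec against a pod would fail,
   # mirroring the 400/500 responses seen in the logs above.
   def exec_failure_reason(pod, container="spark-kubernetes-driver"):
       spec = pod.get("spec", {})
       status = pod.get("status", {})
       if not spec.get("nodeName"):
           # pod not scheduled to a node yet -> apiserver answers HTTP 400
           return "400: pod does not have a host assigned"
       names = {c.get("name") for c in status.get("containerStatuses", [])}
       if container not in names:
           # pod scheduled, but the container was never created -> HTTP 500
           return f'500: container not found ("{container}")'
       return None  # exec should be possible

   # the failed pod from the log: Pending, no node assigned yet
   pending = {"spec": {}, "status": {"phase": "Pending"}}
   print(exec_failure_reason(pending))  # -> 400: pod does not have a host assigned
   ```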
   
   meanwhile, in the output from `kubectl 
--namespace=629aed1fc5aa43459461481981ea34e1 get pods --show-all`, i see the 
pod status for `spark-test-app-fd05666a014749b5beb04dc194e39b6f` change at the 
following timestamps:
   
   ```
   Thu Jan 31 11:36:19 PST 2019:  Pending
   Thu Jan 31 11:36:20 PST 2019:  ContainerCreating
   Thu Jan 31 11:36:21 PST 2019:  Running
   Thu Jan 31 11:36:24 PST 2019:  Error
   ```
   
   cross-checking the `integration-tests.log` output on jenkins against the above 
timestamps shows that once the pod is marked as `Running`, we switch from the 
400 error to the 500.  two minutes later, the test times out and fails.
   
   so basically, what i think is happening is that the PVTestSuite isn't 
actually launching the `spark-kubernetes-driver` container (duh), and for some 
reason the `driverPodChecker` call in `runDFSReadWriteAndVerifyCompletion` 
isn't catching that the container never ran.
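   
   if that's right, a stricter check would catch it: instead of only looking at 
the pod phase, wait until the expected container actually shows up in 
`containerStatuses` with a running state.  the real `driverPodChecker` is scala; 
this is just a python sketch of the kind of predicate i mean (all names 
hypothetical):
   ```python
   # hypothetical predicate: a pod only counts as "driver ready" when the
   # named container exists in containerStatuses and reports a running state.
   def driver_container_running(pod, container="spark-kubernetes-driver"):
       for cs in pod.get("status", {}).get("containerStatuses", []):
           if cs.get("name") == container:
               return "running" in cs.get("state", {})
       # container not reported at all -- exactly the failure in the log,
       # where `container status: N/A` still slipped past the checker
       return False

   # the failing pod: phase Pending, no container statuses at all
   print(driver_container_running({"status": {"phase": "Pending"}}))  # -> False
   ```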
   
   thoughts?  i can't dive much deeper into the scala debugging, as the extent 
of my experience there is an online tutorial from a couple of years back.  ;)
   
