[ https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16474770#comment-16474770 ]

Anirudh Ramanathan commented on SPARK-24266:
--------------------------------------------

Thanks for filing this issue and digging in. It does seem like the client 
should be responsible for reconnecting transparently when it sees this 
particular error, because the compaction has run and the window of valid 
resource versions has changed. I'd double-check what the go-client does. 
If it does reconnect transparently (as I suspect it does), we should change the 
fabric8 client-side logic to do the same.
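For reference, a caller-side workaround could look roughly like the sketch 
below. It is written against the fabric8 Watcher API purely as an illustration; 
the ResilientPodWatcher class, its start() helper, and the wiring into the 
submission client are hypothetical, not what Spark's LoggingPodStatusWatcherImpl 
does today.

{code}
import java.net.HttpURLConnection

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientException, Watch, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

// Sketch: a pod watcher that re-registers itself when the watch closes with
// HTTP 410 (Gone), i.e. "too old resource version" after an etcd compaction,
// instead of treating the closed watch as application completion.
class ResilientPodWatcher(client: KubernetesClient, podName: String) extends Watcher[Pod] {

  @volatile private var watch: Watch = _

  def start(): Unit = {
    // A fresh watch without a resourceVersion begins from the pod's current state.
    watch = client.pods().withName(podName).watch(this)
  }

  override def eventReceived(action: Action, pod: Pod): Unit = {
    // Track and log pod state here, as LoggingPodStatusWatcherImpl does today.
  }

  override def onClose(e: KubernetesClientException): Unit = {
    if (e != null && e.getCode == HttpURLConnection.HTTP_GONE) {
      // The stored resourceVersion was compacted away; re-watch instead of giving up.
      start()
    } else {
      // Any other close falls through to the normal shutdown path.
      if (watch != null) watch.close()
    }
  }
}
{code}

Doing the equivalent reconnect inside fabric8's WatchConnectionManager instead 
would keep it transparent to every caller, which is the direction suggested 
above.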

> Spark client terminates while driver is still running
> -----------------------------------------------------
>
>                 Key: SPARK-24266
>                 URL: https://issues.apache.org/jira/browse/SPARK-24266
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Chun Chen
>            Priority: Major
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>        pod name: spark-64-293-980-1526021412180-driver
>        namespace: tione-603074457
>        labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>        pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>        creation time: 2018-05-11T06:50:17Z
>        service account name: default
>        volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>        node name: N/A
>        start time: N/A
>        container images: N/A
>        phase: Pending
>        status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>        pod name: spark-64-293-980-1526021412180-driver
>        namespace: tione-603074457
>        labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>        pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>        creation time: 2018-05-11T06:50:17Z
>        service account name: default
>        volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>        node name: tbds-100-98-45-69
>        start time: N/A
>        container images: N/A
>        phase: Pending
>        status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>        pod name: spark-64-293-980-1526021412180-driver
>        namespace: tione-603074457
>        labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>        pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>        creation time: 2018-05-11T06:50:17Z
>        service account name: default
>        volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>        node name: tbds-100-98-45-69
>        start time: 2018-05-11T06:50:17Z
>        container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>        phase: Pending
>        status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>        pod name: spark-64-293-980-1526021412180-driver
>        namespace: tione-603074457
>        labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>        pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>        creation time: 2018-05-11T06:50:17Z
>        service account name: default
>        volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>        node name: tbds-100-98-45-69
>        start time: 2018-05-11T06:50:17Z
>        container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>        phase: Pending
>        status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:27 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>        pod name: spark-64-293-980-1526021412180-driver
>        namespace: tione-603074457
>        labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>        pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>        creation time: 2018-05-11T06:50:17Z
>        service account name: default
>        volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>        node name: tbds-100-98-45-69
>        start time: 2018-05-11T06:50:17Z
>        container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>        phase: Pending
>        status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:28 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>        pod name: spark-64-293-980-1526021412180-driver
>        namespace: tione-603074457
>        labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>        pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>        creation time: 2018-05-11T06:50:17Z
>        service account name: default
>        volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>        node name: tbds-100-98-45-69
>        start time: 2018-05-11T06:50:17Z
>        container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>        phase: Running
>        status: 
> [ContainerStatus(containerID=docker://e2dbfbe5dd1e9c0e0e3ac045f0e17c5c0c80e684b273ca01daf80d908a1e1368,
>  image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, 
> imageID=docker://dfa88515def0ade6874ba13ce40a7073fbdb84eab23b47b3237853e93bde9a02,
>  lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=true, 
> restartCount=0, 
> state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2018-05-11T06:50:27Z,
>  additionalProperties={}), additionalProperties={}), terminated=null, 
> waiting=null, additionalProperties={}), additionalProperties={})]
> 18/05/11 15:40:28 WARN WatchConnectionManager: Exec Failure
> java.io.EOFException
>       at okio.RealBufferedSource.require(RealBufferedSource.java:60)
>       at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
>       at 
> okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
>       at 
> okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
>       at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
>       at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
>       at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
>       at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> 18/05/11 15:40:28 INFO WatchConnectionManager: Current reconnect backoff is 
> 1000 milliseconds (T0)
> 18/05/11 16:30:30 WARN WatchConnectionManager: Exec Failure
> java.io.EOFException
>       at okio.RealBufferedSource.require(RealBufferedSource.java:60)
>       at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
>       at 
> okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
>       at 
> okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
>       at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
>       at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
>       at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
>       at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> 18/05/11 16:30:30 INFO WatchConnectionManager: Current reconnect backoff is 
> 1000 milliseconds (T0)
> 18/05/11 16:30:31 INFO LoggingPodStatusWatcherImpl: Container final statuses:
>        Container name: spark-kubernetes-driver
>        Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>        Container state: Running
>        Container started at: 2018-05-11T06:50:27Z
> 18/05/11 16:30:31 INFO Client: Application spark-64-293-980 finished.
> 18/05/11 16:30:31 INFO ShutdownHookManager: Shutdown hook called
> 18/05/11 16:30:31 INFO ShutdownHookManager: Deleting directory 
> /tmp/uploaded-files-ee69f033-7db5-4803-ba35-835f5e3caaeb
> 18/05/11 16:30:31 INFO ShutdownHookManager: Deleting directory 
> /tmp/uploaded-jars-4ab0fbfd-be75-40e3-a3fd-49a7f20b544f
> {code}
> This is the client log from when we encountered the issue. The Spark client 
> terminates while the driver is still running; see the last few lines of the log:
> {code}
> 18/05/11 16:30:30 INFO WatchConnectionManager: Current reconnect backoff is 
> 1000 milliseconds (T0)
> 18/05/11 16:30:31 INFO LoggingPodStatusWatcherImpl: Container final statuses:
>        Container name: spark-kubernetes-driver
>        Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>        Container state: Running
>        Container started at: 2018-05-11T06:50:27Z
> {code}
> After some debugging, I added logging in both Spark and kubernetes-client:
> {code}
> # spark patch
> --- 
> a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
> +++ 
> b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
> @@ -69,6 +69,7 @@ private[k8s] class LoggingPodStatusWatcherImpl(
>  
>    override def eventReceived(action: Action, pod: Pod): Unit = {
>      this.pod = Option(pod)
> +    logInfo(s"action $action")
>      action match {
>        case Action.DELETED =>
>          podFinalStatus = POD_FAILED
> @@ -92,7 +93,7 @@ private[k8s] class LoggingPodStatusWatcherImpl(
>    }
>  
>    override def onClose(e: KubernetesClientException): Unit = {
> -    logDebug(s"Stopping watching application $appId with last-observed phase 
> $phase")
> +    logInfo(s"Stopping watching application $appId with last-observed phase 
> $phase, exception $e")
>      closeWatch()
>    }
> # kubernetes-client
> diff --git 
> a/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
>  
> b/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
> index b82b64e6..20587ef0 100644
> --- 
> a/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
> +++ 
> b/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java
> @@ -129,6 +129,7 @@ public class WatchConnectionManager<T extends 
> HasMetadata, L extends KubernetesR
>          httpUrlBuilder.addQueryParameter("fieldSelector", fieldQueryString);
>        }
>      }
> +    logger.info("resourceVersion " + resourceVersion.get());
>  
>      if (this.resourceVersion.get() != null) {
>        httpUrlBuilder.addQueryParameter("resourceVersion", 
> this.resourceVersion.get());
> @@ -214,12 +215,14 @@ public class WatchConnectionManager<T extends 
> HasMetadata, L extends KubernetesR
>          try {
>            WatchEvent event = readWatchEvent(message);
>            Object object = event.getObject();
> +          logger.info("resourceVersion " + resourceVersion.get());
>            if (object instanceof HasMetadata) {
>              @SuppressWarnings("unchecked")
>              T obj = (T) object;
>              // Dirty cast - should always be valid though
>              String currentResourceVersion = resourceVersion.get();
>              String newResourceVersion = ((HasMetadata) 
> obj).getMetadata().getResourceVersion();
> +            logger.info("currentResourceVersion " + currentResourceVersion + 
> " newResourceVersion " + newResourceVersion);
>              if (currentResourceVersion == null || 
> currentResourceVersion.compareTo(newResourceVersion) < 0) {
>                resourceVersion.compareAndSet(currentResourceVersion, 
> newResourceVersion);
>              }
> @@ -244,6 +247,7 @@ public class WatchConnectionManager<T extends 
> HasMetadata, L extends KubernetesR
>              }
>            } else if (object instanceof Status) {
>              Status status = (Status) object;
> +            logger.info("status " + status.toString());
>  
>              // The resource version no longer exists - this has to be 
> handled by the caller.
>              if (status.getCode() == HTTP_GONE) {
> {code}
> This clearly prints the following root error. The issue lies in 
> kubernetes-client: it simply propagates the "too old resource version" 
> exception 
> (https://github.com/fabric8io/kubernetes-client/blob/5b1a57b64c7dcc7ebeba3a7024e8615c91afaedb/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java#L259-L266)
> to the caller, and the Spark client does not handle it.
> We configured etcd to compact every hour, so any job that runs for more than 
> an hour may hit this issue.
> {code}
> 18/05/13 13:32:48 INFO WatchConnectionManager: Current reconnect backoff is 
> 32000 milliseconds (T5)
> 18/05/13 13:33:20 INFO WatchConnectionManager: resourceVersion 21648111
> 18/05/13 13:33:20 INFO WatchConnectionManager: status Status(apiVersion=v1, 
> code=410, details=null, kind=Status, message=too old resource version: 
> 21648111 (21653211), metadata=ListMeta(resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Gone, status=Failure, 
> additionalProperties={})
> 18/05/13 13:33:20 INFO LoggingPodStatusWatcherImpl: Stopping watching 
> application spark-6bb7b853297a4892b9fb95a3a0ac0b3d with last-observed phase 
> Running, exception io.fabric8.kubernetes.client.KubernetesClientException: 
> too old resource version: 21648111 (21653211)
> 18/05/13 13:33:20 INFO LoggingPodStatusWatcherImpl: Container final statuses:
>        Container name: spark-kubernetes-driver
>        Container image: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>        Container state: Running
>        Container started at: 2018-05-13T05:21:01Z
> 18/05/13 13:33:20 INFO Client: Application spark-64-293-980 finished.
> 18/05/13 13:33:20 INFO ShutdownHookManager: Shutdown hook called
> 18/05/13 13:33:20 INFO ShutdownHookManager: Deleting directory 
> /tmp/uploaded-jars-9a02d858-73e9-476b-9063-2fcdd9b083a2
> 18/05/13 13:33:20 INFO ShutdownHookManager: Deleting directory 
> /tmp/uploaded-files-f82b5887-6edf-4707-8703-1ccae2114c6f
> {code}
> I worked around the issue by hacking kubernetes-client to watch without 
> specifying the resource version; I'm still thinking about a better solution.
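> Roughly, the difference at the fabric8 API level is shown in the sketch below 
> (the reopenWatch helper and its arguments are placeholders for illustration, 
> not part of the actual patch):
> {code}
> import io.fabric8.kubernetes.api.model.Pod
> import io.fabric8.kubernetes.client.{KubernetesClient, Watch, Watcher}
>
> object WatchWithoutResourceVersion {
>   // Re-open a pod watch from the pod's *current* state rather than resuming
>   // from a stored resourceVersion that etcd may already have compacted away.
>   def reopenWatch(client: KubernetesClient, podName: String, watcher: Watcher[Pod]): Watch = {
>     // Resuming from a stale version is what fails with 410 Gone:
>     //   client.pods().withName(podName).watch(staleResourceVersion, watcher)
>     // Omitting the resourceVersion starts the watch from "now":
>     client.pods().withName(podName).watch(watcher)
>   }
> }
> {code}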


