[
https://issues.apache.org/jira/browse/FLINK-12385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868467#comment-16868467
]
Matt Dailey commented on FLINK-12385:
-------------------------------------
I was not able to get jobmanager debug logs for when the problem occurred, but
I think we did find what caused it in our environment.
We were rolling out Istio on Kubernetes, and our best guess is that the client
hung while communicating with ZooKeeper. We had accidentally defined two
Kubernetes services for ZooKeeper, which Istio did not handle well, and we had
seen similar problems where clients would hang when connecting to services
defined that way.
And that's right, this was in detached mode.
And thanks for the explanation, I think you're right: the underlying connection
should hit its timeout and retry limits and exit from the future, so adding
a timeout to the future is probably not the right solution.
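For reference, the option floated in the issue description (bounding the wait
on the future) could look roughly like the sketch below. This is not code from
RestClusterClient: the class BoundedWait and the helper awaitWithTimeout are
hypothetical names, and the never-completing future only stands in for a
submission response that does not arrive. The only real API used is
CompletableFuture.get(long, TimeUnit).

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedWait {

    // Hypothetical helper: waits at most `timeout` for the future and returns
    // Optional.empty() on timeout instead of parking the calling thread forever.
    static <T> Optional<T> awaitWithTimeout(
            CompletableFuture<T> future, long timeout, TimeUnit unit)
            throws ExecutionException, InterruptedException {
        try {
            return Optional.ofNullable(future.get(timeout, unit));
        } catch (TimeoutException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a response that never arrives (assumption for illustration).
        CompletableFuture<String> never = new CompletableFuture<>();
        System.out.println(
                awaitWithTimeout(never, 100, TimeUnit.MILLISECONDS).isPresent());

        // A future that is already complete returns its value immediately.
        CompletableFuture<String> done = CompletableFuture.completedFuture("submitted");
        System.out.println(
                awaitWithTimeout(done, 100, TimeUnit.MILLISECONDS).orElse("timed out"));
    }
}
```

A bound like this only keeps the caller from blocking forever; it does not
explain why the underlying connection never failed on its own.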
> RestClusterClient can hang indefinitely during job submission
> -------------------------------------------------------------
>
> Key: FLINK-12385
> URL: https://issues.apache.org/jira/browse/FLINK-12385
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / REST
> Affects Versions: 1.8.0
> Reporter: Matt Dailey
> Priority: Minor
>
> We have had situations where clients would hang indefinitely during job
> submission, even when job submission would succeed. We have not yet
> characterized what happened on the server to cause this, but we thought that
> the client should have a timeout for these requests.
> This was observed in Flink 1.5.5, but the code seems to still have this
> problem in 1.8.0. One option is to include a timeout in calls to
> {{CompletableFuture.get()}}:
> * [RestClusterClient in
> 1.5.5|https://github.com/apache/flink/blob/release-1.5.5/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L246]
> * [RestClusterClient in
> 1.8.0|https://github.com/apache/flink/blob/release-1.8.0/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L247]
> Thread dump from client running Flink 1.5.5, running in Java 8:
> {noformat}
> "http-nio-0.0.0.0-8443-exec-6" #34 daemon prio=5 os_prio=0 tid=0x000055b421fd2000 nid=0x29 waiting on condition [0x00007f932e176000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000000b331d7c0> (a java.util.concurrent.CompletableFuture$Signaller)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>         at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>         at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>         at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>         at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:246)
>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
>         at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:410)
> {noformat}