[jira] [Commented] (FLINK-12385) RestClusterClient can hang indefinitely during job submission

Matt Dailey (JIRA) Thu, 20 Jun 2019 04:56:46 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-12385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868467#comment-16868467
 ]


Matt Dailey commented on FLINK-12385:
-------------------------------------

I was not able to get jobmanager debug logs for when the problem occurred, but 
I think we did find what caused it in our environment.

We were rolling out Istio on Kubernetes, and our best guess is that the client 
hung while communicating with ZooKeeper. We had accidentally defined two 
Kubernetes services for ZooKeeper, which Istio did not handle well. We had seen 
similar problems where clients would hang when connecting to services defined 
that way.

And that's right, this was in detached mode.

And thanks for the explanation. I think you're right: the underlying connection 
should hit its timeout and retry limits and exit from the future, so adding 
a timeout to the future is probably not the right solution.
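
For reference, the option from the issue description (a timeout on the blocking 
get) would look roughly like the sketch below. This is a standalone 
illustration, not Flink code: the class name, the helper method, the 200 ms 
timeout, and the never-completed future standing in for a hung submission are 
all assumptions made for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; nothing here is part of Flink.
public class BoundedGetSketch {

    // Waits at most timeoutMillis for the future; returns fallback on timeout.
    public static String getOrFallback(
            CompletableFuture<String> future, long timeoutMillis, String fallback)
            throws InterruptedException, ExecutionException {
        try {
            // Unlike get(), this overload throws TimeoutException instead of
            // parking the calling thread forever.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that nobody completes, standing in for a hung submission.
        CompletableFuture<String> hung = new CompletableFuture<>();
        System.out.println(getOrFallback(hung, 200, "timed out")); // prints "timed out"

        // An already completed future returns its value immediately.
        CompletableFuture<String> done = CompletableFuture.completedFuture("submitted");
        System.out.println(getOrFallback(done, 200, "timed out")); // prints "submitted"
    }
}
```

Note that a timed get only bounds how long the caller waits. It does not cancel 
or otherwise clean up whatever is supposed to complete the future, which is 
part of why fixing the timeout on the underlying connection seems like the 
better place to address this.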

> RestClusterClient can hang indefinitely during job submission
> -------------------------------------------------------------
>
>                 Key: FLINK-12385
>                 URL: https://issues.apache.org/jira/browse/FLINK-12385
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / REST
>    Affects Versions: 1.8.0
>            Reporter: Matt Dailey
>            Priority: Minor
>
> We have had situations where clients would hang indefinitely during job 
> submission, even when job submission would succeed. We have not yet 
> characterized what happened on the server to cause this, but we thought that 
> the client should have a timeout for these requests.
> This was observed in Flink 1.5.5, but the code seems to still have this 
> problem in 1.8.0. One option is to include a timeout in calls to 
> {{CompletableFuture.get()}}:
>  * [RestClusterClient in 
> 1.5.5|https://github.com/apache/flink/blob/release-1.5.5/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L246]
>  * [RestClusterClient in 
> 1.8.0|https://github.com/apache/flink/blob/release-1.8.0/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L247]
> Thread dump from client running Flink 1.5.5, running in Java 8:
> {noformat}
> "http-nio-0.0.0.0-8443-exec-6" #34 daemon prio=5 os_prio=0 tid=0x000055b421fd2000 nid=0x29 waiting on condition [0x00007f932e176000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000000b331d7c0> (a java.util.concurrent.CompletableFuture$Signaller)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>       at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>       at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>       at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>       at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:246)
>       at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
>       at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
>       at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:410)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
