[
https://issues.apache.org/jira/browse/FLINK-24113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459414#comment-17459414
]
Robert Metzger commented on FLINK-24113:
----------------------------------------
Thanks a lot for addressing this feature request [~chesnay] and [~Nicolaus
Weidner].
While using it, I observed that the cluster shutdown sometimes gets stuck when
it is triggered through the REST API. When the shutdown is initiated by a job
cancellation (in Application Mode), I haven't observed this issue so far.
Here's where I believe the shutdown gets stuck:
{code}
"AkkaRpcService-Supervisor-Termination-Future-Executor-thread-1" #89 daemon prio=5 os_prio=0 tid=0x0000004017d70000 nid=0x2ec in Object.wait() [0x000000402a9b5000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000d6c48368> (a org.apache.flink.runtime.blob.BlobServer)
    at java.lang.Thread.join(Thread.java:1252)
    - locked <0x00000000d6c48368> (a org.apache.flink.runtime.blob.BlobServer)
    at java.lang.Thread.join(Thread.java:1326)
    at org.apache.flink.runtime.blob.BlobServer.close(BlobServer.java:319)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.stopClusterServices(ClusterEntrypoint.java:406)
    - locked <0x00000000d5d27350> (a java.lang.Object)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$shutDownAsync$4(ClusterEntrypoint.java:505)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint$$Lambda$1113/1220951830.get(Unknown Source)
    at org.apache.flink.util.concurrent.FutureUtils.lambda$composeAfterwards$20(FutureUtils.java:728)
    at org.apache.flink.util.concurrent.FutureUtils$$Lambda$1083/1178655216.accept(Unknown Source)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
    at org.apache.flink.util.concurrent.FutureUtils.lambda$null$19(FutureUtils.java:739)
    at org.apache.flink.util.concurrent.FutureUtils$$Lambda$1088/1499303232.accept(Unknown Source)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
    at org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent.lambda$closeAsyncInternal$2(DispatcherResourceManagerComponent.java:198)
    at org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent$$Lambda$1133/525033897.accept(Unknown Source)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
    at org.apache.flink.util.concurrent.FutureUtils$CompletionConjunctFuture.completeFuture(FutureUtils.java:1000)
    - locked <0x00000000c14d6000> (a java.lang.Object)
    at org.apache.flink.util.concurrent.FutureUtils$CompletionConjunctFuture$$Lambda$544/1791014677.accept(Unknown Source)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
    at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1389)
    at org.apache.flink.util.concurrent.FutureUtils.lambda$forwardTo$24(FutureUtils.java:1372)
    at org.apache.flink.util.concurrent.FutureUtils$$Lambda$599/1004862656.accept(Unknown Source)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
    at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1389)
    at org.apache.flink.util.concurrent.FutureUtils.lambda$forwardTo$24(FutureUtils.java:1372)
    at org.apache.flink.util.concurrent.FutureUtils$$Lambda$599/1004862656.accept(Unknown Source)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils$$Lambda$589/953925250.run(Unknown Source)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$withContextClassLoader$0(ClassLoadingUtils.java:41)
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils$$Lambda$585/1952194564.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}
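To illustrate the pattern visible in the trace (a close() call that blocks in
an unbounded Thread.join() on a thread that never terminates), here is a
minimal, self-contained sketch. ToyServer, closeWithTimeout and release are
hypothetical names made up for this illustration; this is not Flink's
BlobServer code, and it does not claim to reproduce the actual root cause. It
only shows how a thread that swallows interrupts keeps a joiner waiting, and
how a bounded join at least makes that condition observable.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for a server thread; NOT Flink's BlobServer. */
public class ToyServer extends Thread {
    // Opened only by release(); simulates work that does not react to interrupts.
    private final CountDownLatch stuck = new CountDownLatch(1);

    public ToyServer() {
        setDaemon(true); // do not keep the JVM alive if the thread never ends
    }

    @Override
    public void run() {
        boolean interrupted = false;
        while (true) {
            try {
                stuck.await();
                break;
            } catch (InterruptedException e) {
                interrupted = true; // swallow the interrupt and keep waiting
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // restore the flag before exiting
        }
    }

    /**
     * Interrupts the thread and waits at most timeoutMillis (must be > 0,
     * because join(0) waits forever). Returns true if the thread terminated.
     */
    public boolean closeWithTimeout(long timeoutMillis) throws InterruptedException {
        interrupt();
        join(timeoutMillis); // an unbounded join() would never return here
        return !isAlive();
    }

    /** Lets the simulated work finish so the thread can exit. */
    public void release() {
        stuck.countDown();
    }

    public static void main(String[] args) throws InterruptedException {
        ToyServer server = new ToyServer();
        server.start();
        System.out.println("terminated within timeout: " + server.closeWithTimeout(200));
        server.release();
        server.join();
        System.out.println("terminated after release: " + !server.isAlive());
    }
}
```

In this sketch the first call reports that the thread is still alive after the
timeout, which is the situation the unbounded join in the trace would wait on
indefinitely.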
I'll attach the full log and continue investigating. Once we've understood the
issue, I'm happy to create a separate ticket.
> Introduce option in Application Mode to disable shutdown
> --------------------------------------------------------
>
> Key: FLINK-24113
> URL: https://issues.apache.org/jira/browse/FLINK-24113
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Robert Metzger
> Assignee: Nicolaus Weidner
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: shutdown_issue.log
>
>
> Currently a Flink JobManager started in Application Mode will shut down once
> the job has completed.
> When doing a "stop with savepoint" operation, we want to keep the JobManager
> alive after the job has stopped to retrieve and persist the final savepoint
> location.
> Currently, Flink waits up to 5 minutes and then shuts down.
> I'm proposing to introduce a new configuration flag "application mode
> shutdown behavior": "keepalive" (naming things is hard ;) ) which will keep
> the JobManager in Application Mode running until a REST call confirms that it
> can shut down.