[ https://issues.apache.org/jira/browse/FLINK-10193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588512#comment-16588512 ]
Ufuk Celebi commented on FLINK-10193:
-------------------------------------

[~gjy] I've changed the ticket type from {{Improvement}} to {{Bug}}: savepoints that take longer than the default ask timeout are reported as {{COMPLETED}} with a {{failure-cause}}, although the actual savepoint completes successfully:

{code}
{
  "status": {
    "id": "COMPLETED"
  },
  "operation": {
    "failure-cause": {
      "class": "java.util.concurrent.CompletionException",
      "stack-trace": "java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager_0#42163687]] after [30000 ms]. Sender[null] sent message of type \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".\n\tat java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)\n\tat java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)\n\tat java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)\n\tat java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)\n\tat java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)\n\tat java.util.concurrent.CompletableFuture.completeExceptionally...<omitted for brevity>"
    }
  }
}
{code}

This means that we can't use the REST API to reliably trigger savepoints. I've verified this with a small program that blocks during checkpoints for a configurable amount of time.

The only workaround I know of is to increase {{akka.ask.timeout}}, although I would not recommend this because it affects other things as well. Note that increasing {{web.timeout}} has no effect on this.

> Default RPC timeout is used when triggering savepoint via JobMasterGateway
> --------------------------------------------------------------------------
>
>                 Key: FLINK-10193
>                 URL: https://issues.apache.org/jira/browse/FLINK-10193
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>   Affects Versions: 1.5.3, 1.6.0
>            Reporter: Gary Yao
>            Assignee: Gary Yao
>            Priority: Critical
>
> When calling {{JobMasterGateway#triggerSavepoint(String, boolean, Time)}},
> the default timeout is used because the time parameter of the method is not
> annotated with {{@RpcTimeout}}.
> *Expected behavior*
> * timeout for the RPC should be {{RpcUtils.INF_TIMEOUT}}
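
To illustrate the mechanism described in the issue, here is a minimal, self-contained Java sketch of how an RPC layer might choose a per-call timeout from an annotated parameter and fall back to a default when the annotation is missing. It is a simplified model, not Flink's actual implementation: the {{RpcTimeout}} annotation defined here, the {{Gateway}} interface, both method names, {{DEFAULT_TIMEOUT}} and the use of {{java.time.Duration}} in place of Flink's {{Time}} are all hypothetical stand-ins.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.lang.reflect.Proxy;
import java.time.Duration;

public class RpcTimeoutSketch {

    /** Hypothetical stand-in for a marker annotation such as Flink's @RpcTimeout. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.PARAMETER)
    public @interface RpcTimeout {}

    /** Assumed default, chosen to mirror the 30000 ms in the stack trace above. */
    public static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(30);

    /** Hypothetical gateway: each method returns the timeout the RPC layer would apply. */
    public interface Gateway {
        // The timeout parameter is NOT annotated, so the RPC layer cannot identify it.
        Duration triggerWithoutAnnotation(String targetDirectory, boolean cancelJob, Duration timeout);

        // The timeout parameter is annotated, so the caller's value is honoured.
        Duration triggerWithAnnotation(String targetDirectory, boolean cancelJob, @RpcTimeout Duration timeout);
    }

    /** Returns the value of the first @RpcTimeout parameter, or the default if none is annotated. */
    static Duration resolveTimeout(Method method, Object[] args) {
        Parameter[] parameters = method.getParameters();
        for (int i = 0; i < parameters.length; i++) {
            if (parameters[i].isAnnotationPresent(RpcTimeout.class) && args[i] instanceof Duration) {
                return (Duration) args[i];
            }
        }
        return DEFAULT_TIMEOUT;
    }

    /** Builds a dynamic proxy that reports the resolved timeout instead of sending a real RPC. */
    public static Gateway newGateway() {
        InvocationHandler handler = (proxy, method, args) -> resolveTimeout(method, args);
        return (Gateway) Proxy.newProxyInstance(
                Gateway.class.getClassLoader(), new Class<?>[] {Gateway.class}, handler);
    }

    public static void main(String[] args) {
        Gateway gateway = newGateway();
        Duration requested = Duration.ofMinutes(10);
        System.out.println("without annotation: " + gateway.triggerWithoutAnnotation("dir", false, requested));
        System.out.println("with annotation:    " + gateway.triggerWithAnnotation("dir", false, requested));
    }
}
```

In this model the caller passes a ten-minute timeout to both methods, but only the annotated variant actually uses it; the other one silently falls back to the default. That matches the behavior reported above, where a savepoint running longer than the default ask timeout produces an {{AskTimeoutException}}.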