StephanEwen opened a new pull request #11048: Ask timeouts URL: https://github.com/apache/flink/pull/11048 ## What is the purpose of the change This change preserves the call stack from the invocation of an RPC ask() calls. When the call fails, this allows us to add a meaningful exception pointing to the original call site, rather than having just some weird stack trace from akka. *Note:* While it is possible to do this for all types of ask() failures, this PR uses this approach for now only for timeouts. Former exception message and stack trace: ``` akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/test_name#-585757974]] after [1 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply. at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648) at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205) at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) at java.base/java.lang.Thread.run(Thread.java:834) ``` New exceptions and stack trace: ``` java.util.concurrent.TimeoutException: Invocation of java.util.concurrent.CompletableFuture org.apache.flink.runtime.rpc.akka.TimeoutCallStackTest$TestingGateway.callThatTimesOut(org.apache.flink.api.common.time.Time) timed out. at org.apache.flink.runtime.rpc.akka.TimeoutCallStackTest.testTimeoutException(TimeoutCallStackTest.java:87) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) ... ``` ## Brief change log - Add a new config option `akka.ask.callstack` (default: true) - Capture call stack at call site - Replace exception on timeout ## Verifying this change You can verify this with any Flink deployment where something times out, for example deploy/cancel calls when a TaskManager is killed ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): **no** - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no** - The serializers: **no** - The runtime per-record code paths (performance sensitive): **no** - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: **no** - The S3 file system connector: **no** ## Documentation - Does this pull request introduce a new feature? **yes** - If yes, how is the feature documented? **config docs**
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
