StephanEwen opened a new pull request #11048: Ask timeouts
URL: https://github.com/apache/flink/pull/11048
 
 
   ## What is the purpose of the change
   
   This change preserves the call stack from the invocation of an RPC ask() 
calls. When the call fails, this allows us to add a meaningful exception 
pointing to the original call site, rather than having just some weird stack 
trace from akka.
   
   *Note:* While it is possible to do this for all types of ask() failures, 
this PR uses this approach for now only for timeouts.
   
   Former exception message and stack trace:
   ```
   akka.pattern.AskTimeoutException: Ask timed out on 
[Actor[akka://flink/user/test_name#-585757974]] after [1 ms]. Message of type 
[org.apache.flink.runtime.rpc.messages.LocalRpcInvocation]. A typical reason 
for `AskTimeoutException` is that the recipient actor didn't send a reply.
        at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
        at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
        at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
        at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
        at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
        at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
        at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
        at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
        at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
        at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
        at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
        at java.base/java.lang.Thread.run(Thread.java:834)
   ```
   New exceptions and stack trace:
   ```
   java.util.concurrent.TimeoutException: Invocation of 
java.util.concurrent.CompletableFuture 
org.apache.flink.runtime.rpc.akka.TimeoutCallStackTest$TestingGateway.callThatTimesOut(org.apache.flink.api.common.time.Time)
 timed out.
        at 
org.apache.flink.runtime.rpc.akka.TimeoutCallStackTest.testTimeoutException(TimeoutCallStackTest.java:87)
        at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
        at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
        ...
   ```
   
   ## Brief change log
   
     - Add a new config option `akka.ask.callstack` (default: true)
     - Capture call stack at call site
     - Replace exception on timeout
   
   ## Verifying this change
   
   You can verify this with any Flink deployment where something times out, for 
example deploy/cancel calls when a TaskManager is killed
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): **no**
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
     - The serializers: **no**
     - The runtime per-record code paths (performance sensitive): **no**
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: **no**
     - The S3 file system connector: **no**
   
   ## Documentation
   
     - Does this pull request introduce a new feature? **yes**
     - If yes, how is the feature documented? **config docs**
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to