[ 
https://issues.apache.org/jira/browse/FLINK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156363#comment-17156363
 ] 

Steven Zhen Wu edited comment on FLINK-11143 at 7/14/20, 5:38 PM:
------------------------------------------------------------------

[~trohrmann] I am seeing a similar problem *when trying unaligned checkpoints 
with 1.11.0*. The Flink job actually started fine. We didn't see this 
AskTimeoutException during job submission without unaligned checkpoints 
(1.10 or 1.11).

Some more context about the app:
 * a large-state stream join app (a few TBs)
 * parallelism 1,440
 * number of containers: 180
 * Cores per container: 12
 * TM_TASK_SLOTS: 8
 * akka.ask.timeout: 120 s
 * heartbeat.timeout: 120000
 * web.timeout: 60000 (also tried larger values like 300,000 or 600,000 without 
any difference)
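
As a quick sanity check on the numbers above: 180 containers with 8 task slots each give exactly 1,440 slots, so a job at parallelism 1,440 would occupy every slot with none to spare. The sketch below only spells out that arithmetic. The class and method names are hypothetical (not part of Flink), and it assumes TM_TASK_SLOTS is the slot count per container and that default slot sharing applies, so the job needs one slot per parallel instance.

```java
// Hypothetical helper, not part of Flink: sanity-checks the slot numbers listed above.
final class SlotCapacityCheck {

    // Total task slots offered by the cluster.
    static int totalSlots(int containers, int slotsPerContainer) {
        return containers * slotsPerContainer;
    }

    // Slots left over once every parallel instance has one slot.
    // Assumes default slot sharing, i.e. the job needs `parallelism` slots.
    static int spareSlots(int containers, int slotsPerContainer, int parallelism) {
        return totalSlots(containers, slotsPerContainer) - parallelism;
    }

    public static void main(String[] args) {
        int containers = 180;       // "number of containers"
        int slotsPerContainer = 8;  // "TM_TASK_SLOTS" (assumed to be per container)
        int parallelism = 1440;     // "parallelism"

        System.out.println("total slots: " + totalSlots(containers, slotsPerContainer));
        System.out.println("spare slots: "
                + spareSlots(containers, slotsPerContainer, parallelism));
    }
}
```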

I will send you the log files (with DEBUG level) in an email offline. Thanks a 
lot for your help in advance!
{code:java}
"errors":["Internal server error.","<Exception on server side:
org.apache.flink.util.FlinkRuntimeException: Could not execute application.
    at org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)
    at org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)
    at org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:99)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Failed to execute job 'my-job-alt'.
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)
    at org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:78)
    ... 10 more
Caused by: org.apache.flink.util.FlinkException: Failed to execute job 'my-job-alt'.
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1823)
    at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)
    at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1699)
    at com.foo.bar.application.SpaasBaseApplication.execute(SpaasBaseApplication.java:54)
    at com.foo.bar.paa.streaming.impressions.ImpressionsJobMain$.main(ImpressionsJobMain.scala:12)
    at com.foo.bar.paa.streaming.impressions.ImpressionsJobMain.main(ImpressionsJobMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)
    ... 13 more
Caused by: java.util.concurrent.TimeoutException: Invocation of public abstract java.util.concurrent.CompletableFuture org.apache.flink.runtime.dispatcher.DispatcherGateway.submitJob(org.apache.flink.runtime.jobgraph.JobGraph,org.apache.flink.api.common.time.Time) timed out.
    at com.sun.proxy.$Proxy113.submitJob(Unknown Source)
    at org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.lambda$submitJob$4(EmbeddedExecutor.java:158)
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
    at org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitJob(EmbeddedExecutor.java:158)
    at org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitAndGetJobClientFuture(EmbeddedExecutor.java:119)
    at org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.execute(EmbeddedExecutor.java:98)
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1812)
    ... 24 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/rpc/dispatcher_1#-283770831]] after [60000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
    at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
    at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
    ... 1 more
{code}
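
The REST response carries this trace as a single JSON string in which line breaks and tabs are escaped as `\n` and `\t`, so the nested "Caused by" chain is hard to read in raw form. Below is a minimal sketch for pulling that chain out of such a string. The class and method names are hypothetical, the sample payload is a shortened excerpt of the trace above, and the sketch assumes the escapes appear as a single backslash followed by `n` or `t`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink: makes an escaped server-side trace readable.
final class EscapedTraceReader {

    // Replace the literal two-character sequences \n and \t with real newline and tab.
    static String unescape(String payload) {
        return payload.replace("\\n", "\n").replace("\\t", "\t");
    }

    // Collect every "Caused by: ..." line, outermost first.
    static List<String> causes(String payload) {
        List<String> result = new ArrayList<>();
        for (String line : unescape(payload).split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("Caused by: ")) {
                result.add(trimmed.substring("Caused by: ".length()));
            }
        }
        return result;
    }

    // The innermost cause, or an empty string if the payload has no "Caused by" line.
    static String rootCause(String payload) {
        List<String> all = causes(payload);
        return all.isEmpty() ? "" : all.get(all.size() - 1);
    }

    public static void main(String[] args) {
        // Shortened excerpt of the payload; "\\n" in Java source is a backslash followed by 'n'.
        String payload =
                "org.apache.flink.util.FlinkRuntimeException: Could not execute application.\\n"
              + "\\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner"
              + ".run(DetachedApplicationRunner.java:67)\\n"
              + "Caused by: org.apache.flink.util.FlinkException: Failed to execute job 'my-job-alt'.\\n"
              + "\\tat org.apache.flink.client.program.StreamContextEnvironment"
              + ".execute(StreamContextEnvironment.java:76)\\n"
              + "Caused by: akka.pattern.AskTimeoutException: Ask timed out on "
              + "[Actor[akka://flink/user/rpc/dispatcher_1#-283770831]] after [60000 ms].";

        System.out.println(unescape(payload));
        System.out.println("root cause: " + rootCause(payload));
    }
}
```

Reading the innermost cause first tends to be the quickest way into a trace like this one, since the outer exceptions only wrap the Akka ask timeout.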
 



> AskTimeoutException is thrown during job submission and completion
> ------------------------------------------------------------------
>
>                 Key: FLINK-11143
>                 URL: https://issues.apache.org/jira/browse/FLINK-11143
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.2, 1.10.0
>            Reporter: Alex Vinnik
>            Priority: Critical
>         Attachments: flink-job-timeline.PNG
>
>
> For more details please see the thread
> [http://mail-archives.apache.org/mod_mbox/flink-user/201812.mbox/%3cc2fb26f9-1410-4333-80f4-34807481b...@gmail.com%3E]
> On submission 
> 2018-12-12 02:28:31 ERROR JobsOverviewHandler:92 - Implementation error: 
> Unhandled exception.
>  akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka://flink/user/dispatcher#225683351]] after [10000 ms]. 
> Sender[null] sent message of type 
> "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>  at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>  at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>  at java.lang.Thread.run(Thread.java:748)
>  
> On completion
>  
> {"errors":["Internal server error.","<Exception on server 
> side:\njava.util.concurrent.CompletionException: 
> akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms]. 
> Sender[null] sent message of type 
> \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772)
> at akka.dispatch.OnComplete.internal(Future.scala:258)
> at akka.dispatch.OnComplete.internal(Future.scala:256)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> at java.lang.Thread.run(Thread.java:748)\nCaused by: 
> akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms]. 
> Sender[null] sent message of type 
> \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".
> at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)\n\t...
>  9 more\n\nEnd of exception on server side>"]}



