[
https://issues.apache.org/jira/browse/FLINK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156363#comment-17156363
]
Steven Zhen Wu edited comment on FLINK-11143 at 7/14/20, 5:37 PM:
------------------------------------------------------------------
[~trohrmann] I am seeing a similar problem *when trying unaligned checkpoints
with 1.11.0*. The Flink job itself actually started fine. We did not see this
AskTimeoutException during job submission without unaligned checkpoints
(on either 1.10 or 1.11).
Some more context about the app:
* a large-state stream join app (a few TBs)
* parallelism: 1,440
* number of containers: 180
* cores per container: 12
* TM_TASK_SLOTS: 8
* akka.ask.timeout: 120 s
* heartbeat.timeout: 120000 (ms)
* web.timeout: 60000 (ms); I also tried larger values such as 300,000 and
600,000, without any difference
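For reference, the timeout settings above would look roughly like this as flink-conf.yaml entries. This is only a sketch: the three timeout values are the ones listed above, while the execution.checkpointing.unaligned line is an assumption about how unaligned checkpoints were switched on (they can also be enabled in the job code instead).

```yaml
# Timeouts as listed above
akka.ask.timeout: 120 s
heartbeat.timeout: 120000   # milliseconds
web.timeout: 60000          # milliseconds; 300000 and 600000 were also tried

# Assumption: unaligned checkpoints enabled through configuration (Flink 1.11)
execution.checkpointing.unaligned: true
```

Note that the AskTimeoutException in the trace below reports a timeout after 60000 ms, which is the same value as the web.timeout entry here.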
I will send you the log files (with DEBUG level) in an email offline. Thanks a
lot for your help in advance!
{code:java}
\"errors\":[\"Internal server error.\",\"<Exception on server
side:\\norg.apache.flink.util.FlinkRuntimeException: Could not execute
application.\\n\\tat
org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)\\n\\tat
org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)\\n\\tat
org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:99)\\n\\tat
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\\n\\tat
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\\n\\tat
java.util.concurrent.FutureTask.run(FutureTask.java:266)\\n\\tat
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\\n\\tat
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\\n\\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat
java.lang.Thread.run(Thread.java:748)\\nCaused by:
org.apache.flink.client.program.ProgramInvocationException: The main method
caused an error: Failed to execute job
'personalization-streaming-impressions-alt'.\\n\\tat
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)\\n\\tat
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)\\n\\tat
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)\\n\\tat
org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:78)\\n\\t...
10 more\\nCaused by: org.apache.flink.util.FlinkException: Failed to execute
job 'personalization-streaming-impressions-alt'.\\n\\tat
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1823)\\n\\tat
org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)\\n\\tat
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)\\n\\tat
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1699)\\n\\tat
com.foo.bar.application.SpaasBaseApplication.execute(SpaasBaseApplication.java:54)\\n\\tat
com.foo.bar.paa.streaming.impressions.ImpressionsJobMain$.main(ImpressionsJobMain.scala:12)\\n\\tat
com.foo.bar.paa.streaming.impressions.ImpressionsJobMain.main(ImpressionsJobMain.scala)\\n\\tat
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\\n\\tat
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\\n\\tat
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\\n\\tat
java.lang.reflect.Method.invoke(Method.java:498)\\n\\tat
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)\\n\\t...
13 more\\nCaused by: java.util.concurrent.TimeoutException: Invocation of
public abstract java.util.concurrent.CompletableFuture
org.apache.flink.runtime.dispatcher.DispatcherGateway.submitJob(org.apache.flink.runtime.jobgraph.JobGraph,org.apache.flink.api.common.time.Time)
timed out.\\n\\tat com.sun.proxy.$Proxy113.submitJob(Unknown Source)\\n\\tat
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.lambda$submitJob$4(EmbeddedExecutor.java:158)\\n\\tat
java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)\\n\\tat
java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)\\n\\tat
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitJob(EmbeddedExecutor.java:158)\\n\\tat
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitAndGetJobClientFuture(EmbeddedExecutor.java:119)\\n\\tat
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.execute(EmbeddedExecutor.java:98)\\n\\tat
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1812)\\n\\t...
24 more\\nCaused by: akka.pattern.AskTimeoutException: Ask timed out on
[Actor[akka://flink/user/rpc/dispatcher_1#-283770831]] after [60000 ms].
Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A
typical reason for `AskTimeoutException` is that the recipient actor didn't
send a reply.\\n\\tat
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)\\n\\tat
akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)\\n\\tat
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)\\n\\tat
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)\\n\\tat
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)\\n\\tat
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\\n\\tat
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\\n\\tat
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\\n\\tat
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\\n\\t...
1 more\\n\
{code}
> AskTimeoutException is thrown during job submission and completion
> ------------------------------------------------------------------
>
> Key: FLINK-11143
> URL: https://issues.apache.org/jira/browse/FLINK-11143
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.6.2, 1.10.0
> Reporter: Alex Vinnik
> Priority: Critical
> Attachments: flink-job-timeline.PNG
>
>
> For more details please see the thread
> [http://mail-archives.apache.org/mod_mbox/flink-user/201812.mbox/%[email protected]%3E]
> On submission
> 2018-12-12 02:28:31 ERROR JobsOverviewHandler:92 - Implementation error:
> Unhandled exception.
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/dispatcher#225683351]] after [10000 ms].
> Sender[null] sent message of type
> "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> at java.lang.Thread.run(Thread.java:748)
>
> On completion
>
> {"errors":["Internal server error.","<Exception on server
> side:\njava.util.concurrent.CompletionException:
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms].
> Sender[null] sent message of type
> \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".
> at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> at
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772)
> at akka.dispatch.OnComplete.internal(Future.scala:258)
> at akka.dispatch.OnComplete.internal(Future.scala:256)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> at
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> at java.lang.Thread.run(Thread.java:748)\nCaused by:
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms].
> Sender[null] sent message of type
> \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".
> at
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)\n\t...
> 9 more\n\nEnd of exception on server side>"]}