[
https://issues.apache.org/jira/browse/FLINK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156363#comment-17156363
]
Steven Zhen Wu edited comment on FLINK-11143 at 7/12/20, 6:39 PM:
------------------------------------------------------------------
[~trohrmann] I am seeing a similar problem *when trying unaligned checkpoints
with 1.11.0*. The Flink job actually started fine. We didn't see this
AskTimeoutException during job submission without unaligned checkpoints
(on either 1.10 or 1.11).
Some more context about the app:
* a large-state stream join app (a few TBs of state)
* parallelism: 1,440
* number of containers: 180
* cores per container: 12
* TM_TASK_SLOTS: 8
* akka.ask.timeout: 120 s
* heartbeat.timeout: 120000 (ms)
* web.timeout: 60000 (ms; also tried larger values like 300,000 or 600,000
without any difference)
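As a sanity check on the numbers above, here is a minimal sketch in plain Java. It is not Flink API; the class and method names are hypothetical and the values are simply the ones listed. It confirms that 180 containers with 8 slots each give exactly the 1,440 slots the parallelism needs, and that the 60000 ms reported in the trace coincides with the listed web.timeout value rather than with the 120 s akka.ask.timeout. That coincidence alone does not show which option governs the submitJob call.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink: plain arithmetic on the listed settings.
public class SubmissionSettingsCheck {

    // Total task slots offered by the cluster.
    static int totalSlots(int containers, int slotsPerContainer) {
        return containers * slotsPerContainer;
    }

    // Names of the configured timeouts whose value equals the observed one.
    static List<String> matchingTimeouts(Map<String, Duration> configured, Duration observed) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Duration> entry : configured.entrySet()) {
            if (entry.getValue().equals(observed)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    // Timeout settings exactly as listed in the comment.
    static Map<String, Duration> listedTimeouts() {
        Map<String, Duration> timeouts = new LinkedHashMap<>();
        timeouts.put("akka.ask.timeout", Duration.ofSeconds(120));
        timeouts.put("heartbeat.timeout", Duration.ofMillis(120000));
        timeouts.put("web.timeout", Duration.ofMillis(60000));
        return timeouts;
    }

    public static void main(String[] args) {
        System.out.println("total slots: " + totalSlots(180, 8));
        // 60000 ms is the value reported by the AskTimeoutException in the trace.
        System.out.println("timeouts equal to 60000 ms: "
                + matchingTimeouts(listedTimeouts(), Duration.ofMillis(60000)));
    }
}
```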
I will send you the DEBUG-level log files by email offline. Thanks a lot in
advance for your help!
{code:java}
\"errors\":[\"Internal server error.\",\"<Exception on server
side:\\norg.apache.flink.util.FlinkRuntimeException: Could not execute
application.\\n\\tat
org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)\\n\\tat
org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)\\n\\tat
org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:99)\\n\\tat
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\\n\\tat
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\\n\\tat
java.util.concurrent.FutureTask.run(FutureTask.java:266)\\n\\tat
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\\n\\tat
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\\n\\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat
java.lang.Thread.run(Thread.java:748)\\nCaused by:
org.apache.flink.client.program.ProgramInvocationException: The main method
caused an error: Failed to execute job
'personalization-streaming-impressions-alt'.\\n\\tat
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)\\n\\tat
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)\\n\\tat
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)\\n\\tat
org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:78)\\n\\t...
10 more\\nCaused by: org.apache.flink.util.FlinkException: Failed to execute
job 'personalization-streaming-impressions-alt'.\\n\\tat
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1823)\\n\\tat
org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)\\n\\tat
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)\\n\\tat
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1699)\\n\\tat
com.netflix.spaas.application.SpaasBaseApplication.execute(SpaasBaseApplication.java:54)\\n\\tat
com.netflix.dea.paa.streaming.impressions.ImpressionsJobMain$.main(ImpressionsJobMain.scala:12)\\n\\tat
com.netflix.dea.paa.streaming.impressions.ImpressionsJobMain.main(ImpressionsJobMain.scala)\\n\\tat
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\\n\\tat
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\\n\\tat
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\\n\\tat
java.lang.reflect.Method.invoke(Method.java:498)\\n\\tat
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)\\n\\t...
13 more\\nCaused by: java.util.concurrent.TimeoutException: Invocation of
public abstract java.util.concurrent.CompletableFuture
org.apache.flink.runtime.dispatcher.DispatcherGateway.submitJob(org.apache.flink.runtime.jobgraph.JobGraph,org.apache.flink.api.common.time.Time)
timed out.\\n\\tat com.sun.proxy.$Proxy113.submitJob(Unknown Source)\\n\\tat
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.lambda$submitJob$4(EmbeddedExecutor.java:158)\\n\\tat
java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)\\n\\tat
java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)\\n\\tat
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitJob(EmbeddedExecutor.java:158)\\n\\tat
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitAndGetJobClientFuture(EmbeddedExecutor.java:119)\\n\\tat
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.execute(EmbeddedExecutor.java:98)\\n\\tat
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1812)\\n\\t...
24 more\\nCaused by: akka.pattern.AskTimeoutException: Ask timed out on
[Actor[akka://flink/user/rpc/dispatcher_1#-283770831]] after [60000 ms].
Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A
typical reason for `AskTimeoutException` is that the recipient actor didn't
send a reply.\\n\\tat
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)\\n\\tat
akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)\\n\\tat
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)\\n\\tat
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)\\n\\tat
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)\\n\\tat
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\\n\\tat
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\\n\\tat
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\\n\\tat
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\\n\\t...
1 more\\n\
{code}
> AskTimeoutException is thrown during job submission and completion
> ------------------------------------------------------------------
>
> Key: FLINK-11143
> URL: https://issues.apache.org/jira/browse/FLINK-11143
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.6.2, 1.10.0
> Reporter: Alex Vinnik
> Priority: Critical
> Attachments: flink-job-timeline.PNG
>
>
> For more details please see the thread
> [http://mail-archives.apache.org/mod_mbox/flink-user/201812.mbox/%[email protected]%3E]
> On submission
> 2018-12-12 02:28:31 ERROR JobsOverviewHandler:92 - Implementation error:
> Unhandled exception.
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/dispatcher#225683351]] after [10000 ms].
> Sender[null] sent message of type
> "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> at java.lang.Thread.run(Thread.java:748)
>
> On completion
>
> {"errors":["Internal server error.","<Exception on server
> side:\njava.util.concurrent.CompletionException:
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms].
> Sender[null] sent message of type
> \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".
> at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> at
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772)
> at akka.dispatch.OnComplete.internal(Future.scala:258)
> at akka.dispatch.OnComplete.internal(Future.scala:256)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> at
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> at java.lang.Thread.run(Thread.java:748)\nCaused by:
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms].
> Sender[null] sent message of type
> \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".
> at
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)\n\t...
> 9 more\n\nEnd of exception on server side>"]}