[
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070839#comment-17070839
]
Till Rohrmann commented on FLINK-16018:
---------------------------------------
As stated before, the underlying problem is that we wait for the {{JobManager}}
creation before acknowledging the job submission. Hence, the proper fix would
be to make the job submission "non-blocking" (technically, it is already
non-blocking, but the response is held back until the {{JobManager}} has been
created).
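To illustrate the difference, here is a minimal sketch that uses only {{java.util.concurrent}}. It is not Flink code: the class {{SubmissionSketch}} and all of its methods are hypothetical stand-ins, and the sleep merely simulates a slow {{JobManager}} creation (for example, an unreachable filesystem).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SubmissionSketch {

    // Hypothetical stand-in for a slow JobManager creation.
    static String createJobManager(long creationMillis) {
        try {
            Thread.sleep(creationMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "JobManager ready";
    }

    // Current behaviour (simplified): the acknowledgement only completes
    // once the JobManager has been created.
    static CompletableFuture<String> submitWithBlockedAck(long creationMillis) {
        return CompletableFuture
                .supplyAsync(() -> createJobManager(creationMillis))
                .thenApply(jobManager -> "ACK");
    }

    // Proposed behaviour (simplified): acknowledge right away and let the
    // JobManager creation continue in the background.
    static CompletableFuture<String> submitWithImmediateAck(long creationMillis) {
        CompletableFuture.supplyAsync(() -> createJobManager(creationMillis));
        return CompletableFuture.completedFuture("ACK");
    }

    // Returns true if the acknowledgement does not arrive within the timeout,
    // which is the situation that surfaces as an AskTimeoutException.
    static boolean timesOut(CompletableFuture<String> ack, long timeoutMillis) {
        try {
            ack.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Creation takes 500 ms, the client only waits 100 ms.
        System.out.println(timesOut(submitWithBlockedAck(500), 100));
        System.out.println(timesOut(submitWithImmediateAck(500), 100));
    }
}
```

In this sketch the first call times out because the acknowledgement is tied to the slow creation step, while the second one is acknowledged immediately. A real implementation would additionally need a way to report a failed creation to the client afterwards.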
Since this is a larger effort, I suggest the following:
* As part of this issue and as a quick fix, we increase {{web.timeout}} to 10
minutes.
* I'll create a follow-up issue to make the job submission non-blocking, which
we will try to fix as soon as possible.
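Until the default is changed, users who hit this problem could apply the same quick fix themselves. A sketch of the corresponding {{flink-conf.yaml}} entry, assuming the value is given in milliseconds (consistent with the {{10000 ms}} in the reported exception):

```yaml
# Assumed unit: milliseconds. 600000 ms = 10 minutes.
web.timeout: 600000
```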
> Improve error reporting when submitting batch job (instead of
> AskTimeoutException)
> ----------------------------------------------------------------------------------
>
> Key: FLINK-16018
> URL: https://issues.apache.org/jira/browse/FLINK-16018
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.9.2, 1.10.0
> Reporter: Robert Metzger
> Assignee: Till Rohrmann
> Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit
> test, I noticed that the JobSubmission is not producing very helpful error
> messages.
> Environment:
> - A simple batch wordcount job
> - an unavailable minio s3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with an AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [10000 ms]. Message
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical
> reason for `AskTimeoutException` is that the recipient actor didn't send a
> reply.
> 2020-02-07T11:38:27.1189538Z at
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z at
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z at
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z at
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z at
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z at
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z at
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z at
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z at
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z at
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z at
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.