[ 
https://issues.apache.org/jira/browse/FLINK-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036957#comment-17036957
 ] 

Stephan Ewen commented on FLINK-16018:
--------------------------------------

I don't think this is a misalignment of timeouts. When an Akka ask timeout 
propagates to the user, it means we are not handling failures correctly and 
are simply reporting back whatever happened in the RPC system.

In this specific issue, we also have the problem of blocking/synchronous 
ExecutionGraph creation. Because some amount of synchronous initialization will 
most likely always be there, we need a better way to respond to the client 
while it runs, perhaps an additional JobStatus (such as INITIALIZING).
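
To make the idea concrete, here is a minimal sketch in plain Java. It is not 
Flink code: SubmissionSketch, Status, JobInfo, submit and info are all 
hypothetical names, and the blocking ExecutionGraph creation is stood in for by 
an arbitrary Runnable. It assumes the submission call can acknowledge 
immediately with an INITIALIZING status, run the blocking work on a separate 
executor, and record the real root cause if that work fails.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: none of these types are Flink classes.
final class SubmissionSketch {

    enum Status { INITIALIZING, RUNNING, FAILED }

    /** Job status plus the root cause when initialization failed. */
    static final class JobInfo {
        final Status status;
        final Throwable failureCause; // null unless status == FAILED

        JobInfo(Status status, Throwable failureCause) {
            this.status = status;
            this.failureCause = failureCause;
        }
    }

    private final Map<String, JobInfo> jobs = new ConcurrentHashMap<>();

    // Daemon threads, so the sketch never keeps the JVM alive.
    private final ExecutorService initExecutor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "job-init");
        t.setDaemon(true);
        return t;
    });

    /**
     * Records the job as INITIALIZING and returns without waiting for
     * blockingInit. The returned future completes (never exceptionally)
     * once the final status has been recorded.
     */
    CompletableFuture<Void> submit(String jobId, Runnable blockingInit) {
        jobs.put(jobId, new JobInfo(Status.INITIALIZING, null));
        return CompletableFuture.runAsync(blockingInit, initExecutor)
            .<Void>handle((ignored, error) -> {
                jobs.put(jobId, error == null
                    ? new JobInfo(Status.RUNNING, null)
                    : new JobInfo(Status.FAILED, unwrap(error)));
                return null;
            });
    }

    /** Status lookup that a client could poll instead of blocking on an ask. */
    JobInfo info(String jobId) {
        return jobs.get(jobId);
    }

    // runAsync wraps the thrown exception in a CompletionException; report the cause.
    private static Throwable unwrap(Throwable error) {
        return (error instanceof CompletionException && error.getCause() != null)
            ? error.getCause()
            : error;
    }

    public static void main(String[] args) {
        SubmissionSketch dispatcher = new SubmissionSketch();
        dispatcher.submit("wordcount", () -> {
            throw new IllegalStateException("S3 endpoint unreachable");
        }).join();
        JobInfo info = dispatcher.info("wordcount");
        System.out.println(info.status + ": " + info.failureCause.getMessage());
    }
}
```

With this shape, the client would see the underlying failure message rather 
than a timeout from the RPC layer, and a slow initialization shows up as 
INITIALIZING instead of an unanswered request.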

> Improve error reporting when submitting batch job (instead of 
> AskTimeoutException)
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-16018
>                 URL: https://issues.apache.org/jira/browse/FLINK-16018
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.0
>            Reporter: Robert Metzger
>            Priority: Major
>
> While debugging the {{Shaded Hadoop S3A end-to-end test (minio)}} pre-commit 
> test, I noticed that the JobSubmission is not producing very helpful error 
> messages.
> Environment:
> - A simple batch wordcount job
> - An unavailable MinIO S3 filesystem service
> What happens from a user's perspective:
> - The job submission fails after 10 seconds with an AskTimeoutException:
> {code}
> 2020-02-07T11:38:27.1189393Z akka.pattern.AskTimeoutException: Ask timed out 
> on [Actor[akka://flink/user/dispatcher#-939201095]] after [10000 ms]. Message 
> of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical 
> reason for `AskTimeoutException` is that the recipient actor didn't send a 
> reply.
> 2020-02-07T11:38:27.1189538Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189616Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-02-07T11:38:27.1189713Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-02-07T11:38:27.1189789Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-02-07T11:38:27.1189883Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-02-07T11:38:27.1189973Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-02-07T11:38:27.1190067Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-02-07T11:38:27.1190159Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-02-07T11:38:27.1190267Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-02-07T11:38:27.1190358Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-02-07T11:38:27.1190465Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-02-07T11:38:27.1190540Z  at java.lang.Thread.run(Thread.java:748)
> {code}
> What a user would expect:
> - An error message indicating why the job submission failed.



