[
https://issues.apache.org/jira/browse/FLINK-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058137#comment-15058137
]
ASF GitHub Bot commented on FLINK-3134:
---------------------------------------
Github user mxm commented on a diff in the pull request:
https://github.com/apache/flink/pull/1450#discussion_r47642396
--- Diff: flink-yarn/src/main/scala/org/apache/flink/yarn/YarnJobManager.scala ---
@@ -754,6 +538,243 @@ class YarnJobManager(
memoryLimit
}
}
+
+ /**
+ * Heartbeats with the resource manager and handles container updates.
+ */
+ object AMRMClientAsyncHandler extends AMRMClientAsync.CallbackHandler {
+
+ private var client : AMRMClientAsync[ContainerRequest] = null
+
+ override def onError(e: Throwable): Unit = {
+ self ! decorateMessage(
+ StopYarnSession(
+ FinalApplicationStatus.FAILED,
+ "Error in communication with Yarn resource manager: " + e.getMessage)
+ )
+ }
+
+ override def getProgress: Float = {
+ runningContainers.toFloat / numTaskManagers
+ }
+
+ override def onShutdownRequest(): Unit = {
--- End diff --
I'm running into https://issues.apache.org/jira/browse/YARN-1842 in the
tests. This is a race condition between the application master (shutting down)
and the YARN client (killing the application and removing it from the cache).
It leads to the following exception when unregistering the resource manager
client:

org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException: Application doesn't exist in cache appattempt_1450184990589_0005_000001
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.throwApplicationDoesNotExistInCacheException(ApplicationMasterService.java:329)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:288)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.finishApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:75)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:97)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1958)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1956)

I think we have to catch this exception when it occurs while unregistering
the application master. For an explanation, see also
https://issues.apache.org/jira/browse/YARN-1842?focusedCommentId=13977886&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13977886
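
A minimal sketch of what such a catch could look like, under stated assumptions. To stay self-contained it uses a stand-in exception class and a hypothetical `unregisterQuietly` helper instead of the real Hadoop types. In the actual code the wrapped action would presumably be the client's `unregisterApplicationMaster` call, and the caught type would be `org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException`.

```scala
// Stand-in for org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException,
// declared here only so the sketch compiles without Hadoop on the classpath.
class InvalidApplicationMasterRequestException(msg: String) extends Exception(msg)

object UnregisterSketch {

  /**
   * Hypothetical helper: runs the given unregister action and tolerates the
   * "application doesn't exist in cache" race described in YARN-1842.
   * Any other exception propagates to the caller unchanged.
   *
   * @param unregister the action that unregisters the application master
   * @param log        sink for a diagnostic message (assumption: a plain function)
   * @return true if unregistration succeeded, false if the race was hit
   */
  def unregisterQuietly(unregister: () => Unit, log: String => Unit): Boolean = {
    try {
      unregister()
      true
    } catch {
      case e: InvalidApplicationMasterRequestException =>
        // The resource manager has already dropped the application,
        // so there is nothing left to unregister.
        log("Ignoring exception while unregistering: " + e.getMessage)
        false
    }
  }
}
```

Note that this tolerates every exception of that type during unregistration, not only the cache race, so logging the message is worthwhile in case it hides a different problem.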
> Make YarnJobManager's allocate call asynchronous
> ------------------------------------------------
>
> Key: FLINK-3134
> URL: https://issues.apache.org/jira/browse/FLINK-3134
> Project: Flink
> Issue Type: Bug
> Components: YARN Client
> Affects Versions: 0.10.0, 1.0.0, 0.10.1
> Reporter: Maximilian Michels
> Assignee: Maximilian Michels
> Fix For: 1.0.0
>
>
> The {{allocate()}} call is used in the {{YarnJobManager}} to send a heartbeat
> to the YARN resource manager. This call may block the JobManager actor system
> for an arbitrary amount of time, e.g. if retry handlers are set up within the
> call to {{allocate()}}.
> I propose to use the {{AMRMClientAsync}}'s callback methods to send
> heartbeats and update the container information. The API is available for our
> supported Hadoop versions (2.3.0 and above).
> https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/yarn/client/api/async/AMRMClientAsync.html