[
https://issues.apache.org/jira/browse/FLINK-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884257#comment-16884257
]
Xintong Song commented on FLINK-13241:
--------------------------------------
Turns out it's not a bytes-to-megabytes conversion issue.
We explicitly set the exact managed memory size in the configuration on the RM
side, to avoid the TM calculating managed memory from the fraction and the
uncertain JVM free memory. However, we were setting it on the wrong
configuration instance. In the YarnResourceManager constructor, we copied the
configuration instance because we were going to alter it, but we altered the
original configuration instead of the copy, which is the one used by the TM.
I've opened a PR to fix it; the same problem exists for Mesos.
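The defensive-copy mistake described above can be sketched in isolation. This is not Flink's actual Configuration API; SimpleConfig and the key name are illustrative stand-ins for the pattern of copying a config for the TM and then mutating the wrong instance.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for a key/value configuration with a copy constructor.
class SimpleConfig {
    private final Map<String, String> values = new HashMap<>();

    SimpleConfig() {}

    // Defensive copy: the copy is independent of the original.
    SimpleConfig(SimpleConfig other) {
        values.putAll(other.values);
    }

    void set(String key, String value) { values.put(key, value); }

    String get(String key) { return values.get(key); }
}

public class DefensiveCopyBug {
    public static void main(String[] args) {
        SimpleConfig original = new SimpleConfig();

        // Copy intended to be altered and handed to the TM.
        SimpleConfig tmConfig = new SimpleConfig(original);

        // BUG: the managed memory size is written to the original instance...
        original.set("taskmanager.memory.size", "1024m");

        // ...so the copy the TM actually reads never sees it.
        System.out.println(tmConfig.get("taskmanager.memory.size")); // null

        // Fix: mutate the copy that the TM uses.
        tmConfig.set("taskmanager.memory.size", "1024m");
        System.out.println(tmConfig.get("taskmanager.memory.size")); // 1024m
    }
}
```

Because the copy constructor snapshots the map at construction time, any later write to the original is invisible to the copy, which is exactly why the TM ended up without the explicit managed memory size.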
> YarnResourceManager does not handle slot allocations in certain cases
> ---------------------------------------------------------------------
>
> Key: FLINK-13241
> URL: https://issues.apache.org/jira/browse/FLINK-13241
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.9.0
> Reporter: Zhu Zhu
> Assignee: Xintong Song
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: 17_37_05__07_12_2019.jpg
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When a job allocates a few slots first and, after a period, allocates some
> other slots, the YarnResourceManager seems to receive but ignore the latter
> slot requests.
> To produce this issue, we can create a job with 2 vertices in different
> shared groups, as shown below:
> !17_37_05__07_12_2019.jpg|width=433,height=127!
> Slot allocation for the map2 vertex happens after the source vertex acquires
> its slots, so that map2's location can be decided to meet the input location
> constraints. The YarnResourceManager receives the slot requests for map2 but
> does not seem to handle them, and the job hangs waiting for resources.
> In my observation, this issue does not occur on Flink 1.9-SNAPSHOT
> (Rev: 3bc322a, Date: 26.06.2019 @ 17:28:51 CST), so it must have been
> introduced after that revision.
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)