[ https://issues.apache.org/jira/browse/FLINK-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883824#comment-16883824 ]
Xintong Song commented on FLINK-13241:
--------------------------------------

[~till.rohrmann], [~zhuzh]

Quick update on my findings so far.

I found that the resource manager fails to match registered slots to the pending task manager slots. The first-round pending slots are not completed by the registered slots; however, the pending requests still get allocated to the registered slots through the `handleFreeSlot` code path, which also unassigns each request from its pending task manager slot. As a result, free pending task manager slots are left in the slot manager even though no slots are actually pending. That prevents the resource manager from requesting new containers for the second-round slot requests.

The registered slots are not matched with the pending task manager slots because the managed memory sizes in their resource profiles do not match exactly. This could be caused by small errors in the conversions between megabytes and bytes.

I'll continue debugging this.

> YarnResourceManager does not handle slot allocations in certain cases
> ---------------------------------------------------------------------
>
>                 Key: FLINK-13241
>                 URL: https://issues.apache.org/jira/browse/FLINK-13241
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.9.0
>            Reporter: Zhu Zhu
>            Priority: Major
>         Attachments: 17_37_05__07_12_2019.jpg
>
>
> When a job allocates a few slots first and, after a while, allocates some
> more slots, the YarnResourceManager seems to receive and ignore the later
> slot requests.
> To reproduce this issue, we can create a job with 2 vertices in different
> slot sharing groups, as shown below:
> !17_37_05__07_12_2019.jpg|width=433,height=127!
> Slot allocation for the map2 vertex happens after the source vertex acquires
> its slots, so that its location can be decided to meet the input constraints.
> The YarnResourceManager can receive the slot requests for map2, but it does
> not seem to handle them, and the job hangs there waiting for resources.
> In my observation, this issue does not happen on Flink (Version: 1.9-SNAPSHOT,
> Rev:3bc322a, Date:26.06.2019 @ 17:28:51 CST). It appears to be a new issue
> introduced after that revision.
>

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
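
The mismatch suspected in the comment above (managed memory sizes that differ slightly after megabyte/byte conversions) can be illustrated with a minimal sketch. This is not Flink's code: the class `SlotProfileMatchSketch`, its methods, and the sample value of 500,000,000 bytes are all hypothetical, and the comment only says conversion errors "could be" the cause. The sketch assumes one side keeps the original byte count while the other side round-trips it through whole megabytes with integer division.

```java
// Illustrative sketch only: every name here is hypothetical, none of it is Flink's API.
final class SlotProfileMatchSketch {

    static final long BYTES_PER_MB = 1024L * 1024L;

    // Integer division truncates, so any remainder below 1 MB is lost.
    static long bytesToMegabytes(long bytes) {
        return bytes / BYTES_PER_MB;
    }

    static long megabytesToBytes(long megabytes) {
        return megabytes * BYTES_PER_MB;
    }

    // Strict equality: any rounding difference makes the match fail.
    static boolean exactMatch(long pendingBytes, long registeredBytes) {
        return pendingBytes == registeredBytes;
    }

    public static void main(String[] args) {
        // Hypothetical managed memory size that is not a multiple of 1 MB.
        long pendingBytes = 500_000_000L;

        // The same size after a bytes -> megabytes -> bytes round trip.
        long registeredBytes = megabytesToBytes(bytesToMegabytes(pendingBytes));

        System.out.println("pending bytes:    " + pendingBytes);
        System.out.println("registered bytes: " + registeredBytes);
        System.out.println("exact match:      " + exactMatch(pendingBytes, registeredBytes));
    }
}
```

If this is indeed the cause, the two values would describe the same slot yet never compare equal, which would be consistent with the pending task manager slots never being completed by the registered slots.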