[ 
https://issues.apache.org/jira/browse/FLINK-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883824#comment-16883824
 ] 

Xintong Song commented on FLINK-13241:
--------------------------------------

[~till.rohrmann], [~zhuzh]

Quick updates on discoveries so far.

I found that the resource manager fails to match registered slots to the 
pending task manager slots. The first-round pending slots are not completed by 
the registered slots; however, the pending requests still get allocated on the 
registered slots through the code path `handleFreeSlot`, which also unassigns 
the request from the pending task manager slot. As a result, free pending task 
manager slots are left in the slot manager although no slots are actually 
pending. That prevents the resource manager from requesting new containers for 
the second-round slot requests.
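
To make the sequence above concrete, here is a minimal toy model of the 
bookkeeping. It is only a sketch under stated assumptions: the class 
`PendingSlotSketch`, its fields, and its methods are hypothetical and are not 
Flink's actual SlotManager code; the model also assumes that any registered 
slot is large enough to serve a waiting request.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model of the behaviour described above; NOT Flink's real SlotManager. */
public class PendingSlotSketch {
    /** A slot requested from the cluster that has not registered yet (hypothetical type). */
    static final class PendingSlot {
        final long managedMemoryBytes;
        Integer assignedRequest; // null means this is a free pending slot

        PendingSlot(long managedMemoryBytes) {
            this.managedMemoryBytes = managedMemoryBytes;
        }
    }

    final List<PendingSlot> pendingSlots = new ArrayList<>();
    final long requestedProfileBytes;
    int containersRequested = 0;
    int fulfilledRequests = 0;

    PendingSlotSketch(long requestedProfileBytes) {
        this.requestedProfileBytes = requestedProfileBytes;
    }

    /** A slot request reuses a free pending slot if there is one, else asks for a container. */
    void requestSlot(int requestId) {
        for (PendingSlot pending : pendingSlots) {
            if (pending.assignedRequest == null) {
                pending.assignedRequest = requestId; // no new container is requested
                return;
            }
        }
        containersRequested++;
        PendingSlot pending = new PendingSlot(requestedProfileBytes);
        pending.assignedRequest = requestId;
        pendingSlots.add(pending);
    }

    /** A task manager registers a slot with the profile it actually reports. */
    void registerSlot(long reportedBytes) {
        // An exactly matching profile completes the pending slot and removes it.
        for (Iterator<PendingSlot> it = pendingSlots.iterator(); it.hasNext();) {
            PendingSlot pending = it.next();
            if (pending.managedMemoryBytes == reportedBytes) {
                it.remove();
                if (pending.assignedRequest != null) {
                    fulfilledRequests++;
                }
                return;
            }
        }
        // No match: the slot is handled as a free slot. A waiting request is served
        // and unassigned from its pending slot, but the pending slot itself stays.
        for (PendingSlot pending : pendingSlots) {
            if (pending.assignedRequest != null) {
                pending.assignedRequest = null;
                fulfilledRequests++;
                return;
            }
        }
    }

    public static void main(String[] args) {
        long profile = 100L * 1024L * 1024L;          // hypothetical requested profile
        PendingSlotSketch slotManager = new PendingSlotSketch(profile);
        slotManager.requestSlot(1);                   // first round: one container requested
        slotManager.registerSlot(profile - 1);        // reported profile is off by one byte
        slotManager.requestSlot(2);                   // second round: parked on the stale slot
        System.out.println("containers requested: " + slotManager.containersRequested);
        System.out.println("stale pending slots: " + slotManager.pendingSlots.size());
    }
}
```

In this toy run only one container is ever requested: the second request is 
assigned to the leftover pending slot, which no task manager will ever 
complete, so the request waits indefinitely.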

The registered slots are not matched with the pending task manager slots 
because the managed memory sizes in their resource profiles do not match 
exactly. This could be caused by small errors in the conversions between 
megabytes and bytes. I'll continue debugging this.
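
As an illustration of how such a conversion error could arise (the cause is 
not confirmed yet), the sketch below round-trips a byte count through whole 
megabytes. The class `MemoryRoundTripSketch`, its helper methods, and the 
300.5 MB value are hypothetical and are not taken from the Flink code base.

```java
/** Illustrative only: shows how a bytes -> MB -> bytes round trip can lose precision. */
public class MemoryRoundTripSketch {
    static final long BYTES_PER_MB = 1024L * 1024L;

    /** Integer division truncates any fraction of a megabyte. */
    static long bytesToMegabytes(long bytes) {
        return bytes / BYTES_PER_MB;
    }

    static long megabytesToBytes(long megabytes) {
        return megabytes * BYTES_PER_MB;
    }

    /** Exact comparison, as an equality-based profile match would perform. */
    static boolean exactMatch(long requestedBytes, long reportedBytes) {
        return requestedBytes == reportedBytes;
    }

    public static void main(String[] args) {
        // Hypothetical managed memory size of 300.5 MB, expressed in bytes.
        long requested = 300L * BYTES_PER_MB + 512L * 1024L;
        // The same value after passing through a whole-megabyte representation.
        long reported = megabytesToBytes(bytesToMegabytes(requested));

        System.out.println("bytes lost: " + (requested - reported));
        System.out.println("exact match: " + exactMatch(requested, reported));
    }
}
```

Any size that is not a whole number of megabytes loses its remainder in the 
round trip, so an exact comparison of the two profiles fails even though they 
describe the same slot.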

> YarnResourceManager does not handle slot allocations in certain cases
> ---------------------------------------------------------------------
>
>                 Key: FLINK-13241
>                 URL: https://issues.apache.org/jira/browse/FLINK-13241
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.9.0
>            Reporter: Zhu Zhu
>            Priority: Major
>         Attachments: 17_37_05__07_12_2019.jpg
>
>
> When a job allocates a few slots first and, after a period, allocates some 
> more slots, the YarnResourceManager seems to receive and ignore the later 
> slot requests.
> To reproduce this issue, we can create a job with 2 vertices in different 
> slot sharing groups, as shown below:
> !17_37_05__07_12_2019.jpg|width=433,height=127!
> Slot allocation for the map2 vertex happens after the source vertex has 
> acquired its slots and its location is decided, so that the input 
> constraints can be met.
> The YarnResourceManager does receive the slot requests for map2, but it does 
> not seem to handle them, and the job hangs waiting for resources.
> In my observation, this issue does not happen on Flink (Version: 
> 1.9-SNAPSHOT, Rev:3bc322a, Date:26.06.2019 @ 17:28:51 CST), so it should be a 
> new issue introduced after that.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
