[ 
https://issues.apache.org/jira/browse/FLINK-24005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407372#comment-17407372
 ] 

Till Rohrmann commented on FLINK-24005:
---------------------------------------

{quote}In practice this is currently the case, but if the AdaptiveScheduler 
were to actually support resource profiles then it would also need it I 
believe.{quote}
I am not so sure about this. The {{AdaptiveScheduler}} is in charge of the 
resource requirements and of which slot it uses to fulfill which requirement. I 
would see it as the responsibility of the {{AdaptiveScheduler}} to adjust the 
requirements if it decides to match {{Executions}} and {{Slots}} differently.
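
To illustrate the idea, here is a minimal sketch of a scheduler that owns the 
requirement accounting and re-declares the requirements itself whenever it 
changes how {{Executions}} are mapped to {{Slots}}. The class and method names 
below are invented for this illustration and are not the actual Flink 
interfaces.

{code:java}
// Illustrative sketch only: names are made up and do not correspond to the
// real Flink code base. The scheduler tracks the declared slot count per
// resource profile and re-publishes the full requirement set on every change.
import java.util.HashMap;
import java.util.Map;

final class RequirementOwningScheduler {

    /** Declared slot count per (illustrative) resource profile name. */
    private final Map<String, Integer> declared = new HashMap<>();

    /** The scheduler decides to run an execution with the given profile. */
    void assignExecution(String resourceProfile) {
        declared.merge(resourceProfile, 1, Integer::sum);
        publish();
    }

    /** The scheduler no longer needs a slot with the given profile. */
    void unassignExecution(String resourceProfile) {
        declared.computeIfPresent(
                resourceProfile, (profile, count) -> count > 1 ? count - 1 : null);
        publish();
    }

    private void publish() {
        // Stand-in for forwarding the complete requirement set to the slot
        // pool / resource manager.
        System.out.println("declare " + declared);
    }
}
{code}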

{quote}Generally yes but this was already the case ever since we added the 
DeclarativeSlotPool. This problem isn't new, and we made the conscious decision 
to limit any resource concerns to the pool. It certainly wasn't ideal, but so 
is the fact that we still need the bridge in the first place.{quote}
I agree that the responsibilities are not well separated atm. From a 
maintenance perspective it would be desirable to move all the 
default-scheduler-specific logic into the {{DeclarativeSlotPoolBridge}}. Maybe 
it is now a bit clearer which responsibility should go into which class, now 
that the {{AdaptiveScheduler}} has been written.

bq. Will we extend the adaptive scheduler to cover batch jobs?

Eventually this would be really nice. But I don't have a good idea how to do it 
atm.

bq. Will we refactor the DefaultScheduler to directly work with declarative 
resource management (at the very least declaring requirements explicitly (which 
would solve a lot of issues))?

I don't think so. The declarative resource management specific logic will 
probably be contained in the {{DeclarativeSlotPoolBridge}}.

bq. Will we continue to have this compatibility layer?

Yes, I think so.

bq. Before we start any larger refactorings we should have a clear idea where 
we are even headed.

I agree. But we should acknowledge that the proposed fix will further entangle 
{{DeclarativeSlotPoolBridge}} and {{DefaultDeclarativeSlotPool}}, which can make 
their maintenance harder in the future.


> Resource requirements declaration may be incorrect if JobMaster disconnects 
> with a TaskManager with available slots in the SlotPool
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-24005
>                 URL: https://issues.apache.org/jira/browse/FLINK-24005
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.12.5, 1.13.2
>            Reporter: Zhu Zhu
>            Assignee: Chesnay Schepler
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.14.0, 1.12.6, 1.13.3
>
>         Attachments: decrease_resource_requirements.log
>
>
> When a TaskManager disconnects from the JobMaster, this triggers 
> `DeclarativeSlotPoolService#decreaseResourceRequirementsBy()` for all the 
> slots from that TaskManager that are registered at the JobMaster. If the 
> slots are still available, i.e. not assigned to any task, the 
> `decreaseResourceRequirementsBy` call may lead to an incorrect resource 
> requirements declaration.
> For example, consider a job with only 3 source tasks. It requires 3 slots 
> and declares 3 slots. Initially all the tasks are running. Suddenly one 
> task fails and waits for some delay before restarting. Its slot is 
> returned to the SlotPool. Now the job requires 2 slots and declares 2 
> slots. At this moment, the TaskManager of that returned slot gets lost. After 
> the triggered `decreaseResourceRequirementsBy`, the job only declares 1 
> slot. Finally, when the failed task is re-scheduled, the job will 
> declare 2 slots while it actually needs 3 slots.
> The attached log of a real job and the logs of the test added in 
> https://github.com/zhuzhurk/flink/commit/59ca0ac5fa9c77b97c6e8a43dcc53ca8a0ad6c37
> demonstrate this case.
> Note that the real job is configured with a large 
> "restart-strategy.fixed-delay.delay" and a large "slot.idle.timeout", so 
> this is probably a rare case in production.
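
To make the described sequence concrete, the following is a minimal, 
self-contained trace of the declared-slot accounting. The class below is a 
stand-in written purely for illustration, not the real SlotPool 
implementation.

{code:java}
// Traces the declared-slot count through the scenario above: 3 source tasks,
// one failure, then loss of the TaskManager holding the returned (free) slot.
final class DeclaredRequirementsTrace {

    private int declaredSlots;

    void increaseBy(int slots) {
        declaredSlots += slots;
    }

    void decreaseBy(int slots) {
        declaredSlots = Math.max(0, declaredSlots - slots);
    }

    public static void main(String[] args) {
        DeclaredRequirementsTrace declared = new DeclaredRequirementsTrace();

        declared.increaseBy(3); // job starts: 3 source tasks -> declare 3 slots
        declared.decreaseBy(1); // one task fails, its slot is returned -> declare 2 slots
        declared.decreaseBy(1); // TaskManager of the returned (still free) slot is lost,
                                // decreaseResourceRequirementsBy drops the declaration to 1
        declared.increaseBy(1); // the failed task is re-scheduled -> declare 2 slots

        // The job now declares 2 slots although it actually needs 3.
        System.out.println("declared=" + declared.declaredSlots + ", needed=3");
    }
}
{code}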



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
