[
https://issues.apache.org/jira/browse/FLINK-29344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Knauf reassigned FLINK-29344:
----------------------------------------
Assignee: Chesnay Schepler
> Make Adaptive Scheduler supports Fine-Grained Resource Management
> -----------------------------------------------------------------
>
> Key: FLINK-29344
> URL: https://issues.apache.org/jira/browse/FLINK-29344
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: Xintong Song
> Assignee: Chesnay Schepler
> Priority: Major
>
> This ticket is a reflection of the following Slack discussion:
> {quote}
> Donatien Schmitz
> Adaptive Scheduler thread:
> Hey all, it seems like the Adaptive Scheduler does not support fine-grained
> resource management. I have fixed it and would like to know if you would be
> interested in a PR or if it was purposely designed to not support fine-grained
> resource management.
> rmetzger
> @Donatien Schmitz: I’m concerned that we don’t have a lot of review capacity
> right now, and I’m not aware of any users asking for it.
> rmetzger
> I couldn’t find a ticket for adding this feature, did you find one?
> If not, can you add one? This will allow us to at least make this feature
> show up on Google, and people might comment on it if they need it.
> rmetzger
> If the change is fairly self-contained and unlikely to cause instabilities,
> then we can also consider merging it.
> rmetzger
> @Xintong Song what do you think?
> Xintong Song
> @rmetzger, thanks for involving me.
> @Donatien Schmitz, thanks for bringing this up, and for volunteering to
> fix this. Could you explain a bit more about how you plan to fix this?
> Fine-grained resource management is not yet supported by the adaptive
> scheduler, because there’s an issue that we haven’t found a good solution for.
> Namely, if only part of the resource requirements can be fulfilled, how do we
> decide which requirements should be fulfilled? E.g., say the job declares it needs
> 10 slots with resource 1 for map tasks, and another 10 slots with resource 2
> for reduce tasks. If there aren’t enough resources (say only 10 slots can be
> allocated, for simplicity), how many slots for map / reduce tasks should be
> allocated? Obviously, <10 map, 0 reduce> & <0 map, 10 reduce> would not work.
> For this example, a proportional scale-down (<5 map, 5 reduce>) seems
> reasonable. However, a proportional scale-down is not always easy (e.g.,
> requirements are <100 map, 1 reduce>), and the issue grows more complicated if
> you take lots of stages and the differences of slot sizes into consideration.
> I’d like to see the adaptive scheduler also support fine-grained resource
> management. If there’s a good solution to the above issue, I’d love to help
> review the effort.
> Donatien Schmitz
> Dear Robert and Xintong, thanks for reading and reacting to my message! I'll
> reply tomorrow (GMT+1 time) if that's quite alright with you. Best, Donatien
> Schmitz
> Donatien Schmitz
> @Xintong Song
> * We are working on fine-grained scheduling for resource optimisation of
> long-running or periodic jobs. One of the features we are experimenting with
> is a "rescheduling plan", a mapping of operators to Resource Profiles that can
> be dynamically applied to a running job. This rescheduling would be triggered
> by policies on certain metrics (with a focus on RocksDB in our case).
> * While developing this new feature, we decided to implement it on the
> Adaptive Scheduler instead of the Base Scheduler because its existing state
> machine made this a more natural fit: transitions through the
> states Executing -> Cancelling -> Rescheduling -> Waiting for Resources ->
> Creating -> Executing
> * In our case we are working on a POC and thus focusing on a really simple job
> with a parallelism of 1. The issue you brought up is indeed something we have
> faced while raising the parallelism of the job.
> * If you create a Jira Ticket we can discuss it over there if you'd like!
> Donatien Schmitz
> @rmetzger The changes do not break the default resource management, but they
> do not fix the issue raised by Xintong.
> {quote}
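The proportional scale-down mentioned in the discussion can be illustrated with a small sketch. This is not Flink code and not the approach proposed in the ticket. The function name `proportional_scale_down` and the largest-remainder rounding rule are assumptions made for illustration, and the sketch treats all slots as the same size, ignoring the slot-size differences the discussion raises.

```python
def proportional_scale_down(requirements, available):
    """Scale per-task slot counts down proportionally to fit `available` slots.

    `requirements` maps a task name to the number of slots it asks for.
    Hypothetical helper: uses largest-remainder rounding and ignores
    differing slot sizes (resource profiles).
    """
    total = sum(requirements.values())
    if total <= available:
        # Everything fits, so nothing needs to be scaled down.
        return dict(requirements)

    allocation = {}
    remainders = {}
    for name, count in requirements.items():
        # Integer share of the available slots, plus the remainder that
        # was lost to rounding down.
        allocation[name], remainders[name] = divmod(count * available, total)

    # Hand the slots lost to rounding to the tasks with the largest remainders.
    leftover = available - sum(allocation.values())
    by_remainder = sorted(remainders, key=remainders.get, reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


if __name__ == "__main__":
    # The balanced example from the discussion: 10 + 10 requested, 10 available.
    print(proportional_scale_down({"map": 10, "reduce": 10}, 10))
    # The skewed example: 100 + 1 requested, 10 available.
    print(proportional_scale_down({"map": 100, "reduce": 1}, 10))
```

With the balanced requirements the sketch returns <5 map, 5 reduce>. With <100 map, 1 reduce> and 10 slots it returns <10 map, 0 reduce>, leaving the reduce stage without any slot, which is the kind of case that makes a purely proportional rule insufficient.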
--
This message was sent by Atlassian Jira
(v8.20.10#820010)