[
https://issues.apache.org/jira/browse/FLINK-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037719#comment-17037719
]
Yangze Guo edited comment on FLINK-15959 at 2/16/20 7:10 AM:
-------------------------------------------------------------
Hi, [~liuyufei]. Thanks for proposing this change.
Regarding your suggestion, I think the ResourceManager does not need to throw an
exception when the maximum limit is exceeded. AFAIK, for batch jobs, the job
graph can be executed without all slot requests being fulfilled. We may
just emit an info-level log in this scenario. For streaming jobs, slot
requests that cannot be fulfilled would fail in the timeout check.
BTW, since it touches the Public interface, I think we need to open a FLIP for
this change. I also want to help introduce the maximum resource limit for
task executors; there could be an interrelationship between the maximum and
minimum limits. Would you like to work on it together?
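To make the "log instead of throw" idea concrete, here is a minimal sketch in plain Java. It is not Flink's ResourceManager API: the class `MaxSlotLimitSketch`, its fields, and `tryAllocateWorker` are hypothetical names chosen for illustration, and it assumes every worker offers the same number of slots.

```java
import java.util.logging.Logger;

// Hypothetical sketch, not Flink code: enforce a maximum total slot count
// by declining to start a worker and logging at info level, not throwing.
final class MaxSlotLimitSketch {
    private static final Logger LOG =
            Logger.getLogger(MaxSlotLimitSketch.class.getName());

    private final int maxTotalSlots;   // assumed configured upper bound
    private final int slotsPerWorker;  // assumed identical for all workers
    private int allocatedSlots;        // slots of workers started so far

    MaxSlotLimitSketch(int maxTotalSlots, int slotsPerWorker) {
        this.maxTotalSlots = maxTotalSlots;
        this.slotsPerWorker = slotsPerWorker;
    }

    // Returns true if a new worker may be started. When the limit would be
    // exceeded, the request is left pending (to be handled by whatever
    // timeout check exists) and only an info-level message is logged.
    boolean tryAllocateWorker() {
        if (allocatedSlots + slotsPerWorker > maxTotalSlots) {
            LOG.info(String.format(
                    "Not starting a new worker: %d slots allocated, limit is %d.",
                    allocatedSlots, maxTotalSlots));
            return false;
        }
        allocatedSlots += slotsPerWorker;
        return true;
    }

    int allocatedSlots() {
        return allocatedSlots;
    }
}
```

With a limit of 4 slots and 2 slots per worker, the first two calls to `tryAllocateWorker()` succeed and the third is declined without an exception, which is the behavior a batch job that can run with fewer slots would want.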
was (Author: karmagyz):
Hi, [~liuyufei]. Thanks for proposing this change.
Regarding your suggestion, I think the ResourceManager does not need to throw an
exception when the maximum limit is exceeded. AFAIK, for batch jobs, the job
graph can be executed without all slot requests being fulfilled. We may
just emit an info-level log in this scenario. For streaming jobs, slot
requests that cannot be fulfilled would fail by the timeout check.
BTW, since it touches the Public interface, I think we need to open a FLIP for
this change. I also want to introduce the maximum resource limit for task
executors; there could be an interrelationship between the maximum and minimum
limits. Would you like to work on it together?
> Add min/max number of slots configuration to limit total number of slots
> ------------------------------------------------------------------------
>
> Key: FLINK-15959
> URL: https://issues.apache.org/jira/browse/FLINK-15959
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.11.0
> Reporter: YufeiLiu
> Priority: Major
>
> Flink removed the `-n` option after FLIP-6 and changed to having the
> ResourceManager start a new worker when required. But I think maintaining a
> certain number of slots is necessary. These workers would start immediately
> when the ResourceManager starts and would not be released even if all their
> slots are free.
> Here are some reasons:
> # Users usually know how many resources are needed when running a single job;
> initializing all workers when the cluster starts can speed up the startup process.
> # Jobs are scheduled in topological order: the next operator won't be scheduled
> until the prior execution's slot is allocated. In some cases the TaskExecutors
> will start in several batches, which might slow down startup.
> # Flink supports
> [FLINK-12122|https://issues.apache.org/jira/browse/FLINK-12122] [Spread out
> tasks evenly across all available registered TaskManagers], but it only takes
> effect if all TMs are registered. Starting all TMs at the beginning can solve this
> problem.
> *suggestion:*
> * Add configs "taskmanager.minimum.numberOfTotalSlots" and
> "taskmanager.maximum.numberOfTotalSlots"; the default behavior stays the same
> as before.
> * Start enough workers to satisfy the minimum number of slots when the
> ResourceManager accepts leadership (subtracting recovered workers).
> * Don't complete slot requests until the minimum number of slots is registered,
> and throw an exception when the maximum is exceeded.
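The second and third suggestions can be sketched with a little arithmetic. The Java below is an illustration under stated assumptions, not Flink code: `MinSlotsSketch`, `workersToStart`, and `mayCompleteSlotRequests` are hypothetical names, and it assumes every worker (including recovered ones) provides the same number of slots.

```java
// Hypothetical sketch, not Flink code: how many workers to start when the
// ResourceManager gains leadership so that a minimum slot count is reached.
final class MinSlotsSketch {
    private MinSlotsSketch() {}

    // Workers still needed to reach minTotalSlots, after subtracting the
    // slots already provided by recovered workers. Never negative.
    static int workersToStart(int minTotalSlots, int slotsPerWorker, int recoveredWorkers) {
        if (slotsPerWorker <= 0) {
            throw new IllegalArgumentException("slotsPerWorker must be positive");
        }
        int missingSlots = minTotalSlots - recoveredWorkers * slotsPerWorker;
        if (missingSlots <= 0) {
            return 0;
        }
        // Ceiling division: a partially needed worker still has to be started.
        return (missingSlots + slotsPerWorker - 1) / slotsPerWorker;
    }

    // Slot requests are held back until the minimum number of slots has registered.
    static boolean mayCompleteSlotRequests(int registeredSlots, int minTotalSlots) {
        return registeredSlots >= minTotalSlots;
    }
}
```

For example, with a minimum of 10 slots, 4 slots per worker, and 1 recovered worker, 6 slots are missing, so 2 more workers would be started.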