[ https://issues.apache.org/jira/browse/FLINK-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
YufeiLiu updated FLINK-15959:
-----------------------------
Description:
Flink removed the `-n` option after FLIP-6; since then, the ResourceManager starts
a new worker only when one is required. However, I think maintaining a certain
number of slots is necessary. These workers would start immediately when the
ResourceManager starts and would not be released even if all of their slots are
free.
Here are some reasons:
# Users already know how many resources are needed when running a single job, so
initializing all workers when the cluster starts can speed up the startup process.
# Jobs are scheduled in topological order, and the next operator won't be
scheduled until the slots for the prior execution are allocated. In some cases the
TaskExecutors are therefore started in several batches, which can slow down
startup.
# Flink supports [FLINK-12122|https://issues.apache.org/jira/browse/FLINK-12122]
[Spread out tasks evenly across all available registered TaskManagers], but it
only takes effect if all TMs are registered. Starting all TMs at the beginning
solves this problem.
*Suggestion:*
* Add the configs "taskmanager.minimum.numberOfTotalSlots" and
"taskmanager.maximum.numberOfTotalSlots".
* Start enough workers to satisfy the minimum number of slots when the
ResourceManager is granted leadership (subtracting recovered workers).
* Don't complete slot requests until the minimum number of slots has registered,
and throw an exception when the maximum is exceeded.
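A minimal Java sketch of how the three suggested rules could interact. It is a hypothetical model, not ResourceManager or SlotManager code: the class `SlotBoundsSketch`, its methods, the fixed number of slots per worker, unique request IDs, and the choice to count the maximum in whole workers are all assumptions; only the two config key names come from this proposal.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/**
 * Hypothetical model of the proposed min/max total-slot rules.
 * None of these names are Flink APIs; request IDs are assumed to be unique.
 */
public class SlotBoundsSketch {
    private final int minTotalSlots;  // proposed "taskmanager.minimum.numberOfTotalSlots"
    private final int maxTotalSlots;  // proposed "taskmanager.maximum.numberOfTotalSlots"
    private final int slotsPerWorker; // slots per TaskExecutor (assumed fixed)

    private int startedSlots = 0;     // slots of started or recovered workers
    private int registeredSlots = 0;  // slots of workers that have registered
    private int allocatedSlots = 0;   // slots handed out to requests
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> completed = new ArrayList<>();

    public SlotBoundsSketch(int minTotalSlots, int maxTotalSlots, int slotsPerWorker) {
        if (slotsPerWorker <= 0 || minTotalSlots < 0) {
            throw new IllegalArgumentException("invalid slot configuration");
        }
        // Assumption: the minimum, rounded up to whole workers, must fit under the maximum.
        if (ceilDiv(minTotalSlots, slotsPerWorker) * slotsPerWorker > maxTotalSlots) {
            throw new IllegalArgumentException("minimum cannot be met without exceeding maximum");
        }
        this.minTotalSlots = minTotalSlots;
        this.maxTotalSlots = maxTotalSlots;
        this.slotsPerWorker = slotsPerWorker;
    }

    private static int ceilDiv(int a, int b) {
        return (a + b - 1) / b;
    }

    /** On leadership: returns how many new workers to start, after counting recovered ones. */
    public int onLeadershipGranted(int recoveredWorkers) {
        startedSlots = recoveredWorkers * slotsPerWorker;
        int missingSlots = minTotalSlots - startedSlots;
        int workersToStart = missingSlots > 0 ? ceilDiv(missingSlots, slotsPerWorker) : 0;
        startedSlots += workersToStart * slotsPerWorker;
        return workersToStart;
    }

    /** A started worker has registered all of its slots. */
    public void onWorkerRegistered() {
        registeredSlots += slotsPerWorker;
        drainPending();
    }

    /** Returns true if the request completed now, false if it is still pending. */
    public boolean requestSlot(String requestId) {
        int demand = allocatedSlots + pending.size() + 1;
        if (demand > startedSlots) {
            // Another worker is needed; refuse if it would push the total past the maximum.
            if (startedSlots + slotsPerWorker > maxTotalSlots) {
                throw new IllegalStateException(
                        "request " + requestId + " would exceed " + maxTotalSlots + " total slots");
            }
            startedSlots += slotsPerWorker; // stands in for starting one more worker
        }
        pending.add(requestId);
        drainPending();
        return completed.contains(requestId);
    }

    public List<String> completedRequests() {
        return completed;
    }

    private void drainPending() {
        // Hold every request back until the minimum number of slots has registered.
        if (registeredSlots < minTotalSlots) {
            return;
        }
        while (!pending.isEmpty() && allocatedSlots < registeredSlots) {
            completed.add(pending.poll());
            allocatedSlots++;
        }
    }
}
```

With min = 4, max = 6 and 2 slots per worker, granting leadership with one recovered worker starts one new worker; the first slot request waits until both workers have registered, and a request that would need a fourth worker (8 slots in total) is rejected.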
was:
Flink removed the `-n` option after FLIP-6; since then, the ResourceManager starts
a new worker only when one is required. However, I think maintaining a certain
number of slots is necessary. These workers would start immediately when the
ResourceManager starts and would not be released even if all of their slots are
free.
Here are some reasons:
# Users already know how many resources are needed when running a single job, so
initializing all workers when the cluster starts can speed up the startup process.
# Jobs are scheduled in topological order, and the next operator won't be
scheduled until the slots for the prior execution are allocated. In some cases the
TaskExecutors are therefore started in several batches, which can slow down
startup.
# Flink supports [FLINK-12122|https://issues.apache.org/jira/browse/FLINK-12122]
[Spread out tasks evenly across all available registered TaskManagers], but it
only takes effect if all TMs are registered. Starting all TMs at the beginning
solves this problem.
*Suggestion:*
Add the configs "taskmanager.minimum.numberOfTotalSlots" and
"taskmanager.maximum.numberOfTotalSlots", and start enough workers to satisfy the
minimum number of slots when the ResourceManager is granted leadership
(subtracting recovered workers).
Don't complete slot requests until the minimum number of slots has registered, and
throw an exception when the maximum is exceeded.
> Add min/max number of slots configuration to limit total number of slots
> ------------------------------------------------------------------------
>
> Key: FLINK-15959
> URL: https://issues.apache.org/jira/browse/FLINK-15959
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Coordination
> Affects Versions: 1.11.0
> Reporter: YufeiLiu
> Priority: Major