[ https://issues.apache.org/jira/browse/FLINK-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033423#comment-17033423 ]

Xintong Song commented on FLINK-15959:
--------------------------------------

Hi [~liuyufei],

Thanks for opening this ticket. The use case sounds reasonable to me, 
especially regarding the load balancing.

My main concern is about blocking slot requests until the minimum number of 
slots is registered. I'm not sure how this would affect the job / cluster 
startup time. It might be acceptable if it only affects cases where this new 
feature is used.
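
To make the blocking behavior concrete, here is a minimal sketch of the gating 
idea, assuming a hypothetical gate object consulted on the slot allocation 
path. The class and method names are invented for illustration; they are not 
Flink's actual {{SlotManager}} internals.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical illustration of "hold slot requests until the minimum
// number of slots is registered". All names are made up for this sketch.
class MinSlotsGate {

    private final int minSlots;        // value of the proposed config option
    private int registeredSlots = 0;
    private final Queue<Runnable> pendingRequests = new ArrayDeque<>();

    MinSlotsGate(int minSlots) {
        this.minSlots = minSlots;
    }

    /** Called whenever a TaskExecutor registers, reporting its slot count. */
    synchronized void onSlotsRegistered(int slots) {
        registeredSlots += slots;
        if (registeredSlots >= minSlots) {
            // Threshold reached: complete every request that was held back.
            while (!pendingRequests.isEmpty()) {
                pendingRequests.poll().run();
            }
        }
    }

    /** Completes a slot request immediately, or parks it until the minimum is met. */
    synchronized void handleSlotRequest(Runnable completeRequest) {
        if (registeredSlots >= minSlots) {
            completeRequest.run();
        } else {
            pendingRequests.add(completeRequest);
        }
    }
}
{code}

Note that with {{minSlots = 0}} the gate never parks a request, so startup 
time would be unaffected unless the feature is explicitly used.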

Some suggestions on the proposed approach, should we decide to solve the 
problem along this direction:
- I think this is not a Yarn-specific issue, but a common issue for all active 
deployments, including Kubernetes and Mesos. Therefore, the changes should not 
be made in {{YarnResourceManager}}, but rather in common code such as 
{{ResourceManager}} or {{SlotManager}}.
- Instead of the total number of task executors, I would suggest exposing the 
configuration option to users as the minimum number of slots. The slot number 
is more aligned with users' knowledge of their job parallelism, and with the 
proposed "don't complete slot requests until the minimum number of slots is 
registered". By defining it as a minimum rather than a total, we can always 
allocate more containers than the configured value if needed. And if we make 
the default value of the minimum slot number 0, we keep the same behavior as 
before whenever the config option is not explicitly configured. A sketch of 
such an option follows below.
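
For illustration, a minimal sketch of what such a configuration option could 
look like. The key name {{slotmanager.number-of-slots.min}} and the holder 
class are assumptions of this sketch, not an agreed-upon API.

{code:java}
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Hypothetical sketch: the option key and the class it lives in are
// assumptions, not an agreed-upon part of Flink's configuration surface.
public class SlotManagerOptions {

    /**
     * Minimum number of slots the cluster keeps registered. The default of 0
     * preserves the current behavior when the option is not explicitly set.
     */
    public static final ConfigOption<Integer> MIN_SLOT_NUM =
            ConfigOptions.key("slotmanager.number-of-slots.min")
                    .intType()
                    .defaultValue(0)
                    .withDescription(
                            "The minimum number of slots the cluster keeps registered. "
                                    + "Slot requests are not completed until at least "
                                    + "this many slots have registered.");
}
{code}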

> Add TaskExecutor number option in FlinkYarnSessionCli
> -----------------------------------------------------
>
>                 Key: FLINK-15959
>                 URL: https://issues.apache.org/jira/browse/FLINK-15959
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.0
>            Reporter: YufeiLiu
>            Priority: Major
>
> Flink removed the `-n` option after FLIP-6, changing to a model where the 
> ResourceManager starts a new worker when required. But I think maintaining a 
> TaskExecutor number option is necessary. These workers would start 
> immediately when the ResourceManager starts and would not be released even 
> if all slots are free.
> Here are some reasons:
> # Users actually know how many resources are needed when running a single 
> job; initializing all workers when the cluster starts can speed up the 
> startup process.
> # Jobs are scheduled in topology order; the next operator won't be scheduled 
> until the prior execution's slot is allocated. The TaskExecutors will start 
> in several batches in some cases, which can slow down startup.
> # Flink supports 
> [FLINK-12122|https://issues.apache.org/jira/browse/FLINK-12122] [Spread out 
> tasks evenly across all available registered TaskManagers], but it only 
> takes effect if all TMs are registered. Starting all TMs at the beginning 
> can solve this problem.
> *Suggestion:*
> I only changed YarnResourceManager: start all containers in the `initialize` 
> stage, and don't complete slot requests until the minimum number of slots is 
> registered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
