xintongsong commented on issue #8740: [FLINK-12763][runtime] Fail job immediately if tasks’ resource needs can not be satisfied.
URL: https://github.com/apache/flink/pull/8740#issuecomment-508469942

Hi @StephanEwen, thank you for the comment, and sorry for being unclear.

The changes in this PR are:
- Set the `ResourceProfile` of slot requests according to the `ResourceSpec`. Before this PR, slot requests were always sent with an `UNKNOWN` resource profile, no matter what the `ResourceSpec` was.
- Fail a slot request immediately if it asks for a slot that is too large to be satisfied. This avoids waiting for the slot request timeout to discover the problem.
  - For Yarn/Mesos, the resource profiles of the slots in the cluster are determined by the configuration on the RM side. Therefore, the RM knows from the very beginning which resource profiles the available slots will have.
  - For Standalone, the RM does not know which slots exist or what resource profiles they have until the TMs have registered. If the RM receives a slot request that cannot be satisfied by any registered slot, it does not know whether to fail the request or to wait for more TMs to register. The solution in this PR is to have an initial period after the RM starts, during which most TMs are expected to register. Slot requests with any resource profile are allowed to stay pending during this period. After it ends, pending and newly arriving requests that cannot be satisfied by any registered slot are failed.

I'll rebase the PR onto the latest code and reorganize it.
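
The Standalone behavior described above can be illustrated with a minimal sketch. This is not the code in this PR: `StartupAwareSlotMatcher`, `Profile`, `Decision`, and the two-field resource model are hypothetical names and simplifications chosen for illustration, and the real `ResourceProfile` in Flink carries more dimensions. The sketch only shows the decision rule: a request that no registered slot can host stays pending during the startup period and is failed once that period is over.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the startup-period rule; not Flink's actual classes. */
final class StartupAwareSlotMatcher {

    /** Simplified stand-in for a resource profile: CPU cores and memory only. */
    static final class Profile {
        final double cpuCores;
        final int memoryMb;

        Profile(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }

        /** True if a slot with this profile is large enough for the request. */
        boolean canHost(Profile request) {
            return cpuCores >= request.cpuCores && memoryMb >= request.memoryMb;
        }
    }

    enum Decision { FULFILLABLE, KEEP_PENDING, FAIL }

    private final long startMillis;
    private final long startupPeriodMillis;
    private final List<Profile> registeredSlots = new ArrayList<>();

    StartupAwareSlotMatcher(long startMillis, long startupPeriodMillis) {
        this.startMillis = startMillis;
        this.startupPeriodMillis = startupPeriodMillis;
    }

    /** Called when a TM registers one of its slots. */
    void registerSlot(Profile slot) {
        registeredSlots.add(slot);
    }

    /** Decides what to do with a request at time nowMillis. */
    Decision decide(Profile request, long nowMillis) {
        for (Profile slot : registeredSlots) {
            if (slot.canHost(request)) {
                return Decision.FULFILLABLE;
            }
        }
        // No registered slot is large enough. During the startup period more
        // TMs may still register, so the request is kept pending.
        boolean inStartupPeriod = nowMillis - startMillis < startupPeriodMillis;
        return inStartupPeriod ? Decision.KEEP_PENDING : Decision.FAIL;
    }
}
```

In this sketch, a request is re-evaluated by calling `decide` again, for example when the startup period ends or when a new slot registers. Time is passed in explicitly so the rule can be exercised without a real clock.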
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
