liubin101 opened a new pull request, #7997: URL: https://github.com/apache/hadoop/pull/7997
### Description of PR

Under the YARN Federation architecture, the following problem occurs when there is only one sub-cluster, or when multiple sub-clusters exist but a queue is configured to submit exclusively to one specific sub-cluster.

If the target sub-cluster fails over, its state remains "non-active" until the new active RM completes registration and sends its first heartbeat (by default, 30 seconds later). During this window, any client that attempts to submit an application fails with one of these errors:

- "No active SubCluster available to submit the request"
- "No positive weight found on active subclusters"

In non-Federation YARN setups, clients automatically retry during an RM failover. We expect this retry behavior to be preserved when transitioning to the YARN Federation architecture.

### How was this patch tested?

Unit test.
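To illustrate the retry behavior described above, here is a minimal, self-contained sketch. It is not the code in this patch: the class `SubmitRetrySketch`, the exception `NoActiveSubClusterException`, and the parameters `maxAttempts` and `sleepMillis` are all hypothetical names chosen for the example, and the fixed-delay policy is an assumption rather than the policy the patch uses.

```java
import java.util.concurrent.Callable;

/** Illustrative only: hypothetical names, not the classes changed by this PR. */
public class SubmitRetrySketch {

  /** Hypothetical stand-in for the "no active sub-cluster" submission failure. */
  public static class NoActiveSubClusterException extends Exception {
    public NoActiveSubClusterException(String message) {
      super(message);
    }
  }

  /**
   * Calls submit until it succeeds or maxAttempts attempts have been made,
   * sleeping sleepMillis between attempts. Only the hypothetical
   * NoActiveSubClusterException is retried; any other exception propagates.
   */
  public static <T> T submitWithRetry(Callable<T> submit, int maxAttempts,
      long sleepMillis) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    NoActiveSubClusterException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return submit.call();
      } catch (NoActiveSubClusterException e) {
        // The sub-cluster may still be failing over: wait, then try again.
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(sleepMillis);
        }
      }
    }
    // Every attempt failed; surface the last failure to the caller.
    throw last;
  }

  public static void main(String[] args) throws Exception {
    final int[] calls = {0};
    // Simulated submission: fails twice (failover window), then succeeds.
    String appId = submitWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new NoActiveSubClusterException(
            "No active SubCluster available to submit the request");
      }
      return "application_0001";
    }, 5, 10L);
    System.out.println(appId + " after " + calls[0] + " attempts");
  }
}
```

In a real deployment the total retry time would need to cover the failover window (the description mentions a default of 30 seconds before the first heartbeat), so the attempt count and delay here are placeholders.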
