qq619618919 opened a new pull request, #8026: URL: https://github.com/apache/hadoop/pull/8026
### Description of PR

JIRA: [YARN-11878](https://issues.apache.org/jira/browse/YARN-11878). AsyncDispatcher event queue backlog with millions of STATUS_UPDATE events

Avoid costly `ContainerStatusPBImpl.getCapability()` calls in `STATUS_UPDATE` handling when opportunistic containers are disabled.

### Background

This behavior was introduced by [YARN-11003](https://issues.apache.org/jira/browse/YARN-11003) to support the opportunistic containers optimization in the ResourceManager.

To implement that optimization, `StatusUpdateWhenHealthyTransition` calls `ContainerStatusPBImpl.getCapability()` during every `STATUS_UPDATE` event. This ensures container resource capability info is always available for scheduling decisions when opportunistic containers are enabled.

However, in clusters where **opportunistic containers are disabled**, retrieving `capability` in every `STATUS_UPDATE` is **unnecessary**, since the capability value is not used in most workflows.

### Currently

- **NodeManager heartbeat**: frequent `STATUS_UPDATE` events are sent to the ResourceManager.
- **Each STATUS_UPDATE processing**: triggers `ContainerStatusPBImpl.getCapability()`.
- **Problem**: Even when the opportunistic container feature is **off**, the same costly protobuf parsing and `ResourcePBImpl` object construction still happen for each event.

This leads to:

1. High CPU usage in the AsyncDispatcher event processing thread
2. Millions of repeated, unused protobuf parses in large clusters
3. Increased event queue latency and slower scheduling decisions

### Impact

In clusters with thousands of nodes, `STATUS_UPDATE` events can account for >90% of the AsyncDispatcher queue.

Profiling shows that `getCapability()` calls consume >90% of CPU time in `StatusUpdateWhenHealthyTransition.transition()` when opportunistic containers are disabled.

The overhead is **pure waste** under these conditions and can be entirely skipped.

### Proposed Changes

1. Skip the capability retrieval logic when `opportunisticContainersEnabled` is false.
2. Cache the result of `remoteContainer.getCapability()` in a local variable to prevent multiple protobuf parsing calls within the same `STATUS_UPDATE` handling.

### How was this patch tested?

CI

### For code changes:

- [x] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'YARN-11878. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
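
The two proposed changes can be illustrated with a minimal, self-contained sketch. It is not the actual patch and does not use the real YARN classes: `Capability`, `FakeContainerStatus`, `parseCount`, and `handleStatusUpdate` are hypothetical stand-ins. The only assumptions are that each `getCapability()` call costs one protobuf parse and that the capability is needed only when opportunistic containers are enabled.

```java
import java.util.ArrayList;
import java.util.List;

public class StatusUpdateSketch {

  /** Hypothetical stand-in for a parsed resource; not the real YARN Resource class. */
  static final class Capability {
    final long memoryMb;
    final int vcores;

    Capability(long memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }
  }

  /**
   * Hypothetical stand-in for ContainerStatusPBImpl. Every getCapability()
   * call builds a new object and bumps a counter, simulating a protobuf parse.
   */
  static final class FakeContainerStatus {
    static int parseCount = 0;

    private final long memoryMb;
    private final int vcores;

    FakeContainerStatus(long memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }

    Capability getCapability() {
      parseCount++;
      return new Capability(memoryMb, vcores);
    }
  }

  /**
   * Simplified STATUS_UPDATE handling. Returns the total memory of the
   * reported containers, or 0 when opportunistic containers are disabled.
   */
  static long handleStatusUpdate(List<FakeContainerStatus> statuses,
                                 boolean opportunisticContainersEnabled) {
    // Change 1: skip capability retrieval entirely when the feature is off.
    if (!opportunisticContainersEnabled) {
      return 0L;
    }
    long totalMemoryMb = 0L;
    for (FakeContainerStatus remoteContainer : statuses) {
      // Change 2: call getCapability() once and reuse the local variable.
      Capability capability = remoteContainer.getCapability();
      if (capability.memoryMb > 0 && capability.vcores > 0) {
        totalMemoryMb += capability.memoryMb;
      }
    }
    return totalMemoryMb;
  }

  public static void main(String[] args) {
    List<FakeContainerStatus> statuses = new ArrayList<>();
    statuses.add(new FakeContainerStatus(1024, 1));
    statuses.add(new FakeContainerStatus(2048, 2));

    handleStatusUpdate(statuses, false);
    System.out.println("parses with feature off: " + FakeContainerStatus.parseCount);

    handleStatusUpdate(statuses, true);
    System.out.println("parses with feature on: " + FakeContainerStatus.parseCount);
  }
}
```

In this sketch the disabled path performs no parses at all, and the enabled path performs exactly one parse per container even though the capability is read more than once.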
