qq619618919 opened a new pull request, #8026:
URL: https://github.com/apache/hadoop/pull/8026

   ### Description of PR
   JIRA: [YARN-11878](https://issues.apache.org/jira/browse/YARN-11878). 
AsyncDispatcher event queue backlog with millions of STATUS_UPDATE events
   
   Avoid costly ContainerStatusPBImpl.getCapability() calls in STATUS_UPDATE 
when Opportunistic containers are disabled
   
   ### Background
   This behavior was introduced by 
[YARN-11003](https://issues.apache.org/jira/browse/YARN-11003) to support 
the Opportunistic containers optimization in the ResourceManager.
   
   To implement that optimization, `StatusUpdateWhenHealthyTransition` calls 
`ContainerStatusPBImpl.getCapability()` during every `STATUS_UPDATE` event.
   This ensures that container resource capability info is always available 
for scheduling decisions when opportunistic containers are enabled.
   
   However, in clusters where **opportunistic containers are disabled**,
   retrieving `capability` in every `STATUS_UPDATE` becomes **unnecessary**,
   since the capability value is not used in most workflows.
   
   ### Currently
   - **NodeManager heartbeat**: sends frequent `STATUS_UPDATE` events to the 
ResourceManager.
   - **Each STATUS_UPDATE processing**: triggers 
`ContainerStatusPBImpl.getCapability()`.

   **Problem**: Even when the opportunistic container feature is **off**, the 
same costly protobuf parsing and `ResourcePBImpl` object construction still 
happen for each event. This leads to:
   1. High CPU usage in the AsyncDispatcher event processing thread
   2. Millions of repeated, unused protobuf parses in large clusters
   3. Increased event queue latency and slower scheduling decisions
   
   ### Impact
   In clusters with thousands of nodes, `STATUS_UPDATE` events can account for 
>90% of the AsyncDispatcher queue.
   Profiling shows that `getCapability()` calls consume >90% of CPU time in 
`StatusUpdateWhenHealthyTransition.transition()` when opportunistic containers 
are disabled.
   The overhead is **pure waste** under these conditions and can be entirely 
skipped.
   
   ### Proposed Changes
   1. Skip capability retrieval logic when `opportunisticContainersEnabled` is 
false.
   2. Cache `remoteContainer.getCapability()` result in a local variable to 
prevent multiple protobuf parsing calls within the same STATUS_UPDATE handling.
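
The two changes above can be illustrated with a minimal, self-contained sketch. It is not the actual patch: `StatusUpdateSketch`, `FakeContainerStatus`, `Capability`, and `handleStatusUpdate` are hypothetical stand-ins for the real RM classes, and the parse counter only simulates the cost that the PR attributes to `ContainerStatusPBImpl.getCapability()`.

```java
import java.util.Arrays;
import java.util.List;

final class StatusUpdateSketch {

  /** Hypothetical stand-in for a YARN resource (memory and vcores only). */
  static final class Capability {
    final long memoryMb;
    final int vcores;

    Capability(long memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }
  }

  /**
   * Hypothetical stand-in for ContainerStatusPBImpl. Every getCapability()
   * call builds a new object and bumps a counter, simulating a protobuf parse.
   */
  static final class FakeContainerStatus {
    private final long memoryMb;
    private final int vcores;
    private int parseCount = 0;

    FakeContainerStatus(long memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }

    Capability getCapability() {
      parseCount++;
      return new Capability(memoryMb, vcores);
    }

    int getParseCount() {
      return parseCount;
    }
  }

  /**
   * Sums the memory of valid containers, but only when the opportunistic
   * container feature is enabled. Returns 0 without touching any status
   * otherwise.
   */
  static long handleStatusUpdate(List<FakeContainerStatus> statuses,
                                 boolean opportunisticContainersEnabled) {
    // Change 1: skip all capability work when the feature is disabled.
    if (!opportunisticContainersEnabled) {
      return 0L;
    }
    long totalMemoryMb = 0L;
    for (FakeContainerStatus status : statuses) {
      // Change 2: call getCapability() once and reuse the local variable.
      Capability capability = status.getCapability();
      if (capability.memoryMb > 0 && capability.vcores > 0) {
        totalMemoryMb += capability.memoryMb;
      }
    }
    return totalMemoryMb;
  }

  public static void main(String[] args) {
    List<FakeContainerStatus> statuses = Arrays.asList(
        new FakeContainerStatus(2048, 2), new FakeContainerStatus(1024, 1));

    long disabled = handleStatusUpdate(statuses, false);
    System.out.println("disabled: total=" + disabled
        + " parses=" + statuses.get(0).getParseCount());

    long enabled = handleStatusUpdate(statuses, true);
    System.out.println("enabled: total=" + enabled
        + " parses=" + statuses.get(0).getParseCount());
  }
}
```

With the flag off, no simulated parse occurs. With it on, each status is parsed exactly once even though the capability is read several times.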
   
   
   ### How was this patch tested?
   CI
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'YARN-11878. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
