zhuqi-lucas opened a new pull request, #582:
URL: https://github.com/apache/yunikorn-k8shim/pull/582

   ### What is this PR for?
   
   Under some circumstances, it seems that placeholder allocations are being 
removed multiple times:
   
   ```
   2023-04-25T06:25:46.279Z     INFO    scheduler/partition.go:1233     replacing placeholder allocation {"appID": "spark-000000031tn2lgv2gar", "allocationId": "20a4cf77-7095-4635-b9e9-43a7564385c4"}
   ...
   2023-04-25T06:25:46.299Z     INFO    scheduler/partition.go:1233     replacing placeholder allocation {"appID": "spark-000000031tn2lgv2gar", "allocationId": "20a4cf77-7095-4635-b9e9-43a7564385c4"}
   ```
   
   
   This message appears only once in the codebase, in 
`PartitionContext.removeAllocation()`, and it is guarded by a check for 
`release.TerminationType == si.TerminationType_PLACEHOLDER_REPLACED`. Seeing it 
twice for the same `allocationId` therefore suggests that `removeAllocation()` 
is being called twice for that allocation. I believe this causes the used 
resources on the node to be subtracted twice for the same allocation, which 
quickly results in health checks failing:
   
   ```
   2023-04-25T06:26:10.632Z        WARN    scheduler/health_checker.go:176 Scheduler is not healthy        {"health check values": [..., {"Name":"Consistency of data","Succeeded":false,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: [\"ip-10-0-112-148.eu-central-1.compute.internal\"]"}, ...]}
   ```
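
To illustrate the suspected failure mode, here is a minimal sketch in Go. It is not the YuniKorn code: the `node` type, its fields, and the `removeUnguarded` / `removeGuarded` helpers are hypothetical names made up for this example, and the consistency check is a simplified invariant rather than the actual health check. It only shows how a duplicate release can skew a node's accounting, and how keying the subtraction on a tracked allocation ID would make the second call a no-op.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a scheduler node.
// It is not the YuniKorn node type.
type node struct {
	used   int            // running total of allocated resource units
	allocs map[string]int // allocation ID -> size of that allocation
}

func newNode() *node {
	return &node{allocs: make(map[string]int)}
}

func (n *node) add(id string, size int) {
	n.allocs[id] = size
	n.used += size
}

// removeUnguarded trusts the caller and subtracts on every call,
// so a duplicate release is subtracted twice.
func (n *node) removeUnguarded(id string, size int) {
	delete(n.allocs, id)
	n.used -= size
}

// removeGuarded subtracts only while the allocation is still tracked,
// which makes a duplicate release a no-op. It reports whether it acted.
func (n *node) removeGuarded(id string) bool {
	size, ok := n.allocs[id]
	if !ok {
		return false
	}
	delete(n.allocs, id)
	n.used -= size
	return true
}

// consistent checks a simplified invariant: the running total must
// equal the sum of the allocations that are still tracked.
func (n *node) consistent() bool {
	sum := 0
	for _, size := range n.allocs {
		sum += size
	}
	return sum == n.used
}

func main() {
	a := newNode()
	a.add("placeholder-1", 4)
	a.add("task-1", 2)
	a.removeUnguarded("placeholder-1", 4)
	a.removeUnguarded("placeholder-1", 4) // duplicate release
	fmt.Println("unguarded:", a.used, a.consistent())

	b := newNode()
	b.add("placeholder-1", 4)
	b.add("task-1", 2)
	b.removeGuarded("placeholder-1")
	b.removeGuarded("placeholder-1") // duplicate release is ignored
	fmt.Println("guarded:", b.used, b.consistent())
}
```

In this sketch the unguarded path ends with a negative running total that no longer matches the tracked allocations, while the guarded path stays consistent. Whether the real fix belongs in the removal path or in whatever triggers the second release is a separate question that this example does not answer.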
   
   ### What type of PR is it?
   * [ ] - Bug Fix
   * [ ] - Improvement
   * [ ] - Feature
   * [ ] - Documentation
   * [ ] - Hot Fix
   * [ ] - Refactoring
   
   ### Todos
   * [ ] - Task
   
   ### What is the Jira issue?
   * Open an issue on Jira https://issues.apache.org/jira/browse/YUNIKORN/
   * Put link here, and add [YUNIKORN-*Jira number*] in PR title, eg. 
`[YUNIKORN-2] Gang scheduling interface parameters`
   
   ### How should this be tested?
   
   ### Screenshots (if appropriate)
   
   ### Questions:
   * [ ] - The license files need an update.
   * [ ] - There are breaking changes for older versions.
   * [ ] - It needs documentation.
   

