pbacsko commented on code in PR #933:
URL: https://github.com/apache/yunikorn-core/pull/933#discussion_r1707022863


##########
pkg/scheduler/objects/queue.go:
##########
@@ -1017,17 +1016,18 @@ func (sq *Queue) IncAllocatedResource(alloc 
*resources.Resource, nodeReported bo
        sq.Lock()
        defer sq.Unlock()
        // all OK update this queue
-       sq.allocatedResource = newAllocated
+       sq.allocatedResource = resources.Add(sq.allocatedResource, alloc)
        sq.updateAllocatedResourceMetrics()
        return nil
 }
 
+// allocatedResFits adds the passed in resource to the allocatedResource of 
the queue and checks if it still fits in the
+// queues' maximum. If the resource fits it returns true otherwise false.
 // small helper method to access sq.maxResource+sq.allocatedResource and avoid 
Clone() call
-func (sq *Queue) allocatedResFits(res *resources.Resource) (bool, 
*resources.Resource) {
+func (sq *Queue) allocatedResFits(res *resources.Resource) bool {
        sq.RLock()
        defer sq.RUnlock()
-       newAllocated := resources.Add(sq.allocatedResource, res)

Review Comment:
   I'm thinking about this...
   
   This is the error message which Craig reproduced:
   
   ```
   2024-07-26T14:42:30.364Z    DPANIC    core.scheduler.application    
objects/application.go:1496    queue update failed unexpectedly    {"error": 
"allocation (map[pods:1]) puts queue 'root' over maximum allocation 
(map[[devices.kubevirt.io/kvm:0](http://devices.kubevirt.io/kvm:0) 
[devices.kubevirt.io/tun:0](http://devices.kubevirt.io/tun:0) 
[devices.kubevirt.io/vhost-net:0](http://devices.kubevirt.io/vhost-net:0) 
ephemeral-storage:744815809367 hugepages-1Gi:0 hugepages-2Mi:0 
memory:1115240587264 [nvidia.com/gpu:0](http://nvidia.com/gpu:0) pods:660 
vcore:320000]), current usage (map[memory:314572800 
[nvidia.com/gpu:1](http://nvidia.com/gpu:1) pods:23 vcore:200])"}
   ```
   
   It says `[nvidia.com/gpu:0]` and this is printed from 
`sq.allocatedResource`, which means that this resource type is actually 
present. That is, if we want to prevent this bug, this change won't solve it 
because both `AddOnlyExisting()` and `FitInMaxUnder()` will consider the gpu 
type. Am I missing something here?
   
   I haven't run any code locally, just trying to have a good understanding by 
simply looking at it :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to