pbacsko commented on code in PR #933:
URL: https://github.com/apache/yunikorn-core/pull/933#discussion_r1706995762


##########
pkg/common/resources/resources.go:
##########
@@ -364,6 +364,21 @@ func (r *Resource) SubOnlyExisting(delta *Resource) {
        }
 }
 
+// AddOnlyExisting adds delta to defined resource.
+// Ignore any type not defined in the base resource (ie receiver).
+func (r *Resource) AddOnlyExisting(delta *Resource) *Resource {

Review Comment:
   +1 I think this new method would be the only one which is defined on 
`*Resource` and also returns something, so it's inconsistent with the existing 
approaches.



##########
pkg/scheduler/objects/queue.go:
##########
@@ -1017,17 +1016,18 @@ func (sq *Queue) IncAllocatedResource(alloc 
*resources.Resource, nodeReported bo
        sq.Lock()
        defer sq.Unlock()
        // all OK update this queue
-       sq.allocatedResource = newAllocated
+       sq.allocatedResource = resources.Add(sq.allocatedResource, alloc)
        sq.updateAllocatedResourceMetrics()
        return nil
 }
 
+// allocatedResFits adds the passed in resource to the allocatedResource of 
the queue and checks if it still fits in the
+// queues' maximum. If the resource fits it returns true otherwise false.
 // small helper method to access sq.maxResource+sq.allocatedResource and avoid 
Clone() call
-func (sq *Queue) allocatedResFits(res *resources.Resource) (bool, 
*resources.Resource) {
+func (sq *Queue) allocatedResFits(res *resources.Resource) bool {
        sq.RLock()
        defer sq.RUnlock()
-       newAllocated := resources.Add(sq.allocatedResource, res)

Review Comment:
   I'm thinking about this...
   
   This is the error message which Craig reproduced:
   
   ```
   2024-07-26T14:42:30.364Z    DPANIC    core.scheduler.application    
objects/application.go:1496    queue update failed unexpectedly    {"error": 
"allocation (map[pods:1]) puts queue 'root' over maximum allocation 
(map[[devices.kubevirt.io/kvm:0](http://devices.kubevirt.io/kvm:0) 
[devices.kubevirt.io/tun:0](http://devices.kubevirt.io/tun:0) 
[devices.kubevirt.io/vhost-net:0](http://devices.kubevirt.io/vhost-net:0) 
ephemeral-storage:744815809367 hugepages-1Gi:0 hugepages-2Mi:0 
memory:1115240587264 [nvidia.com/gpu:0](http://nvidia.com/gpu:0) pods:660 
vcore:320000]), current usage (map[memory:314572800 
[nvidia.com/gpu:1](http://nvidia.com/gpu:1) pods:23 vcore:200])"}
   ```
   
   It says `[nvidia.com/gpu:0]` and this is printed from 
`sq.allocatedResource`, which means that this resource type is actually 
present. That is, if we want to prevent this bug, this change won't solve it 
because both `AddOnlyExisting()` and `FitInMaxUnder()` will consider the gpu 
type. Am I missing something here?
   
   I haven't run any code just trying to have a good understanding simply by 
looking at it :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to