Alex Ovchenkov created YUNIKORN-3231:
----------------------------------------

             Summary: ForeignAllocations ignored / not accounted when pods use 
nodeSelector (works with nodeAffinity)
                 Key: YUNIKORN-3231
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3231
             Project: Apache YuniKorn
          Issue Type: Bug
         Environment: 
Kubernetes: 1.26
YuniKorn scheduler: 1.8.0

            Reporter: Alex Ovchenkov


h2. Description

We observed that YuniKorn’s *ForeignAllocations* (allocations for pods that 
are not managed by YuniKorn) may not be accounted for correctly when such pods 
use {{nodeSelector}}. As a result, YuniKorn may overestimate the available 
node/queue resources and make scheduling decisions as if those external pods 
did not exist (or were not tied to the expected nodes).

When we replace {{nodeSelector}} with an equivalent 
{{nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution}}, the behavior 
becomes correct and ForeignAllocations are accounted for as expected.

h2. Steps to reproduce
h5. 1) Create a pod that is scheduled by the default Kubernetes scheduler (or another scheduler) and uses nodeSelector:

{code:yaml}
apiVersion: v1
kind: Pod
metadata:
  name: foreign-pod-selector
  namespace: <ns>
spec:
  schedulerName: default-scheduler
  nodeSelector:
    dedicated: test
  containers:
    - name: c
      image: busybox
      command: ["sh", "-c", "sleep 360000"]
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
{code}
h5. 2) Observe YuniKorn resource accounting / foreign allocations

Check YuniKorn UI / REST API / state dump and verify whether this pod appears 
as a foreign allocation and whether its resources are deducted from the node.
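
The REST check in this step can be scripted. The following is a minimal sketch, assuming the node list returned by the scheduler REST API (for example {{/ws/v1/partition/default/nodes}}) exposes a {{foreignAllocations}} array per node with {{allocationKey}} and {{resource}} fields. Those field names, the {{default}} partition name, and the sample payload are assumptions for illustration, not output captured from a real cluster.

```python
import json


def foreign_allocations_by_node(nodes_json: str) -> dict:
    """Map each node ID to its foreign allocations as (allocationKey, resource) pairs.

    Field names (nodeID, foreignAllocations, allocationKey, resource) are
    assumptions about the response shape; adjust them to the actual payload.
    """
    result = {}
    for node in json.loads(nodes_json):
        entries = node.get("foreignAllocations") or []
        result[node.get("nodeID", "<unknown>")] = [
            (entry.get("allocationKey"), entry.get("resource", {}))
            for entry in entries
        ]
    return result


# Hypothetical, hand-written payload shaped like the assumed response.
SAMPLE_NODES = """
[
  {
    "nodeID": "node-a",
    "foreignAllocations": [
      {"allocationKey": "pod-uid-1",
       "resource": {"vcore": 1000, "memory": 1073741824}}
    ]
  },
  {"nodeID": "node-b"}
]
"""

if __name__ == "__main__":
    for node_id, allocs in foreign_allocations_by_node(SAMPLE_NODES).items():
        print(node_id, allocs)
```

To use it against a live scheduler, save the node list to a file (for example with curl) and pass the file contents to {{foreign_allocations_by_node}}. A node that hosts the foreign pod but shows an empty list would match the behavior reported here.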
h5. 3) Replace nodeSelector with equivalent nodeAffinity (requiredDuringScheduling)

Delete the pod and apply the same pod spec with nodeAffinity instead of 
nodeSelector:
{code:yaml}
apiVersion: v1
kind: Pod
metadata:
  name: foreign-pod-affinity
  namespace: <ns>
spec:
  schedulerName: default-scheduler
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated
                operator: In
                values: ["test"]
  containers:
    - name: c
      image: busybox
      command: ["sh", "-c", "sleep 360000"]
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
{code}
Re-check YuniKorn accounting / foreign allocations.
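
To make the claimed equivalence between the two pod specs explicit, here is a small sketch that converts a {{nodeSelector}} map into the matching required node affinity. It relies on the Kubernetes rule that every {{key: value}} pair in {{nodeSelector}} must match, which corresponds to one {{In}} expression per pair inside a single {{nodeSelectorTerms}} entry. The function name is hypothetical and is not part of YuniKorn or Kubernetes.

```python
def node_selector_to_required_affinity(node_selector: dict) -> dict:
    """Build the affinity block equivalent to a nodeSelector map.

    All expressions go into one term because expressions within a term are
    ANDed, which matches nodeSelector semantics.
    """
    expressions = [
        {"key": key, "operator": "In", "values": [value]}
        for key, value in sorted(node_selector.items())
    ]
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{"matchExpressions": expressions}]
            }
        }
    }


if __name__ == "__main__":
    print(node_selector_to_required_affinity({"dedicated": "test"}))
```

For {{dedicated: test}} this produces the same structure as the {{affinity}} block in the second pod spec, so both pods should be eligible for exactly the same nodes.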
----
h2. Actual behavior
 * With *nodeSelector*, the foreign pod is *not consistently* represented in 
ForeignAllocations / node accounting: it is either missing, appears without 
the correct node association, or its resources are not deducted.
 * This can lead YuniKorn to make scheduling decisions that assume more 
resources are available on the affected node(s) than actually are.

----
h2. Expected behavior
 * Foreign pods should be accounted for consistently regardless of whether the 
workload uses:
 ** {{spec.nodeSelector}}, or
 ** {{spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution}}
 * If the pod is bound to a node (and consuming resources), YuniKorn should 
reflect it in foreign allocations / node resource usage.
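
The expectation above can be stated as a simple predicate. This is only a model of the expected outcome, not YuniKorn's implementation: it assumes a pod counts as foreign when its {{schedulerName}} is not {{yunikorn}}, it has a {{spec.nodeName}}, and it is not in a terminal phase. The function name and these criteria are assumptions made for illustration.

```python
TERMINAL_PHASES = {"Succeeded", "Failed"}


def expected_foreign(pod: dict, yunikorn_scheduler_name: str = "yunikorn") -> bool:
    """Return True if the pod should be counted as a foreign allocation.

    Deliberately ignores spec.nodeSelector and spec.affinity: only the
    scheduler name, the node binding, and the phase matter in this model.
    """
    spec = pod.get("spec", {})
    status = pod.get("status", {})
    if spec.get("schedulerName") == yunikorn_scheduler_name:
        return False  # managed by YuniKorn, so not foreign
    if not spec.get("nodeName"):
        return False  # not bound to a node yet
    return status.get("phase") not in TERMINAL_PHASES


# Two hypothetical bound pods that differ only in how the constraint is written.
selector_pod = {
    "spec": {"schedulerName": "default-scheduler", "nodeName": "node-a",
             "nodeSelector": {"dedicated": "test"}},
    "status": {"phase": "Running"},
}
affinity_pod = {
    "spec": {"schedulerName": "default-scheduler", "nodeName": "node-a",
             "affinity": {"nodeAffinity": {}}},
    "status": {"phase": "Running"},
}

if __name__ == "__main__":
    print(expected_foreign(selector_pod), expected_foreign(affinity_pod))
```

Under this model both pods give the same answer, which is the consistency this report asks for.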

----
h2. Notes / hypothesis

Based on initial investigation, the issue might be related to how foreign pods 
are processed in {{updateForeignPod}}:

[https://github.com/apache/yunikorn-k8shim/blob/master/pkg/cache/context.go#L412]

It appears that when a pod uses {{nodeSelector}}, some of its 
scheduling-related state (possibly node assignment or allocation state) may 
already be set earlier in the reconciliation flow, which could cause 
{{updateForeignPod}} to skip the pod or treat it differently.

In contrast, when equivalent constraints are expressed using 
{{nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution}}, the foreign 
allocation seems to be tracked and accounted for correctly.

This suggests there may be a difference in how {{nodeSelector}} and 
{{nodeAffinity}} are interpreted or propagated into the internal foreign 
allocation tracking logic.

We may be missing some detail in the lifecycle of foreign pod updates, but the 
observable behavior indicates inconsistent accounting depending on whether 
{{nodeSelector}} or {{nodeAffinity}} is used.
----

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
