Alex Ovchenkov created YUNIKORN-3231:
----------------------------------------
Summary: ForeignAllocations ignored / not accounted when pods use
nodeSelector (works with nodeAffinity)
Key: YUNIKORN-3231
URL: https://issues.apache.org/jira/browse/YUNIKORN-3231
Project: Apache YuniKorn
Issue Type: Bug
Environment:
Kubernetes: 1.26
YuniKorn scheduler: 1.8.0
Reporter: Alex Ovchenkov
h2. Description
We observed that YuniKorn’s *ForeignAllocations* (pods not managed by YuniKorn,
i.e. allocations coming from outside) may not be accounted for correctly when
such pods use {*}nodeSelector{*}. As a result, YuniKorn may overestimate the
resources available on a node or queue and make scheduling decisions as if
those external pods did not exist (or were not tied to the expected nodes).
When we replace {{nodeSelector}} with an equivalent
{{{}nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution{}}}, the
behavior becomes correct and ForeignAllocations are accounted for as expected.
h2. Steps to reproduce
h5. 1) Create a pod that is scheduled by the Kubernetes default scheduler (or another scheduler) and uses nodeSelector:
{code:yaml}
apiVersion: v1
kind: Pod
metadata:
  name: foreign-pod-selector
  namespace: <ns>
spec:
  schedulerName: default-scheduler
  nodeSelector:
    dedicated: test
  containers:
    - name: c
      image: busybox
      command: ["sh", "-c", "sleep 360000"]
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
{code}
h5. 2) Observe YuniKorn resource accounting / foreign allocations
Check YuniKorn UI / REST API / state dump and verify whether this pod appears
as a foreign allocation and whether its resources are deducted from the node.
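The check in this step can be scripted. The following is a minimal sketch, not part of the original report: it assumes the node list has already been fetched from the scheduler REST API (for example from {{/ws/v1/partition/default/nodes}}) and parsed as JSON, and that each node entry carries a {{nodeID}} and an optional {{foreignAllocations}} list. Those key names and the sample data are assumptions; the helper name is hypothetical.

```python
import json


def nodes_reporting_foreign_pod(nodes, pod_name):
    """Return the IDs of nodes whose foreign-allocation list mentions pod_name.

    `nodes` is the parsed JSON node list. The keys "nodeID" and
    "foreignAllocations" are assumptions about the response shape.
    """
    matches = []
    for node in nodes:
        for alloc in node.get("foreignAllocations") or []:
            # Match on the serialized entry so the check does not depend on
            # which field or tag carries the pod name.
            if pod_name in json.dumps(alloc):
                matches.append(node.get("nodeID"))
                break
    return matches


# Made-up sample data for illustration only; not real scheduler output.
sample = json.loads("""
[
  {"nodeID": "node-a", "foreignAllocations": [
      {"allocationKey": "uid-1",
       "allocationTags": {"podName": "foreign-pod-affinity"}}]},
  {"nodeID": "node-b"}
]
""")

print(nodes_reporting_foreign_pod(sample, "foreign-pod-affinity"))
print(nodes_reporting_foreign_pod(sample, "foreign-pod-selector"))
```

An empty result for a pod that is bound and running would correspond to the problem described in this issue.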
h5. 3) Replace nodeSelector with equivalent nodeAffinity (requiredDuringScheduling)
Delete the pod and apply the same pod but with nodeAffinity instead:
{code:yaml}
apiVersion: v1
kind: Pod
metadata:
  name: foreign-pod-affinity
  namespace: <ns>
spec:
  schedulerName: default-scheduler
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated
                operator: In
                values: ["test"]
  containers:
    - name: c
      image: busybox
      command: ["sh", "-c", "sleep 360000"]
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
{code}
Re-check YuniKorn accounting / foreign allocations.
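To make "equivalent" concrete, the sketch below derives the nodeAffinity stanza from a nodeSelector map. It is illustrative only: the function name is hypothetical, and it relies on the Kubernetes rule that all entries of a nodeSelector must match, which corresponds to putting every expression into a single nodeSelectorTerm.

```python
def node_selector_to_required_affinity(node_selector):
    """Build the nodeAffinity stanza equivalent to a nodeSelector map.

    Every key/value pair becomes an "In" expression with a single value.
    All expressions go into one term, so they are ANDed together, the same
    way nodeSelector entries are.
    """
    expressions = [
        {"key": key, "operator": "In", "values": [value]}
        for key, value in sorted(node_selector.items())
    ]
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{"matchExpressions": expressions}]
            }
        }
    }


affinity = node_selector_to_required_affinity({"dedicated": "test"})
print(affinity)
```

For the selector used in step 1, this yields the same affinity block as the pod spec in step 3.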
----
h2. Actual behavior
* With {*}nodeSelector{*}, the foreign pod is *not consistently* represented
in ForeignAllocations / node accounting (or it appears without the correct
node association, or its resources are not deducted).
* This can lead YuniKorn to make scheduling decisions that assume more
resources are available on the affected node(s) than is actually the case.
----
h2. Expected behavior
* Foreign pods should be accounted for consistently regardless of whether the
workload uses:
** {{{}spec.nodeSelector{}}}, or
** {{spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution}}
* If the pod is bound to a node (and consuming resources), YuniKorn should
reflect it in foreign allocations / node resource usage.
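The expectation above can be written as a small predicate over a pod object. This is a sketch of the expected invariant only, not YuniKorn's implementation: the function name is hypothetical, and {{yunikorn}} as the scheduler name is an assumption about the deployment. The point is that the result depends on the scheduler name, the node binding and the pod phase, and never on how node constraints are expressed.

```python
TERMINAL_PHASES = {"Succeeded", "Failed"}


def should_count_as_foreign(pod, yunikorn_scheduler="yunikorn"):
    """Return True if the pod is expected to show up as a foreign allocation.

    `pod` is a pod manifest as a dict. The scheduler name "yunikorn" is an
    assumption. nodeSelector and affinity are deliberately not inspected.
    """
    spec = pod.get("spec", {})
    status = pod.get("status", {})
    # Kubernetes defaults an unset schedulerName to "default-scheduler".
    scheduler = spec.get("schedulerName", "default-scheduler")
    managed_elsewhere = scheduler != yunikorn_scheduler
    bound = bool(spec.get("nodeName"))
    still_consuming = status.get("phase") not in TERMINAL_PHASES
    return managed_elsewhere and bound and still_consuming


# Hypothetical bound pod that uses nodeSelector, as in step 1.
with_selector = {
    "spec": {
        "schedulerName": "default-scheduler",
        "nodeName": "node-a",
        "nodeSelector": {"dedicated": "test"},
    },
    "status": {"phase": "Running"},
}
print(should_count_as_foreign(with_selector))
```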
----
h2. Notes / hypothesis
Based on initial investigation, the issue might be related to how foreign pods
are processed in {{{}updateForeignPod{}}}:
[https://github.com/apache/yunikorn-k8shim/blob/master/pkg/cache/context.go#L412]
It appears that when a pod uses {{{}nodeSelector{}}}, some of its
scheduling-related state (possibly node assignment or allocation state) may
already be set earlier in the reconciliation flow, which could cause
{{updateForeignPod}} to skip or treat it differently.
In contrast, when equivalent constraints are expressed using
{{{}nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution{}}}, the
foreign allocation seems to be tracked and accounted correctly.
This suggests there may be a difference in how {{nodeSelector}} and
{{nodeAffinity}} are interpreted or propagated into the internal foreign
allocation tracking logic.
We may be missing some detail in the lifecycle of foreign pod updates, but the
observable behavior indicates inconsistent accounting depending on whether
{{nodeSelector}} or {{nodeAffinity}} is used.
----