Alex Ovchenkov created YUNIKORN-3230:
----------------------------------------
Summary: YuniKorn does not fail application when placeholder
creation is rejected by LimitRange
Key: YUNIKORN-3230
URL: https://issues.apache.org/jira/browse/YUNIKORN-3230
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - kubernetes
Environment: YuniKorn version: 1.8.0
Spark Operator version: 2.3.0
Gang scheduling: enabled
LimitRange configuration
Example:
apiVersion: v1
kind: LimitRange
metadata:
  name: spark-limit
spec:
  limits:
  - type: Pod
    max:
      cpu: "7"
      memory: 7Gi
Reporter: Alex Ovchenkov
We are using YuniKorn with gang scheduling (placeholders enabled) together with
Kubernetes {{LimitRange}}.
{{LimitRange}} is configured intentionally to reject Pods that exceed the
maximum resources allowed per node (fail-fast behavior).
When a Spark application is submitted:
* Driver pod is created successfully.
* YuniKorn tries to create placeholder pods for executors.
* Placeholder creation is rejected by the Kubernetes admission controller due to
{{LimitRange}}.
Expected behavior:
* The Spark application (YuniKorn application) should transition to a FAILED
state.
* Driver pod should not continue running indefinitely.
Actual behavior:
* Placeholder creation fails with a {{Forbidden}} error.
* Driver pod continues running.
* Application remains in an inconsistent state (neither fully running nor
properly failed).
ERROR shim.cache.placeholder cache/placeholder_manager.go:92
failed to create placeholder pod
{"error": "pods \"tg-spark-...\" is forbidden:
[maximum cpu usage per Pod is 7, but limit is 8,
maximum memory usage per Pod is 7Gi, but limit is 12025069568]"}
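The numbers in the log line above can be checked against the {{LimitRange}} maximums by converting both sides to the same unit. The following Go sketch does only that arithmetic; the helper name {{exceeds}} and the constants are illustrative and are not taken from YuniKorn or Kubernetes code.

```go
package main

import "fmt"

// Binary unit multipliers, as used by Kubernetes quantities such as "7Gi".
const (
	mi = int64(1) << 20
	gi = int64(1) << 30
)

// exceeds reports whether a requested limit is above the configured maximum.
// Hypothetical helper for illustration only.
func exceeds(limit, max int64) bool {
	return limit > max
}

func main() {
	maxMem := 7 * gi               // LimitRange max memory: 7Gi
	limitMem := int64(12025069568) // memory limit reported in the log, in bytes

	fmt.Printf("max memory:   %d bytes\n", maxMem)
	fmt.Printf("limit memory: %d bytes (%d MiB)\n", limitMem, limitMem/mi)
	fmt.Println("memory exceeds max:", exceeds(limitMem, maxMem))

	// CPU values from the log: limit 8 against a max of 7.
	fmt.Println("cpu exceeds max:", exceeds(8, 7))
}
```

Both the CPU and the memory limit of the placeholder are above the per-Pod maximum, so the rejection is deterministic for this pod spec and retrying the same request would not be expected to succeed.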
h3. Expected behavior
If placeholder creation fails with a permanent Kubernetes error (e.g.
{{Forbidden}}):
* YuniKorn application should transition to FAILED.
* Driver pod should be cleaned up (or at least the application should be
marked as failed).
* No partial execution should continue.
----
h3. Proposed behavior
When {{create placeholder pod}} returns a non-retryable error (e.g. HTTP 403
Forbidden):
* Treat this as a terminal application failure.
* Move the application state to FAILED.
* Emit a clear event explaining the reason.
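
The proposed behavior could be sketched roughly as follows. This is a minimal, self-contained Go illustration and not the YuniKorn implementation: {{statusError}}, {{isTerminal}}, {{nextState}}, and the state names are hypothetical stand-ins. In the real shim the check would more likely use {{apierrors.IsForbidden}} from {{k8s.io/apimachinery/pkg/api/errors}}, and the state change would go through the application state machine.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// statusError is a hypothetical stand-in for a Kubernetes API error
// that carries an HTTP status code.
type statusError struct {
	Code    int
	Message string
}

func (e *statusError) Error() string { return e.Message }

// isTerminal reports whether a placeholder creation error should be
// treated as non-retryable. Only 403 Forbidden is covered here, since
// that is the case described in this issue.
func isTerminal(err error) bool {
	var se *statusError
	if !errors.As(err, &se) {
		return false // unknown errors stay on the normal retry path
	}
	return se.Code == http.StatusForbidden
}

// appState is a simplified, hypothetical application state.
type appState string

const (
	stateRunning appState = "Running"
	stateFailed  appState = "Failed"
)

// nextState returns the state the application should move to after a
// placeholder creation attempt, plus a reason suitable for an event.
func nextState(current appState, err error) (appState, string) {
	if err != nil && isTerminal(err) {
		return stateFailed, fmt.Sprintf("placeholder creation rejected: %v", err)
	}
	return current, ""
}

func main() {
	err := fmt.Errorf("create placeholder: %w", &statusError{
		Code:    http.StatusForbidden,
		Message: "pods is forbidden: maximum cpu usage per Pod is 7, but limit is 8",
	})
	state, reason := nextState(stateRunning, err)
	fmt.Println("state:", state)
	fmt.Println("event:", reason)
}
```

Keeping the classification in one small function makes it easy to extend the set of terminal errors later without touching the state transition logic.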