Alex Ovchenkov created YUNIKORN-3230:
----------------------------------------

             Summary: YuniKorn does not fail application when placeholder 
creation is rejected by LimitRange
                 Key: YUNIKORN-3230
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3230
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
         Environment: YuniKorn version: 1.8.0
Spark Operator version: 2.3.0
Gang scheduling: enabled

LimitRange configuration

Example:

apiVersion: v1
kind: LimitRange
metadata:
  name: spark-limit
spec:
  limits:
    - type: Pod
      max:
        cpu: "7"
        memory: 7Gi
            Reporter: Alex Ovchenkov


We are using YuniKorn with gang scheduling (placeholders enabled) together with 
a Kubernetes {{LimitRange}}.

{{LimitRange}} is configured intentionally to reject Pods that exceed the 
maximum resources allowed per node (fail-fast behavior).

When a Spark application is submitted:
 * Driver pod is created successfully.

 * YuniKorn tries to create placeholder pods for executors.

 * Placeholder creation is rejected by the Kubernetes admission controller due 
to the {{LimitRange}}.

Expected behavior:
 * The Spark application (YuniKorn application) should transition to a FAILED 
state.

 * Driver pod should not continue running indefinitely.

Actual behavior:
 * Placeholder creation fails with a {{Forbidden}} error.

 * Driver pod continues running.

 * Application remains in an inconsistent state (neither fully running nor 
properly failed).

ERROR shim.cache.placeholder cache/placeholder_manager.go:92
failed to create placeholder pod
{"error": "pods \"tg-spark-...\" is forbidden:
[maximum cpu usage per Pod is 7, but limit is 8,
maximum memory usage per Pod is 7Gi, but limit is 12025069568]"}
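
For readers checking the numbers in the log: 7Gi is 7 * 2^30 = 7516192768 bytes, while the placeholder requests a memory limit of 12025069568 bytes (11468 MiB), and its CPU limit of 8 exceeds the maximum of 7. The sketch below only illustrates that comparison. It is not the Kubernetes admission code, and the type and function names ({{podMax}}, {{violations}}) are hypothetical.

```go
package main

import "fmt"

// gi is one gibibyte in bytes.
const gi = int64(1) << 30

// podMax is a hypothetical, simplified mirror of the "max" section of the
// LimitRange shown in the report (type: Pod).
type podMax struct {
	cpuMilli    int64 // CPU in millicores ("7" == 7000m)
	memoryBytes int64 // memory in bytes (7Gi == 7 * 2^30)
}

// violations returns one message per resource whose pod-level limit
// exceeds the configured maximum.
func violations(maxima podMax, cpuMilli, memoryBytes int64) []string {
	var out []string
	if cpuMilli > maxima.cpuMilli {
		out = append(out, fmt.Sprintf("cpu: limit %dm exceeds max %dm", cpuMilli, maxima.cpuMilli))
	}
	if memoryBytes > maxima.memoryBytes {
		out = append(out, fmt.Sprintf("memory: limit %d exceeds max %d", memoryBytes, maxima.memoryBytes))
	}
	return out
}

func main() {
	maxima := podMax{cpuMilli: 7000, memoryBytes: 7 * gi}
	// Values taken from the error message: cpu limit 8, memory limit 12025069568.
	for _, v := range violations(maxima, 8000, 12025069568) {
		fmt.Println(v)
	}
}
```

Both resources exceed the maximum, which matches the two reasons listed in the {{Forbidden}} message.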
 
h3. Expected behavior

If placeholder creation fails with a permanent Kubernetes error (e.g. 
{{Forbidden}}):
 * YuniKorn application should transition to FAILED.

 * Driver pod should be cleaned up (or at least the application should be 
marked as failed).

 * No partial execution should continue.

----
h3. Proposed behavior

When {{create placeholder pod}} returns a non-retryable error (e.g. HTTP 403 
Forbidden):
 * Treat this as terminal application failure.

 * Move application state to FAILED.

 * Emit clear event explaining the reason.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
