[ https://issues.apache.org/jira/browse/YUNIKORN-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaobao Wu updated YUNIKORN-2940:
---------------------------------
Description:
*environment information*
* resourceQuota: 4 CPU / 4 GiB (see the quota sketch after this list)
* driver / executor Pod: 1 CPU / 1 GiB
* driver / executor placeholder (ph) Pod (defined in task-groups): 1 CPU / 1 GiB
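For reproduction, the namespace quota presumably looks like the manifest below. The quota name compute-resources is taken from the error log in the analysis section; the exact manifest shape is my assumption:
{code:java}
# Assumed ResourceQuota reproducing the 4 CPU / 4Gi limits reported in the error log:
kubectl apply -n spark-my-test -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 4Gi
EOF
{code}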
*issue description*
In the environment above, I submitted a Spark job with the following spark-submit command:
{code:java}
/opt/spark/bin/spark-submit --master k8s://https://127.0.0.1:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.namespace=spark-my-test \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.dynamicAllocation.minExecutors=10 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=600m \
  --conf spark.driver.cores=1 \
  --conf spark.driver.memory=600m \
  --conf spark.app.id={{APP_ID}} \
  --conf spark.ui.port=14040 \
  --conf spark.kubernetes.driver.limit.cores=1 \
  --conf spark.kubernetes.executor.limit.cores=1 \
  --conf spark.kubernetes.container.image=apache/spark:v3.3.0 \
  --conf spark.kubernetes.scheduler.name=yunikorn \
  --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name=spark-driver \
  --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups='[{"name": "spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory": "1Gi"} }, {"name": "spark-executor", "minMember": 10, "minResource": {"cpu": "1", "memory": "1Gi"} }]' \
  --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/schedulingPolicyParameters='placeholderTimeoutInSeconds=30 gangSchedulingStyle=Hard' \
  --conf spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name=spark-executor \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar 10000
{code}
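Note that spark.dynamicAllocation.minExecutors=10 matches minMember of the spark-executor task group. As a quick sanity check on the gang size (assuming jq is available), the task-groups annotation requests 1 + 10 = 11 placeholder Pods at 1 CPU / 1Gi each, i.e. 11 CPU / 11Gi in total, already far above the 4 CPU / 4Gi quota:
{code:java}
# Total placeholder count requested by the task-groups annotation above:
echo '[{"name": "spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory": "1Gi"}},
       {"name": "spark-executor", "minMember": 10, "minResource": {"cpu": "1", "memory": "1Gi"}}]' \
  | jq '[.[].minMember] | add'   # => 11 placeholders, each 1 CPU / 1Gi
{code}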
After running this job, I found that a placeholder Pod (i.e. a tg-* Pod, shown in the picture below) still exists on K8S.
!http://www.kdocs.cn/api/v3/office/copy/NjNBZFlyNDdCMXRSZEp0cEdTYVVocE94MkY3OVZzTHNMM2oyWFQ0ZVY2K0x6eE9qNTNDMDFzN3Z3QzA1ZCtrdEdwbU9FNm5xUHI4cTZKTDV6dnYvWHFjZUZlYjJMZS9UYXBrMERSVWYxNkhhd0pycnNkVEtxblh0d212K3dQZ0o0eXB6VWVEanlJbTRnSGlYVG12YUNrbS9tRnZzMkNneU82aGNWZzNIYmNVQmlnbmlVZ0VNS1lJZ0NNQzBSKzYwbGJ5SVd5MXFwSjhZUFllb2Rwc0Q1UCtwMlh4WkljSWxQN2FEczVBODhRdk5pSlVOcVllZVNjaklWVGFhQ0paaC9DZUpXS1hDRldrPQ==/attach/object/4BQON4Y3AAADQ?|width=664!
I find it very strange that the placeholder Pod still exists even though the job has already completed.
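The leftover placeholder can also be confirmed without the screenshot; assuming kubectl access to the namespace, the tg- name prefix of YuniKorn placeholder Pods (visible in the log below) is enough to filter:
{code:java}
# List leftover YuniKorn placeholder pods by their tg- name prefix:
kubectl get pods -n spark-my-test | grep tg-
# A tg-spark-...-spark-executor-* pod stuck in Pending confirms the issue.
{code}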
*issue analysis*
Looking at the shim log, I found the following key entries:
{code:java}
2024-10-21T20:50:19.868+0800  INFO   shim.cache.placeholder  cache/placeholder_manager.go:99   placeholder created  {"placeholder": "appID: spark-96aae620780e4b40a59893d850e8aad3, taskGroup: spark-executor, podName: spark-my-test/tg-spark-96aae620780e4b40a59893-spark-executor-90qrmfkytv"}
2024-10-21T20:50:19.880+0800  ERROR  shim.cache.placeholder  cache/placeholder_manager.go:95   failed to create placeholder pod  {"error": "pods \"tg-spark-96aae620780e4b40a59893-spark-executor-zkpqzmw308\" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=1,limits.memory=1Gi,requests.cpu=1,requests.memory=1Gi, used: limits.cpu=4,limits.memory=4056Mi,requests.cpu=4,requests.memory=4056Mi, limited: limits.cpu=4,limits.memory=4Gi,requests.cpu=4,requests.memory=4Gi"}
github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
    /opt/src/pkg/cache/placeholder_manager.go:95
github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
    /opt/src/pkg/cache/application.go:537
2024-10-21T20:50:19.880+0800  INFO   shim.cache.placeholder  cache/placeholder_manager.go:111  start to clean up app placeholders  {"appID": "spark-96aae620780e4b40a59893d850e8aad3"}
2024-10-21T20:50:19.973+0800  INFO   shim.utils              utils/utils.go:293                found user info from pod annotations  {"username": "system:serviceaccount:yunikorn:yunikorn-admin", "groups": ["system:serviceaccounts", "system:serviceaccounts:yunikorn", "system:authenticated"]}
2024-10-21T20:50:19.973+0800  DEBUG  shim.context            cache/context.go:1109             AddTask  {"appID": "spark-96aae620780e4b40a59893d850e8aad3", "taskID": "08490d01-bb3b-490b-a9b0-b9bd183cccd6"}
2024-10-21T20:50:19.973+0800  INFO   shim.context            cache/context.go:1131             task added  {"appID": "spark-96aae620780e4b40a59893d850e8aad3", "taskID": "08490d01-bb3b-490b-a9b0-b9bd183cccd6", "taskState": "New"}
2024-10-21T20:50:19.975+0800  INFO   shim.utils              utils/utils.go:293                found user info from pod annotations  {"username": "system:serviceaccount:yunikorn:yunikorn-admin", "groups": ["system:serviceaccounts", "system:serviceaccounts:yunikorn", "system:authenticated"]}
2024-10-21T20:50:20.001+0800  INFO   shim.cache.task         cache/task.go:533                 releasing allocations  {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 0}
2024-10-21T20:50:20.001+0800  INFO   shim.fsm                cache/task_state.go:380           Task state transition  {"app": "spark-96aae620780e4b40a59893d850e8aad3", "task": "b0ad9f47-c884-4fbd-be79-dba14c500de8", "taskAlias": "spark-my-test/tg-spark-96aae620780e4b40a59893-spark-driver-furk07f78s", "source": "New", "destination": "Completed", "event": "CompleteTask"}
2024-10-21T20:50:20.052+0800  INFO   shim.utils              utils/utils.go:293                found user info from pod annotations  {"username": "system:serviceaccount:yunikorn:yunikorn-admin", "groups": ["system:serviceaccounts", "system:serviceaccounts:yunikorn", "system:authenticated"]}
2024-10-21T20:50:20.052+0800  INFO   shim.utils              cache/gang_utils.go:117           gang scheduling style, using: Hard
2024-10-21T20:50:20.058+0800  INFO   shim.cache.placeholder  cache/placeholder_manager.go:124  finished cleaning up app placeholders  {"appID": "spark-96aae620780e4b40a59893d850e8aad3"}
2024-10-21T20:50:20.058+0800  DEBUG  shim.fsm                cache/application_state.go:500    shim app state transition  {"app": "spark-96aae620780e4b40a59893d850e8aad3", "source": "Reserving", "destination": "Running", "event": "UpdateReservation"}
{code}
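Putting the numbers together: as computed above, the gang needs 11 CPU / 11Gi against a 4 CPU / 4Gi quota, so placeholder creation is bound to hit the quota. The "used" figures in the error are consistent with the real driver Pod (600m heap plus the default 384Mi overhead, i.e. 984Mi) and three 1Gi placeholders already being admitted: 984Mi + 3 x 1024Mi = 4056Mi. The next create call is rejected, createAppPlaceholders logs the quota error, and because gangSchedulingStyle=Hard the shim starts cleaning up the app's placeholders, even logging "finished cleaning up app placeholders". Yet at least one tg-* Pod survives. My suspicion (an assumption on my part, not verified in the code): placeholder creation and the cleanup triggered by the failure run concurrently, so a placeholder created after the cleanup has collected its deletion list is never deleted and stays Pending. A hypothetical command-line illustration of that interleaving:
{code:java}
# Hypothetical illustration of the suspected race (not actual shim behavior).
# 1. Cleanup snapshots the placeholder pods it can see at that moment:
kubectl get pods -n spark-my-test -o name | grep '/tg-' > /tmp/ph-snapshot
# 2. Meanwhile placeholder creation is still in flight, so another tg-* pod
#    can appear after the snapshot was taken.
# 3. Cleanup deletes only the snapshotted pods:
xargs -r kubectl delete -n spark-my-test < /tmp/ph-snapshot
# 4. The late-created placeholder survives and stays Pending forever:
kubectl get pods -n spark-my-test | grep tg-
{code}
If that reading is right, the shim would need to either stop creating placeholders before the cleanup list is taken, or re-run cleanup for placeholders created after the failure.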
> ph Pod is in a pending state for a long time
> --------------------------------------------
>
> Key: YUNIKORN-2940
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2940
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.5.2
> Reporter: Xiaobao Wu
> Priority: Critical
>