[ https://issues.apache.org/jira/browse/YUNIKORN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856032#comment-17856032 ]

Craig Condit commented on YUNIKORN-2682:
----------------------------------------

It looks like you're using the LimitRange controller with policies that force 
pods to specify limits for cpu and memory (the spark-quota ResourceQuota in the 
namespace enforces the same requirement). YuniKorn 1.3 does not support that, so 
placeholder pod creation fails. You will need to either upgrade to YuniKorn 
1.4.0 or later (which requires Kubernetes >= 1.24.0), or disable the 
LimitRange controller.
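
If upgrading is not immediately possible, one interim option is to relax the 
quota itself so that pods without explicit limits (such as the 1.3 placeholder 
pods) are still admitted. This is only an illustration based on the spark-quota 
shown in the issue below, not something verified against this cluster:

{code:yaml}
# Hypothetical relaxed quota: only request-based accounting is enforced.
# Without limits.cpu / limits.memory under "hard", the admission check no
# longer requires every pod in the namespace to declare limits.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
  namespace: spark
spec:
  hard:
    requests.cpu: "5"
    requests.memory: "5Gi"
{code}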

> YuniKorn Gang Scheduling Issue: Executors Failing to Start When Running 
> Multiple Applications
> ---------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2682
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2682
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.3.0
>            Reporter: huangzhir
>            Priority: Major
>         Attachments: image-2024-06-19-00-02-53-178.png, 
> image-2024-06-19-00-03-09-703.png
>
>
> h2. Description:
> While using YuniKorn's gang scheduling, we hit a case where scheduling appears 
> to succeed but actually does not: when two applications are submitted 
> simultaneously, only the driver pods end up running, and the executor pods fail 
> to start due to insufficient resources. The scheduler logs show the following 
> error:
> {code:java}
> 2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to 
> create placeholder pod {"error": "pods 
> \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed 
> quota: spark-quota: must specify limits.cpu,limits.memory"}
> github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
>   github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
> github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
>   github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542 {code}
> h2. Environment:
>  * YuniKorn version: 1.3.0
>  * Kubernetes version: 1.21.3
>  * Spark version: 3.2.2
> h2. resource-quota.yaml
> {code:java}
> apiVersion: v1
> kind: ResourceQuota
> metadata:
>   name: spark-quota
>   namespace: spark
> spec:
>   hard:
>     requests.cpu: "5"
>     requests.memory: "5Gi"
>     limits.cpu: "5"
>     limits.memory: "5Gi" {code}
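> Because limits.cpu and limits.memory are listed under hard, the quota admission 
> check requires every pod created in the spark namespace to declare CPU and 
> memory limits. A pod that only sets requests, like the hypothetical sketch 
> below, is rejected with the same "must specify limits.cpu,limits.memory" error 
> shown above:
> {code:yaml}
> # Hypothetical requests-only pod, for illustration: with the quota above in
> # place, this pod is refused at admission time because it declares no limits.
> apiVersion: v1
> kind: Pod
> metadata:
>   name: requests-only-example
>   namespace: spark
> spec:
>   containers:
>     - name: main
>       image: registry.k8s.io/pause:3.9
>       resources:
>         requests:
>           cpu: "1"
>           memory: 2Gi
> {code}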
> h2. yunikorn-configs.yaml 
> {code:java}
> apiVersion: v1
> kind: ConfigMap
> metadata:
>   name: yunikorn-configs
>   namespace: yunikorn
> data:
>   log.level: "-1"
>   log.admission.level: "-1"
>   log.core.config.level: "-1"
>   queues.yaml: |
>     partitions:
>       - name: default
>         placementrules:
>           - name: tag
>             value: namespace
>             create: true
>         queues:
>           - name: root
>             submitacl: '*'
>             properties:
>               application.sort.policy: fifo
>               placeholderTimeoutInSeconds: 60
>               schedulingStyle: Hard
>             queues:
>               - name: spark
>                 properties:
>                   application.sort.policy: fifo
>                   placeholderTimeoutInSeconds: 60
>                   schedulingStyle: Hard
>                 resources:
>                   guaranteed:
>                     vcore: 5
>                     memory: 5Gi
>                   max:
>                     vcore: 5
>                     memory: 5Gi {code}
> h2. Spark-submit command
> {code:java}
> ./bin/spark-submit \
>   --master k8s://https://10.10.10.10:6443 \
>   --deploy-mode cluster \
>   --name spark-pi \
>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=sparksa \
>   --conf spark.kubernetes.namespace=spark \
>   --class org.apache.spark.examples.SparkPi \
>   --conf spark.executor.instances=1 \
>   --conf spark.executor.cores=1 \
>   --conf spark.executor.memory=1500m \
>   --conf spark.driver.cores=1 \
>   --conf spark.driver.memory=1500m \
>   --conf spark.kubernetes.driver.limit.cores=1 \
>   --conf spark.kubernetes.driver.limit.memory=2G \
>   --conf spark.kubernetes.executor.limit.cores=1 \
>   --conf spark.kubernetes.executor.limit.memory=2G \
>   --conf spark.kubernetes.driver.label.app=spark \
>   --conf spark.kubernetes.executor.label.app=spark \
>   --conf spark.kubernetes.container.image=apache/spark:v3.3.2 \
>   --conf spark.kubernetes.scheduler.name=yunikorn \
>   --conf spark.kubernetes.driver.label.queue=root.spark \
>   --conf spark.kubernetes.executor.label.queue=root.spark \
>   --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
>   --conf spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
>   --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name=spark-driver \
>   --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups='[{"name": "spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector": {"app": "spark"} }, {"name": "spark-executor", "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector": {"app": "spark"} }]' \
>   --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/schedulingPolicyParameters='placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard' \
>   --conf spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name=spark-executor \
>   local:///opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \
>   3000 {code}
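> The spark.kubernetes.driver.annotation.* and spark.kubernetes.driver.label.* 
> flags above are copied by Spark onto the driver pod, so the pod metadata 
> YuniKorn sees looks roughly like the sketch below (assembled from the flags, 
> not captured from the cluster):
> {code:yaml}
> # Sketch of the driver pod metadata produced by the spark-submit flags above.
> metadata:
>   labels:
>     app: spark
>     queue: root.spark
>   annotations:
>     yunikorn.apache.org/app-id: "{{APP_ID}}"
>     yunikorn.apache.org/task-group-name: spark-driver
>     yunikorn.apache.org/task-groups: |
>       [{"name": "spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"}, "nodeSelector": {"app": "spark"}},
>        {"name": "spark-executor", "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"}, "nodeSelector": {"app": "spark"}}]
>     yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard"
> {code}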
>  
> h2. Scheduler logs
> kubectl logs yunikorn-scheduler-56f599846b-8fl7d yunikorn-scheduler-k8s -n yunikorn
> {code:java}
> 2024-06-18T15:15:26.201Z  DEBUG general/general.go:141  pod added {"appType": 
> "general", "Name": "spark-pi-f4f19b902beac663-driver", "Namespace": "spark"}
> 2024-06-18T15:15:26.201Z  DEBUG utils/utils.go:305  Unable to retrieve user 
> name from pod labels. Empty user label  {"userLabel": 
> "yunikorn.apache.org/username"}
> 2024-06-18T15:15:26.201Z  DEBUG cache/context.go:737  AddApplication  
> {"Request": 
> {"Metadata":{"ApplicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","QueueName":"root.spark","User":"nobody","Tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
>  gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\": 
> \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\": 
> \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} 
> }]"},"Groups":null,"TaskGroups":[{"name":"spark-driver","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}},{"name":"spark-executor","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}}],"OwnerReferences":[{"apiVersion":"v1","kind":"Pod","name":"spark-pi-f4f19b902beac663-driver","uid":"4fb897dd-3af5-4799-a09d-640b5222ba3a","controller":false,"blockOwnerDeletion":true}],"SchedulingPolicyParameters":{},"CreationTime":0}}}
> 2024-06-18T15:15:26.201Z  DEBUG cache/context.go:746  app namespace info  
> {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "namespace": "spark"}
> 2024-06-18T15:15:26.201Z  INFO  cache/context.go:773  app added {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.201Z  DEBUG cache/context.go:841  AddTask {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z  DEBUG cache/context.go:233  adding pod to cache 
> {"podName": "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:26.201Z  INFO  cache/context.go:863  task added  {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskState": "New"}
> 2024-06-18T15:15:26.201Z  INFO  cache/context.go:873  app request originating 
> pod added {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "original 
> task": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (AddPod.Pre)  {"nodes": 3, "pods": 55, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": 
> {"Pending":2,"Running":53}}
> 2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (AddPod.Post) {"nodes": 3, "pods": 56, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:26.277Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:26.924Z  DEBUG cache/application_state.go:508  shim app 
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
> "New", "destination": "Submitted", "event": "SubmitApplication"}
> 2024-06-18T15:15:26.924Z  INFO  cache/application.go:424  handle app 
> submission {"app": "applicationID: spark-8e410a4c5ce44da2aa85ba835257a1e9, 
> queue: root.spark, partition: default, totalNumOfTasks: 1, currentState: 
> Submitted", "clusterID": "mycluster"}
> 2024-06-18T15:15:26.924Z  DEBUG scheduler/scheduler.go:96 enqueued event  
> {"eventType": "*rmevent.RMUpdateApplicationEvent", "event": 
> {"Request":{"new":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","queueName":"root.spark","partitionName":"[mycluster]default","ugi":{"user":"nobody"},"tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
>  gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\": 
> \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\": 
> \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} 
> }]"},"executionTimeoutMilliSeconds":60000,"placeholderAsk":{"resources":{"memory":{"value":4294967296},"pods":{"value":2},"vcore":{"value":2000}}},"gangSchedulingStyle":"Hard"}],"rmID":"mycluster"}},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:26.924Z  DEBUG placement/placement.go:145  Executing rule 
> for placing application  {"ruleName": "tag", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.924Z  DEBUG placement/tag_rule.go:106 Tag rule 
> intermediate result  {"application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue": "root.spark"}
> 2024-06-18T15:15:26.924Z  INFO  placement/tag_rule.go:115 Tag rule 
> application placed {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "queue": "root.spark"}
> 2024-06-18T15:15:26.924Z  DEBUG placement/placement.go:204  Rule result for 
> placing application {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "queueName": "root.spark"}
> 2024-06-18T15:15:26.925Z  INFO  scheduler/context.go:549  Added application 
> to partition  {"applicationID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "partitionName": "[mycluster]default", "requested queue": "root.spark", 
> "placed queue": "root.spark"}
> 2024-06-18T15:15:26.925Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": 
> {"RmID":"mycluster","AcceptedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9"}],"RejectedApplications":[],"UpdatedApplications":null},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:26.925Z  DEBUG callback/scheduler_callback.go:108  
> UpdateApplication callback received {"UpdateApplicationResponse": 
> "accepted:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > "}
> 2024-06-18T15:15:26.925Z  DEBUG callback/scheduler_callback.go:114  callback: 
> response to accepted application  {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.925Z  INFO  callback/scheduler_callback.go:118  Accepting 
> app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.925Z  DEBUG cache/application_state.go:508  shim app 
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
> "Submitted", "destination": "Accepted", "event": "AcceptApplication"}
> 2024-06-18T15:15:27.277Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:27.925Z  DEBUG cache/application.go:516  postAppAccepted on 
> cached app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "numTaskGroups": 2, "numAllocatedTasks": 0}
> 2024-06-18T15:15:27.925Z  INFO  cache/application.go:526  app has taskGroups 
> defined, trying to reserve resources for gang members  {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.925Z  DEBUG cache/application_state.go:508  shim app 
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
> "Accepted", "destination": "Reserving", "event": "TryReserve"}
> 2024-06-18T15:15:27.933Z  ERROR cache/placeholder_manager.go:99 failed to 
> create placeholder pod  {"error": "pods 
> \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed 
> quota: spark-quota: must specify limits.cpu,limits.memory"}
> github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
>   github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
> github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
>   github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542
> 2024-06-18T15:15:27.933Z  INFO  cache/placeholder_manager.go:115  start to 
> clean up app placeholders  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.933Z  INFO  cache/placeholder_manager.go:128  finished 
> cleaning up app placeholders {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.933Z  DEBUG cache/application_state.go:508  shim app 
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
> "Reserving", "destination": "Running", "event": "RunApplication"}
> 2024-06-18T15:15:28.278Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:28.867Z  DEBUG scheduler/partition_manager.go:83 time 
> consumed for queue cleaner {"duration": "6.41µs"}
> 2024-06-18T15:15:28.925Z  INFO  cache/task_state.go:380 Task state transition 
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
> "spark/spark-pi-f4f19b902beac663-driver", "source": "New", "destination": 
> "Pending", "event": "InitTask"}
> 2024-06-18T15:15:28.926Z  INFO  cache/task_state.go:380 Task state transition 
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Pending", "destination": 
> "Scheduling", "event": "SubmitTask"}
> 2024-06-18T15:15:28.926Z  DEBUG cache/task.go:275 scheduling pod  {"podName": 
> "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.926Z  DEBUG cache/task.go:294 send update request 
> {"request": "asks:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\" 
> applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
> resourceAsk:<resources:<key:\"memory\" value:<value:1975517184 > > 
> resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\" 
> value:<value:1000 > > > maxAllocations:1 
> tags:<key:\"kubernetes.io/label/app\" value:\"spark\" > 
> tags:<key:\"kubernetes.io/label/queue\" value:\"root.spark\" > 
> tags:<key:\"kubernetes.io/label/spark-app-name\" value:\"spark-pi\" > 
> tags:<key:\"kubernetes.io/label/spark-app-selector\" 
> value:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > 
> tags:<key:\"kubernetes.io/label/spark-role\" value:\"driver\" > 
> tags:<key:\"kubernetes.io/label/spark-version\" value:\"3.3.2\" > 
> tags:<key:\"kubernetes.io/meta/namespace\" value:\"spark\" > 
> tags:<key:\"kubernetes.io/meta/podName\" 
> value:\"spark-pi-f4f19b902beac663-driver\" > taskGroupName:\"spark-driver\" 
> Originator:true preemptionPolicy:<allowPreemptSelf:true 
> allowPreemptOther:true > > rmID:\"mycluster\" "}
> 2024-06-18T15:15:28.926Z  DEBUG scheduler/scheduler.go:96 enqueued event  
> {"eventType": "*rmevent.RMUpdateAllocationEvent", "event": 
> {"Request":{"asks":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","partitionName":"[mycluster]default","resourceAsk":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"maxAllocations":1,"tags":{"kubernetes.io/label/app":"spark","kubernetes.io/label/queue":"root.spark","kubernetes.io/label/spark-app-name":"spark-pi","kubernetes.io/label/spark-app-selector":"spark-8e410a4c5ce44da2aa85ba835257a1e9","kubernetes.io/label/spark-role":"driver","kubernetes.io/label/spark-version":"3.3.2","kubernetes.io/meta/namespace":"spark","kubernetes.io/meta/podName":"spark-pi-f4f19b902beac663-driver"},"taskGroupName":"spark-driver","Originator":true,"preemptionPolicy":{"allowPreemptSelf":true,"allowPreemptOther":true}}],"rmID":"mycluster"}},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z  INFO  objects/application_state.go:133  Application 
> state transition  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "source": "New", "destination": "Accepted", "event": "runApplication"}
> 2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": 
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Accepted","stateTransitionTimestamp":1718723728926391633,"message":"runApplication"}]},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z  INFO  objects/application.go:668  ask added 
> successfully to application {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "ask": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "placeholder": false, "pendingDelta": 
> "map[memory:1975517184 pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:108  
> UpdateApplication callback received {"UpdateApplicationResponse": 
> "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
> state:\"Accepted\" stateTransitionTimestamp:1718723728926391633 
> message:\"runApplication\" > "}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:137  status 
> update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "new status": "Accepted"}
> 2024-06-18T15:15:28.926Z  DEBUG objects/application.go:339  Application state 
> timer initiated {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "state": 
> "Starting", "timeout": "5m0s"}
> 2024-06-18T15:15:28.926Z  INFO  objects/application_state.go:133  Application 
> state transition  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "source": "Accepted", "destination": "Starting", "event": "runApplication"}
> 2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": 
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Starting","stateTransitionTimestamp":1718723728926635454,"message":"runApplication"}]},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/manager.go:63 Increasing resource usage 
> {"user": "nobody", "queue path": "root.spark", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue 
> tracker object for queue {"queue": "root"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
> usage {"queue path": "root.spark", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue 
> tracker object for queue {"queue": "spark"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
> usage {"queue path": "spark", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/manager.go:257  Group tracker does not 
> exist. Creating group tracker object and linking the same with application 
> {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue path": 
> "root.spark", "user": "nobody", "group": "nobody"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue 
> tracker object for queue {"queue": "root"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
> usage {"queue path": "root.spark", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue 
> tracker object for queue {"queue": "spark"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
> usage {"queue path": "spark", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:108  
> UpdateApplication callback received {"UpdateApplicationResponse": 
> "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
> state:\"Starting\" stateTransitionTimestamp:1718723728926635454 
> message:\"runApplication\" > "}
> 2024-06-18T15:15:28.926Z  DEBUG objects/queue.go:1239 allocation found on 
> queue {"queueName": "root.spark", "appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "allocation": 
> "applicationID=spark-8e410a4c5ce44da2aa85ba835257a1e9, 
> uuid=3e42c4e0-0fc3-466c-a524-444b2eec700b, 
> allocationKey=4fb897dd-3af5-4799-a09d-640b5222ba3a, Node=10.10.10.66, 
> result=Allocated"}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:137  status 
> update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "new status": "Starting"}
> 2024-06-18T15:15:28.926Z  INFO  scheduler/partition.go:888  scheduler 
> allocation processed  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "allocationKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "uuid": 
> "3e42c4e0-0fc3-466c-a524-444b2eec700b", "allocatedResource": 
> "map[memory:1975517184 pods:1 vcore:1000]", "placeholder": false, 
> "targetNode": "10.10.10.66"}
> 2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
> {"eventType": "*rmevent.RMNewAllocationsEvent", "event": 
> {"RmID":"mycluster","Allocations":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","UUID":"3e42c4e0-0fc3-466c-a524-444b2eec700b","resourcePerAlloc":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"nodeID":"10.10.10.66","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","taskGroupName":"spark-driver"}]},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:48 
> UpdateAllocation callback received  {"UpdateAllocationResponse": 
> "new:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\" 
> UUID:\"3e42c4e0-0fc3-466c-a524-444b2eec700b\" 
> resourcePerAlloc:<resources:<key:\"memory\" value:<value:1975517184 > > 
> resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\" 
> value:<value:1000 > > > nodeID:\"10.10.10.66\" 
> applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
> taskGroupName:\"spark-driver\" > "}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:53 callback: 
> response to new allocation  {"allocationKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "UUID": 
> "3e42c4e0-0fc3-466c-a524-444b2eec700b", "applicationID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "nodeID": "10.10.10.66"}
> 2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (AssumePod.Pre) {"nodes": 3, "pods": 56, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:476 Adding 
> assumed pod to cache {"podName": "spark-pi-f4f19b902beac663-driver", 
> "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "node": "10.10.10.66", 
> "allBound": true}
> 2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (AssumePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.926Z  DEBUG scheduler/context.go:853  Successfully synced 
> shim on new allocation. response: no. of allocations: 1
> 2024-06-18T15:15:28.926Z  INFO  cache/task_state.go:380 Task state transition 
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Scheduling", 
> "destination": "Allocated", "event": "TaskAllocated"}
> 2024-06-18T15:15:28.926Z  DEBUG cache/task.go:349 bind pod volumes  
> {"podName": "spark-pi-f4f19b902beac663-driver", "podUID": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z  INFO  cache/context.go:499  Binding Pod Volumes 
> skipped: all volumes already bound  {"podName": 
> "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.926Z  DEBUG cache/task.go:362 bind pod  {"podName": 
> "spark-pi-f4f19b902beac663-driver", "podUID": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z  INFO  client/kubeclient.go:112  bind pod to node  
> {"podName": "spark-pi-f4f19b902beac663-driver", "podUID": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "nodeID": "10.10.10.66"}
> 2024-06-18T15:15:28.933Z  INFO  cache/task.go:375 successfully bound pod  
> {"podName": "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.934Z  INFO  cache/task_state.go:380 Task state transition 
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Allocated", 
> "destination": "Bound", "event": "TaskBound"}
> 2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:29.278Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:30.279Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:31.279Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:32.280Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdateNode.Pre)  {"nodes": 3, "pods": 56, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:179 Updating node 
> in cache  {"nodeName": "10.10.10.66"}
> 2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdateNode.Post) {"nodes": 3, "pods": 56, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:32.395Z  DEBUG cache/node.go:109 set node capacity 
> {"nodeID": "10.10.10.66", "capacity": "resources:<key:\"ephemeral-storage\" 
> value:<value:478652923105 > > resources:<key:\"hugepages-1Gi\" value:<> > 
> resources:<key:\"hugepages-2Mi\" value:<> > 
> resources:<key:\"kubernetes.io/batch-cpu\" value:<value:5167 > > 
> resources:<key:\"kubernetes.io/batch-memory\" value:<value:34437364710 > > 
> resources:<key:\"kubernetes.io/mid-cpu\" value:<value:3156 > > 
> resources:<key:\"kubernetes.io/mid-memory\" value:<value:1442139746 > > 
> resources:<key:\"memory\" value:<value:64713940992 > > 
> resources:<key:\"pods\" value:<value:110 > > resources:<key:\"vcore\" 
> value:<value:14000 > > "}
> 2024-06-18T15:15:32.396Z  INFO  cache/nodes.go:179  Node's ready status flag  
> {"Node name": "10.10.10.66", "ready": true}
> 2024-06-18T15:15:32.396Z  INFO  cache/nodes.go:184  report updated nodes to 
> scheduler {"request": 
> {"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}}
> 2024-06-18T15:15:32.396Z  DEBUG scheduler/scheduler.go:96 enqueued event  
> {"eventType": "*rmevent.RMUpdateNodeEvent", "event": 
> {"Request":{"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true","si/node-partition":"[mycluster]default"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:32.396Z  INFO  objects/queue.go:1190 updating root queue max 
> resources {"current max": "map[ephemeral-storage:1435958769315 
> hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19610 
> kubernetes.io/batch-memory:109317941106 kubernetes.io/mid-cpu:9574 
> kubernetes.io/mid-memory:5209251314 memory:194141839360 pods:330 
> vcore:42000]", "new max": "map[ephemeral-storage:1435958769315 
> hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19309 
> kubernetes.io/batch-memory:109530403859 kubernetes.io/mid-cpu:10574 
> kubernetes.io/mid-memory:6651391060 memory:194141839360 pods:330 
> vcore:42000]"}
> 2024-06-18T15:15:32.396Z  DEBUG objects/node.go:182 skip updating 
> occupiedResource, not changed {code}
> h2. Spark Pod status
> kubectl get pod -n spark | grep Running
> !image-2024-06-19-00-02-53-178.png!
> kubectl describe pod spark-pi-xxxxxxxx-driver -n spark
> !image-2024-06-19-00-03-09-703.png!