[ 
https://issues.apache.org/jira/browse/YUNIKORN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit closed YUNIKORN-2682.
----------------------------------
      Assignee: Craig Condit
    Resolution: Workaround

> YuniKorn Gang Scheduling Issue: Executors Failing to Start When Running 
> Multiple Applications
> ---------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2682
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2682
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.3.0
>            Reporter: huangzhir
>            Assignee: Craig Condit
>            Priority: Major
>         Attachments: image-2024-06-19-00-02-53-178.png, 
> image-2024-06-19-00-03-09-703.png
>
>
> h2. Description:
> While using YuniKorn's gang scheduling, we encountered a situation where the 
> scheduling process appears to succeed, but in reality, there is a problem. 
> When submitting two applications simultaneously, only the driver pods are 
> successfully running, and the executor pods fail to start due to insufficient 
> resources. The following error is observed in the scheduler logs:
> {code:java}
> 2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to 
> create placeholder pod {"error": "pods 
> \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed 
> quota: spark-quota: must specify limits.cpu,limits.memory"}
> github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
>   github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
> github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
>   github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542 {code}
> h2. Environment:
>  * YuniKorn version: 1.3.0
>  * Kubernetes version: 1.21.3
>  * Spark version: 3.2.2
> h2. *resource-quota.yaml*
> {code:java}
> apiVersion: v1
> kind: ResourceQuota
> metadata:
>   name: spark-quota
>   namespace: spark
> spec:
>   hard:
>     requests.cpu: "5"
>     requests.memory: "5Gi"
>     limits.cpu: "5"
>     limits.memory: "5Gi" {code}
> h2. yunikorn-configs.yaml 
> {code:java}
> apiVersion: v1
> kind: ConfigMap
> metadata:
>   name: yunikorn-configs
>   namespace: yunikorn
> data:
>   log.level: "-1"
>   log.admission.level: "-1"
>   log.core.config.level: "-1"
>   queues.yaml: |
>     partitions:
>       - name: default
>         placementrules:
>           - name: tag
>             value: namespace
>             create: true
>         queues:
>           - name: root
>             submitacl: '*'
>             properties:
>               application.sort.policy: fifo
>               placeholderTimeoutInSeconds: 60
>               schedulingStyle: Hard
>             queues:
>               - name: spark
>                 properties:
>                   application.sort.policy: fifo
>                   placeholderTimeoutInSeconds: 60
>                   schedulingStyle: Hard
>                 resources:
>                   guaranteed:
>                     vcore: 5
>                     memory: 5Gi
>                   max:
>                     vcore: 5
>                     memory: 5Gi {code}
> h2. Spark-submit command
> {code:java}
> ./bin/spark-submit \
>   --master k8s://https://10.10.10.10:6443 \
>   --deploy-mode cluster \
>   --name spark-pi \
>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=sparksa \
>   --conf spark.kubernetes.namespace=spark \
>   --class org.apache.spark.examples.SparkPi \
>   --conf spark.executor.instances=1 \
>   --conf spark.executor.cores=1 \
>   --conf spark.executor.memory=1500m \
>   --conf spark.driver.cores=1 \
>   --conf spark.driver.memory=1500m \
>   --conf spark.kubernetes.driver.limit.cores=1 \
>   --conf spark.kubernetes.driver.limit.memory=2G \
>   --conf spark.kubernetes.executor.limit.cores=1 \
>   --conf spark.kubernetes.executor.limit.memory=2G \
>    --conf spark.kubernetes.driver.label.app=spark \
>   --conf spark.kubernetes.executor.label.app=spark \
>   --conf spark.kubernetes.container.image=apache/spark:v3.3.2 \
>   --conf spark.kubernetes.scheduler.name=yunikorn \
>   --conf spark.kubernetes.driver.label.queue=root.spark \
>   --conf spark.kubernetes.executor.label.queue=root.spark \
>   --conf 
> spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
>   --conf 
> spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
>   --conf 
> spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name=spark-driver
>  \
>   --conf 
> spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups='[{"name": 
> "spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory": 
> "2Gi"},"nodeSelector": {"app": "spark"} }, {"name": "spark-executor", 
> "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector": 
> {"app": "spark"} }]' \
>   --conf 
> spark.kubernetes.driver.annotation.yunikorn.apache.org/schedulingPolicyParameters='placeholderTimeoutInSeconds=60
>  gangSchedulingStyle=Hard' \
>   --conf 
> spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name=spark-executor
>  \
>   local:///opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \
>   3000 {code}
>  
> h2. scheduler logs
> kubectl  logs yunikorn-scheduler-56f599846b-8fl7d yunikorn-scheduler-k8s -n 
> yunikorn
> {code:java}
> 2024-06-18T15:15:26.201Z  DEBUG general/general.go:141  pod added {"appType": 
> "general", "Name": "spark-pi-f4f19b902beac663-driver", "Namespace": "spark"}
> 2024-06-18T15:15:26.201Z  DEBUG utils/utils.go:305  Unable to retrieve user 
> name from pod labels. Empty user label  {"userLabel": 
> "yunikorn.apache.org/username"}
> 2024-06-18T15:15:26.201Z  DEBUG cache/context.go:737  AddApplication  
> {"Request": 
> {"Metadata":{"ApplicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","QueueName":"root.spark","User":"nobody","Tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
>  gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\": 
> \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\": 
> \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} 
> }]"},"Groups":null,"TaskGroups":[{"name":"spark-driver","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}},{"name":"spark-executor","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}}],"OwnerReferences":[{"apiVersion":"v1","kind":"Pod","name":"spark-pi-f4f19b902beac663-driver","uid":"4fb897dd-3af5-4799-a09d-640b5222ba3a","controller":false,"blockOwnerDeletion":true}],"SchedulingPolicyParameters":{},"CreationTime":0}}}
> 2024-06-18T15:15:26.201Z  DEBUG cache/context.go:746  app namespace info  
> {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "namespace": "spark"}
> 2024-06-18T15:15:26.201Z  INFO  cache/context.go:773  app added {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.201Z  DEBUG cache/context.go:841  AddTask {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z  DEBUG cache/context.go:233  adding pod to cache 
> {"podName": "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:26.201Z  INFO  cache/context.go:863  task added  {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskState": "New"}
> 2024-06-18T15:15:26.201Z  INFO  cache/context.go:873  app request originating 
> pod added {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "original 
> task": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (AddPod.Pre)  {"nodes": 3, "pods": 55, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": 
> {"Pending":2,"Running":53}}
> 2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (AddPod.Post) {"nodes": 3, "pods": 56, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:26.277Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:26.924Z  DEBUG cache/application_state.go:508  shim app 
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
> "New", "destination": "Submitted", "event": "SubmitApplication"}
> 2024-06-18T15:15:26.924Z  INFO  cache/application.go:424  handle app 
> submission {"app": "applicationID: spark-8e410a4c5ce44da2aa85ba835257a1e9, 
> queue: root.spark, partition: default, totalNumOfTasks: 1, currentState: 
> Submitted", "clusterID": "mycluster"}
> 2024-06-18T15:15:26.924Z  DEBUG scheduler/scheduler.go:96 enqueued event  
> {"eventType": "*rmevent.RMUpdateApplicationEvent", "event": 
> {"Request":{"new":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","queueName":"root.spark","partitionName":"[mycluster]default","ugi":{"user":"nobody"},"tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
>  gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\": 
> \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\": 
> \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} 
> }]"},"executionTimeoutMilliSeconds":60000,"placeholderAsk":{"resources":{"memory":{"value":4294967296},"pods":{"value":2},"vcore":{"value":2000}}},"gangSchedulingStyle":"Hard"}],"rmID":"mycluster"}},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:26.924Z  DEBUG placement/placement.go:145  Executing rule 
> for placing application  {"ruleName": "tag", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.924Z  DEBUG placement/tag_rule.go:106 Tag rule 
> intermediate result  {"application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue": "root.spark"}
> 2024-06-18T15:15:26.924Z  INFO  placement/tag_rule.go:115 Tag rule 
> application placed {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "queue": "root.spark"}
> 2024-06-18T15:15:26.924Z  DEBUG placement/placement.go:204  Rule result for 
> placing application {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "queueName": "root.spark"}
> 2024-06-18T15:15:26.925Z  INFO  scheduler/context.go:549  Added application 
> to partition  {"applicationID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "partitionName": "[mycluster]default", "requested queue": "root.spark", 
> "placed queue": "root.spark"}
> 2024-06-18T15:15:26.925Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": 
> {"RmID":"mycluster","AcceptedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9"}],"RejectedApplications":[],"UpdatedApplications":null},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:26.925Z  DEBUG callback/scheduler_callback.go:108  
> UpdateApplication callback received {"UpdateApplicationResponse": 
> "accepted:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > "}
> 2024-06-18T15:15:26.925Z  DEBUG callback/scheduler_callback.go:114  callback: 
> response to accepted application  {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.925Z  INFO  callback/scheduler_callback.go:118  Accepting 
> app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.925Z  DEBUG cache/application_state.go:508  shim app 
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
> "Submitted", "destination": "Accepted", "event": "AcceptApplication"}
> 2024-06-18T15:15:27.277Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:27.925Z  DEBUG cache/application.go:516  postAppAccepted on 
> cached app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "numTaskGroups": 2, "numAllocatedTasks": 0}
> 2024-06-18T15:15:27.925Z  INFO  cache/application.go:526  app has taskGroups 
> defined, trying to reserve resources for gang members  {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.925Z  DEBUG cache/application_state.go:508  shim app 
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
> "Accepted", "destination": "Reserving", "event": "TryReserve"}
> 2024-06-18T15:15:27.933Z  ERROR cache/placeholder_manager.go:99 failed to 
> create placeholder pod  {"error": "pods 
> \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed 
> quota: spark-quota: must specify limits.cpu,limits.memory"}
> github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
>   github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
> github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
>   github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542
> 2024-06-18T15:15:27.933Z  INFO  cache/placeholder_manager.go:115  start to 
> clean up app placeholders  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.933Z  INFO  cache/placeholder_manager.go:128  finished 
> cleaning up app placeholders {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.933Z  DEBUG cache/application_state.go:508  shim app 
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
> "Reserving", "destination": "Running", "event": "RunApplication"}
> 2024-06-18T15:15:28.278Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:28.867Z  DEBUG scheduler/partition_manager.go:83 time 
> consumed for queue cleaner {"duration": "6.41µs"}
> 2024-06-18T15:15:28.925Z  INFO  cache/task_state.go:380 Task state transition 
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
> "spark/spark-pi-f4f19b902beac663-driver", "source": "New", "destination": 
> "Pending", "event": "InitTask"}
> 2024-06-18T15:15:28.926Z  INFO  cache/task_state.go:380 Task state transition 
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Pending", "destination": 
> "Scheduling", "event": "SubmitTask"}
> 2024-06-18T15:15:28.926Z  DEBUG cache/task.go:275 scheduling pod  {"podName": 
> "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.926Z  DEBUG cache/task.go:294 send update request 
> {"request": "asks:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\" 
> applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
> resourceAsk:<resources:<key:\"memory\" value:<value:1975517184 > > 
> resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\" 
> value:<value:1000 > > > maxAllocations:1 
> tags:<key:\"kubernetes.io/label/app\" value:\"spark\" > 
> tags:<key:\"kubernetes.io/label/queue\" value:\"root.spark\" > 
> tags:<key:\"kubernetes.io/label/spark-app-name\" value:\"spark-pi\" > 
> tags:<key:\"kubernetes.io/label/spark-app-selector\" 
> value:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > 
> tags:<key:\"kubernetes.io/label/spark-role\" value:\"driver\" > 
> tags:<key:\"kubernetes.io/label/spark-version\" value:\"3.3.2\" > 
> tags:<key:\"kubernetes.io/meta/namespace\" value:\"spark\" > 
> tags:<key:\"kubernetes.io/meta/podName\" 
> value:\"spark-pi-f4f19b902beac663-driver\" > taskGroupName:\"spark-driver\" 
> Originator:true preemptionPolicy:<allowPreemptSelf:true 
> allowPreemptOther:true > > rmID:\"mycluster\" "}
> 2024-06-18T15:15:28.926Z  DEBUG scheduler/scheduler.go:96 enqueued event  
> {"eventType": "*rmevent.RMUpdateAllocationEvent", "event": 
> {"Request":{"asks":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","partitionName":"[mycluster]default","resourceAsk":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"maxAllocations":1,"tags":{"kubernetes.io/label/app":"spark","kubernetes.io/label/queue":"root.spark","kubernetes.io/label/spark-app-name":"spark-pi","kubernetes.io/label/spark-app-selector":"spark-8e410a4c5ce44da2aa85ba835257a1e9","kubernetes.io/label/spark-role":"driver","kubernetes.io/label/spark-version":"3.3.2","kubernetes.io/meta/namespace":"spark","kubernetes.io/meta/podName":"spark-pi-f4f19b902beac663-driver"},"taskGroupName":"spark-driver","Originator":true,"preemptionPolicy":{"allowPreemptSelf":true,"allowPreemptOther":true}}],"rmID":"mycluster"}},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z  INFO  objects/application_state.go:133  Application 
> state transition  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "source": "New", "destination": "Accepted", "event": "runApplication"}
> 2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": 
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Accepted","stateTransitionTimestamp":1718723728926391633,"message":"runApplication"}]},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z  INFO  objects/application.go:668  ask added 
> successfully to application {"appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "ask": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "placeholder": false, "pendingDelta": 
> "map[memory:1975517184 pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:108  
> UpdateApplication callback received {"UpdateApplicationResponse": 
> "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
> state:\"Accepted\" stateTransitionTimestamp:1718723728926391633 
> message:\"runApplication\" > "}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:137  status 
> update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "new status": "Accepted"}
> 2024-06-18T15:15:28.926Z  DEBUG objects/application.go:339  Application state 
> timer initiated {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "state": 
> "Starting", "timeout": "5m0s"}
> 2024-06-18T15:15:28.926Z  INFO  objects/application_state.go:133  Application 
> state transition  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "source": "Accepted", "destination": "Starting", "event": "runApplication"}
> 2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": 
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Starting","stateTransitionTimestamp":1718723728926635454,"message":"runApplication"}]},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/manager.go:63 Increasing resource usage 
> {"user": "nobody", "queue path": "root.spark", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue 
> tracker object for queue {"queue": "root"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
> usage {"queue path": "root.spark", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue 
> tracker object for queue {"queue": "spark"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
> usage {"queue path": "spark", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/manager.go:257  Group tracker does not 
> exist. Creating group tracker object and linking the same with application 
> {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue path": 
> "root.spark", "user": "nobody", "group": "nobody"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue 
> tracker object for queue {"queue": "root"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
> usage {"queue path": "root.spark", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue 
> tracker object for queue {"queue": "spark"}
> 2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
> usage {"queue path": "spark", "application": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:108  
> UpdateApplication callback received {"UpdateApplicationResponse": 
> "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
> state:\"Starting\" stateTransitionTimestamp:1718723728926635454 
> message:\"runApplication\" > "}
> 2024-06-18T15:15:28.926Z  DEBUG objects/queue.go:1239 allocation found on 
> queue {"queueName": "root.spark", "appID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "allocation": 
> "applicationID=spark-8e410a4c5ce44da2aa85ba835257a1e9, 
> uuid=3e42c4e0-0fc3-466c-a524-444b2eec700b, 
> allocationKey=4fb897dd-3af5-4799-a09d-640b5222ba3a, Node=10.10.10.66, 
> result=Allocated"}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:137  status 
> update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "new status": "Starting"}
> 2024-06-18T15:15:28.926Z  INFO  scheduler/partition.go:888  scheduler 
> allocation processed  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
> "allocationKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "uuid": 
> "3e42c4e0-0fc3-466c-a524-444b2eec700b", "allocatedResource": 
> "map[memory:1975517184 pods:1 vcore:1000]", "placeholder": false, 
> "targetNode": "10.10.10.66"}
> 2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
> {"eventType": "*rmevent.RMNewAllocationsEvent", "event": 
> {"RmID":"mycluster","Allocations":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","UUID":"3e42c4e0-0fc3-466c-a524-444b2eec700b","resourcePerAlloc":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"nodeID":"10.10.10.66","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","taskGroupName":"spark-driver"}]},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:48 
> UpdateAllocation callback received  {"UpdateAllocationResponse": 
> "new:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\" 
> UUID:\"3e42c4e0-0fc3-466c-a524-444b2eec700b\" 
> resourcePerAlloc:<resources:<key:\"memory\" value:<value:1975517184 > > 
> resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\" 
> value:<value:1000 > > > nodeID:\"10.10.10.66\" 
> applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
> taskGroupName:\"spark-driver\" > "}
> 2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:53 callback: 
> response to new allocation  {"allocationKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "UUID": 
> "3e42c4e0-0fc3-466c-a524-444b2eec700b", "applicationID": 
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "nodeID": "10.10.10.66"}
> 2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (AssumePod.Pre) {"nodes": 3, "pods": 56, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:476 Adding 
> assumed pod to cache {"podName": "spark-pi-f4f19b902beac663-driver", 
> "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "node": "10.10.10.66", 
> "allBound": true}
> 2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (AssumePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.926Z  DEBUG scheduler/context.go:853  Successfully synced 
> shim on new allocation. response: no. of allocations: 1
> 2024-06-18T15:15:28.926Z  INFO  cache/task_state.go:380 Task state transition 
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Scheduling", 
> "destination": "Allocated", "event": "TaskAllocated"}
> 2024-06-18T15:15:28.926Z  DEBUG cache/task.go:349 bind pod volumes  
> {"podName": "spark-pi-f4f19b902beac663-driver", "podUID": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z  INFO  cache/context.go:499  Binding Pod Volumes 
> skipped: all volumes already bound  {"podName": 
> "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.926Z  DEBUG cache/task.go:362 bind pod  {"podName": 
> "spark-pi-f4f19b902beac663-driver", "podUID": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z  INFO  client/kubeclient.go:112  bind pod to node  
> {"podName": "spark-pi-f4f19b902beac663-driver", "podUID": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "nodeID": "10.10.10.66"}
> 2024-06-18T15:15:28.933Z  INFO  cache/task.go:375 successfully bound pod  
> {"podName": "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.934Z  INFO  cache/task_state.go:380 Task state transition 
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Allocated", 
> "destination": "Bound", "event": "TaskBound"}
> 2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:29.278Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:411 Putting pod 
> in cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:30.279Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:31.279Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:32.280Z  DEBUG scheduler/scheduler.go:157  inspect 
> outstanding requests
> 2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdateNode.Pre)  {"nodes": 3, "pods": 56, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:179 Updating node 
> in cache  {"nodeName": "10.10.10.66"}
> 2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:558 Scheduler 
> cache state (UpdateNode.Post) {"nodes": 3, "pods": 56, "assumed": 0, 
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:32.395Z  DEBUG cache/node.go:109 set node capacity 
> {"nodeID": "10.10.10.66", "capacity": "resources:<key:\"ephemeral-storage\" 
> value:<value:478652923105 > > resources:<key:\"hugepages-1Gi\" value:<> > 
> resources:<key:\"hugepages-2Mi\" value:<> > 
> resources:<key:\"kubernetes.io/batch-cpu\" value:<value:5167 > > 
> resources:<key:\"kubernetes.io/batch-memory\" value:<value:34437364710 > > 
> resources:<key:\"kubernetes.io/mid-cpu\" value:<value:3156 > > 
> resources:<key:\"kubernetes.io/mid-memory\" value:<value:1442139746 > > 
> resources:<key:\"memory\" value:<value:64713940992 > > 
> resources:<key:\"pods\" value:<value:110 > > resources:<key:\"vcore\" 
> value:<value:14000 > > "}
> 2024-06-18T15:15:32.396Z  INFO  cache/nodes.go:179  Node's ready status flag  
> {"Node name": "10.10.10.66", "ready": true}
> 2024-06-18T15:15:32.396Z  INFO  cache/nodes.go:184  report updated nodes to 
> scheduler {"request": 
> {"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}}
> 2024-06-18T15:15:32.396Z  DEBUG scheduler/scheduler.go:96 enqueued event  
> {"eventType": "*rmevent.RMUpdateNodeEvent", "event": 
> {"Request":{"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true","si/node-partition":"[mycluster]default"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}},
>  "currentQueueSize": 0}
> 2024-06-18T15:15:32.396Z  INFO  objects/queue.go:1190 updating root queue max 
> resources {"current max": "map[ephemeral-storage:1435958769315 
> hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19610 
> kubernetes.io/batch-memory:109317941106 kubernetes.io/mid-cpu:9574 
> kubernetes.io/mid-memory:5209251314 memory:194141839360 pods:330 
> vcore:42000]", "new max": "map[ephemeral-storage:1435958769315 
> hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19309 
> kubernetes.io/batch-memory:109530403859 kubernetes.io/mid-cpu:10574 
> kubernetes.io/mid-memory:6651391060 memory:194141839360 pods:330 
> vcore:42000]"}
> 2024-06-18T15:15:32.396Z  DEBUG objects/node.go:182 skip updating 
> occupiedResource, not changed {code}
> h2. Spark Pod status
> kubectl get pod -n spark|grep Running
> !image-2024-06-19-00-02-53-178.png!
> kubectl describe pod spark-pi-xxxxxxxx-driver -n spark
> !image-2024-06-19-00-03-09-703.png!
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to