[
https://issues.apache.org/jira/browse/YUNIKORN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Craig Condit closed YUNIKORN-2682.
----------------------------------
Assignee: Craig Condit
Resolution: Workaround
> YuniKorn Gang Scheduling Issue: Executors Failing to Start When Running
> Multiple Applications
> ---------------------------------------------------------------------------------------------
>
> Key: YUNIKORN-2682
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2682
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.3.0
> Reporter: huangzhir
> Assignee: Craig Condit
> Priority: Major
> Attachments: image-2024-06-19-00-02-53-178.png,
> image-2024-06-19-00-03-09-703.png
>
>
> h2. Description:
> While using YuniKorn's gang scheduling, we encountered a situation where the
> scheduling process appears to succeed, but in reality, there is a problem.
> When submitting two applications simultaneously, only the driver pods are
> successfully running, and the executor pods fail to start due to insufficient
> resources. The following error is observed in the scheduler logs:
> {code:java}
> 2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to
> create placeholder pod {"error": "pods
> \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed
> quota: spark-quota: must specify limits.cpu,limits.memory"}
> github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
> github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
> github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
> github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542 {code}
> h2. Environment:
> * YuniKorn version: 1.3.0
> * Kubernetes version: 1.21.3
> * Spark version: 3.2.2
> h2. *resource-quota.yaml*
> {code:java}
> apiVersion: v1
> kind: ResourceQuota
> metadata:
> name: spark-quota
> namespace: spark
> spec:
> hard:
> requests.cpu: "5"
> requests.memory: "5Gi"
> limits.cpu: "5"
> limits.memory: "5Gi" {code}
> h2. yunikorn-configs.yaml
> {code:java}
> apiVersion: v1
> kind: ConfigMap
> metadata:
> name: yunikorn-configs
> namespace: yunikorn
> data:
> log.level: "-1"
> log.admission.level: "-1"
> log.core.config.level: "-1"
> queues.yaml: |
> partitions:
> - name: default
> placementrules:
> - name: tag
> value: namespace
> create: true
> queues:
> - name: root
> submitacl: '*'
> properties:
> application.sort.policy: fifo
> placeholderTimeoutInSeconds: 60
> schedulingStyle: Hard
> queues:
> - name: spark
> properties:
> application.sort.policy: fifo
> placeholderTimeoutInSeconds: 60
> schedulingStyle: Hard
> resources:
> guaranteed:
> vcore: 5
> memory: 5Gi
> max:
> vcore: 5
> memory: 5Gi {code}
> h2. Spark-submit command
> {code:java}
> ./bin/spark-submit \
> --master k8s://https://10.10.10.10:6443 \
> --deploy-mode cluster \
> --name spark-pi \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=sparksa \
> --conf spark.kubernetes.namespace=spark \
> --class org.apache.spark.examples.SparkPi \
> --conf spark.executor.instances=1 \
> --conf spark.executor.cores=1 \
> --conf spark.executor.memory=1500m \
> --conf spark.driver.cores=1 \
> --conf spark.driver.memory=1500m \
> --conf spark.kubernetes.driver.limit.cores=1 \
> --conf spark.kubernetes.driver.limit.memory=2G \
> --conf spark.kubernetes.executor.limit.cores=1 \
> --conf spark.kubernetes.executor.limit.memory=2G \
> --conf spark.kubernetes.driver.label.app=spark \
> --conf spark.kubernetes.executor.label.app=spark \
> --conf spark.kubernetes.container.image=apache/spark:v3.3.2 \
> --conf spark.kubernetes.scheduler.name=yunikorn \
> --conf spark.kubernetes.driver.label.queue=root.spark \
> --conf spark.kubernetes.executor.label.queue=root.spark \
> --conf
> spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
> --conf
> spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
> --conf
> spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name=spark-driver
> \
> --conf
> spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups='[{"name":
> "spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory":
> "2Gi"},"nodeSelector": {"app": "spark"} }, {"name": "spark-executor",
> "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector":
> {"app": "spark"} }]' \
> --conf
> spark.kubernetes.driver.annotation.yunikorn.apache.org/schedulingPolicyParameters='placeholderTimeoutInSeconds=60
> gangSchedulingStyle=Hard' \
> --conf
> spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name=spark-executor
> \
> local:///opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \
> 3000 {code}
>
> h2. scheduler logs
> kubectl logs yunikorn-scheduler-56f599846b-8fl7d yunikorn-scheduler-k8s -n
> yunikorn
> {code:java}
> 2024-06-18T15:15:26.201Z DEBUG general/general.go:141 pod added {"appType":
> "general", "Name": "spark-pi-f4f19b902beac663-driver", "Namespace": "spark"}
> 2024-06-18T15:15:26.201Z DEBUG utils/utils.go:305 Unable to retrieve user
> name from pod labels. Empty user label {"userLabel":
> "yunikorn.apache.org/username"}
> 2024-06-18T15:15:26.201Z DEBUG cache/context.go:737 AddApplication
> {"Request":
> {"Metadata":{"ApplicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","QueueName":"root.spark","User":"nobody","Tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
> gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\":
> \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\":
> \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"}
> }]"},"Groups":null,"TaskGroups":[{"name":"spark-driver","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}},{"name":"spark-executor","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}}],"OwnerReferences":[{"apiVersion":"v1","kind":"Pod","name":"spark-pi-f4f19b902beac663-driver","uid":"4fb897dd-3af5-4799-a09d-640b5222ba3a","controller":false,"blockOwnerDeletion":true}],"SchedulingPolicyParameters":{},"CreationTime":0}}}
> 2024-06-18T15:15:26.201Z DEBUG cache/context.go:746 app namespace info
> {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "namespace": "spark"}
> 2024-06-18T15:15:26.201Z INFO cache/context.go:773 app added {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.201Z DEBUG cache/context.go:841 AddTask {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z DEBUG cache/context.go:233 adding pod to cache
> {"podName": "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:26.201Z INFO cache/context.go:863 task added {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskState": "New"}
> 2024-06-18T15:15:26.201Z INFO cache/context.go:873 app request originating
> pod added {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "original
> task": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (AddPod.Pre) {"nodes": 3, "pods": 55, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases":
> {"Pending":2,"Running":53}}
> 2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (AddPod.Post) {"nodes": 3, "pods": 56, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:26.277Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:26.924Z DEBUG cache/application_state.go:508 shim app
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
> "New", "destination": "Submitted", "event": "SubmitApplication"}
> 2024-06-18T15:15:26.924Z INFO cache/application.go:424 handle app
> submission {"app": "applicationID: spark-8e410a4c5ce44da2aa85ba835257a1e9,
> queue: root.spark, partition: default, totalNumOfTasks: 1, currentState:
> Submitted", "clusterID": "mycluster"}
> 2024-06-18T15:15:26.924Z DEBUG scheduler/scheduler.go:96 enqueued event
> {"eventType": "*rmevent.RMUpdateApplicationEvent", "event":
> {"Request":{"new":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","queueName":"root.spark","partitionName":"[mycluster]default","ugi":{"user":"nobody"},"tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
> gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\":
> \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\":
> \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"}
> }]"},"executionTimeoutMilliSeconds":60000,"placeholderAsk":{"resources":{"memory":{"value":4294967296},"pods":{"value":2},"vcore":{"value":2000}}},"gangSchedulingStyle":"Hard"}],"rmID":"mycluster"}},
> "currentQueueSize": 0}
> 2024-06-18T15:15:26.924Z DEBUG placement/placement.go:145 Executing rule
> for placing application {"ruleName": "tag", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.924Z DEBUG placement/tag_rule.go:106 Tag rule
> intermediate result {"application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue": "root.spark"}
> 2024-06-18T15:15:26.924Z INFO placement/tag_rule.go:115 Tag rule
> application placed {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "queue": "root.spark"}
> 2024-06-18T15:15:26.924Z DEBUG placement/placement.go:204 Rule result for
> placing application {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "queueName": "root.spark"}
> 2024-06-18T15:15:26.925Z INFO scheduler/context.go:549 Added application
> to partition {"applicationID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "partitionName": "[mycluster]default", "requested queue": "root.spark",
> "placed queue": "root.spark"}
> 2024-06-18T15:15:26.925Z DEBUG rmproxy/rmproxy.go:59 enqueue event
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
> {"RmID":"mycluster","AcceptedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9"}],"RejectedApplications":[],"UpdatedApplications":null},
> "currentQueueSize": 0}
> 2024-06-18T15:15:26.925Z DEBUG callback/scheduler_callback.go:108
> UpdateApplication callback received {"UpdateApplicationResponse":
> "accepted:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > "}
> 2024-06-18T15:15:26.925Z DEBUG callback/scheduler_callback.go:114 callback:
> response to accepted application {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.925Z INFO callback/scheduler_callback.go:118 Accepting
> app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.925Z DEBUG cache/application_state.go:508 shim app
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
> "Submitted", "destination": "Accepted", "event": "AcceptApplication"}
> 2024-06-18T15:15:27.277Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:27.925Z DEBUG cache/application.go:516 postAppAccepted on
> cached app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "numTaskGroups": 2, "numAllocatedTasks": 0}
> 2024-06-18T15:15:27.925Z INFO cache/application.go:526 app has taskGroups
> defined, trying to reserve resources for gang members {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.925Z DEBUG cache/application_state.go:508 shim app
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
> "Accepted", "destination": "Reserving", "event": "TryReserve"}
> 2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to
> create placeholder pod {"error": "pods
> \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed
> quota: spark-quota: must specify limits.cpu,limits.memory"}
> github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
> github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
> github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
> github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542
> 2024-06-18T15:15:27.933Z INFO cache/placeholder_manager.go:115 start to
> clean up app placeholders {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.933Z INFO cache/placeholder_manager.go:128 finished
> cleaning up app placeholders {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.933Z DEBUG cache/application_state.go:508 shim app
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
> "Reserving", "destination": "Running", "event": "RunApplication"}
> 2024-06-18T15:15:28.278Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:28.867Z DEBUG scheduler/partition_manager.go:83 time
> consumed for queue cleaner {"duration": "6.41µs"}
> 2024-06-18T15:15:28.925Z INFO cache/task_state.go:380 Task state transition
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
> "spark/spark-pi-f4f19b902beac663-driver", "source": "New", "destination":
> "Pending", "event": "InitTask"}
> 2024-06-18T15:15:28.926Z INFO cache/task_state.go:380 Task state transition
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Pending", "destination":
> "Scheduling", "event": "SubmitTask"}
> 2024-06-18T15:15:28.926Z DEBUG cache/task.go:275 scheduling pod {"podName":
> "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.926Z DEBUG cache/task.go:294 send update request
> {"request": "asks:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\"
> applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
> resourceAsk:<resources:<key:\"memory\" value:<value:1975517184 > >
> resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\"
> value:<value:1000 > > > maxAllocations:1
> tags:<key:\"kubernetes.io/label/app\" value:\"spark\" >
> tags:<key:\"kubernetes.io/label/queue\" value:\"root.spark\" >
> tags:<key:\"kubernetes.io/label/spark-app-name\" value:\"spark-pi\" >
> tags:<key:\"kubernetes.io/label/spark-app-selector\"
> value:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" >
> tags:<key:\"kubernetes.io/label/spark-role\" value:\"driver\" >
> tags:<key:\"kubernetes.io/label/spark-version\" value:\"3.3.2\" >
> tags:<key:\"kubernetes.io/meta/namespace\" value:\"spark\" >
> tags:<key:\"kubernetes.io/meta/podName\"
> value:\"spark-pi-f4f19b902beac663-driver\" > taskGroupName:\"spark-driver\"
> Originator:true preemptionPolicy:<allowPreemptSelf:true
> allowPreemptOther:true > > rmID:\"mycluster\" "}
> 2024-06-18T15:15:28.926Z DEBUG scheduler/scheduler.go:96 enqueued event
> {"eventType": "*rmevent.RMUpdateAllocationEvent", "event":
> {"Request":{"asks":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","partitionName":"[mycluster]default","resourceAsk":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"maxAllocations":1,"tags":{"kubernetes.io/label/app":"spark","kubernetes.io/label/queue":"root.spark","kubernetes.io/label/spark-app-name":"spark-pi","kubernetes.io/label/spark-app-selector":"spark-8e410a4c5ce44da2aa85ba835257a1e9","kubernetes.io/label/spark-role":"driver","kubernetes.io/label/spark-version":"3.3.2","kubernetes.io/meta/namespace":"spark","kubernetes.io/meta/podName":"spark-pi-f4f19b902beac663-driver"},"taskGroupName":"spark-driver","Originator":true,"preemptionPolicy":{"allowPreemptSelf":true,"allowPreemptOther":true}}],"rmID":"mycluster"}},
> "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z INFO objects/application_state.go:133 Application
> state transition {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "source": "New", "destination": "Accepted", "event": "runApplication"}
> 2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Accepted","stateTransitionTimestamp":1718723728926391633,"message":"runApplication"}]},
> "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z INFO objects/application.go:668 ask added
> successfully to application {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "ask":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "placeholder": false, "pendingDelta":
> "map[memory:1975517184 pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:108
> UpdateApplication callback received {"UpdateApplicationResponse":
> "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
> state:\"Accepted\" stateTransitionTimestamp:1718723728926391633
> message:\"runApplication\" > "}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:137 status
> update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "new status": "Accepted"}
> 2024-06-18T15:15:28.926Z DEBUG objects/application.go:339 Application state
> timer initiated {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "state":
> "Starting", "timeout": "5m0s"}
> 2024-06-18T15:15:28.926Z INFO objects/application_state.go:133 Application
> state transition {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "source": "Accepted", "destination": "Starting", "event": "runApplication"}
> 2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Starting","stateTransitionTimestamp":1718723728926635454,"message":"runApplication"}]},
> "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z DEBUG ugm/manager.go:63 Increasing resource usage
> {"user": "nobody", "queue path": "root.spark", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue
> tracker object for queue {"queue": "root"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
> usage {"queue path": "root.spark", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue
> tracker object for queue {"queue": "spark"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
> usage {"queue path": "spark", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/manager.go:257 Group tracker does not
> exist. Creating group tracker object and linking the same with application
> {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue path":
> "root.spark", "user": "nobody", "group": "nobody"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue
> tracker object for queue {"queue": "root"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
> usage {"queue path": "root.spark", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue
> tracker object for queue {"queue": "spark"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
> usage {"queue path": "spark", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:108
> UpdateApplication callback received {"UpdateApplicationResponse":
> "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
> state:\"Starting\" stateTransitionTimestamp:1718723728926635454
> message:\"runApplication\" > "}
> 2024-06-18T15:15:28.926Z DEBUG objects/queue.go:1239 allocation found on
> queue {"queueName": "root.spark", "appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "allocation":
> "applicationID=spark-8e410a4c5ce44da2aa85ba835257a1e9,
> uuid=3e42c4e0-0fc3-466c-a524-444b2eec700b,
> allocationKey=4fb897dd-3af5-4799-a09d-640b5222ba3a, Node=10.10.10.66,
> result=Allocated"}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:137 status
> update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "new status": "Starting"}
> 2024-06-18T15:15:28.926Z INFO scheduler/partition.go:888 scheduler
> allocation processed {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "allocationKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "uuid":
> "3e42c4e0-0fc3-466c-a524-444b2eec700b", "allocatedResource":
> "map[memory:1975517184 pods:1 vcore:1000]", "placeholder": false,
> "targetNode": "10.10.10.66"}
> 2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event
> {"eventType": "*rmevent.RMNewAllocationsEvent", "event":
> {"RmID":"mycluster","Allocations":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","UUID":"3e42c4e0-0fc3-466c-a524-444b2eec700b","resourcePerAlloc":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"nodeID":"10.10.10.66","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","taskGroupName":"spark-driver"}]},
> "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:48
> UpdateAllocation callback received {"UpdateAllocationResponse":
> "new:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\"
> UUID:\"3e42c4e0-0fc3-466c-a524-444b2eec700b\"
> resourcePerAlloc:<resources:<key:\"memory\" value:<value:1975517184 > >
> resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\"
> value:<value:1000 > > > nodeID:\"10.10.10.66\"
> applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
> taskGroupName:\"spark-driver\" > "}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:53 callback:
> response to new allocation {"allocationKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "UUID":
> "3e42c4e0-0fc3-466c-a524-444b2eec700b", "applicationID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "nodeID": "10.10.10.66"}
> 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (AssumePod.Pre) {"nodes": 3, "pods": 56, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:476 Adding
> assumed pod to cache {"podName": "spark-pi-f4f19b902beac663-driver",
> "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "node": "10.10.10.66",
> "allBound": true}
> 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (AssumePod.Post) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.926Z DEBUG scheduler/context.go:853 Successfully synced
> shim on new allocation. response: no. of allocations: 1
> 2024-06-18T15:15:28.926Z INFO cache/task_state.go:380 Task state transition
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Scheduling",
> "destination": "Allocated", "event": "TaskAllocated"}
> 2024-06-18T15:15:28.926Z DEBUG cache/task.go:349 bind pod volumes
> {"podName": "spark-pi-f4f19b902beac663-driver", "podUID":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z INFO cache/context.go:499 Binding Pod Volumes
> skipped: all volumes already bound {"podName":
> "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.926Z DEBUG cache/task.go:362 bind pod {"podName":
> "spark-pi-f4f19b902beac663-driver", "podUID":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z INFO client/kubeclient.go:112 bind pod to node
> {"podName": "spark-pi-f4f19b902beac663-driver", "podUID":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "nodeID": "10.10.10.66"}
> 2024-06-18T15:15:28.933Z INFO cache/task.go:375 successfully bound pod
> {"podName": "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.934Z INFO cache/task_state.go:380 Task state transition
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Allocated",
> "destination": "Bound", "event": "TaskBound"}
> 2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:29.278Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:30.279Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:31.279Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:32.280Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdateNode.Pre) {"nodes": 3, "pods": 56, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:179 Updating node
> in cache {"nodeName": "10.10.10.66"}
> 2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdateNode.Post) {"nodes": 3, "pods": 56, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:32.395Z DEBUG cache/node.go:109 set node capacity
> {"nodeID": "10.10.10.66", "capacity": "resources:<key:\"ephemeral-storage\"
> value:<value:478652923105 > > resources:<key:\"hugepages-1Gi\" value:<> >
> resources:<key:\"hugepages-2Mi\" value:<> >
> resources:<key:\"kubernetes.io/batch-cpu\" value:<value:5167 > >
> resources:<key:\"kubernetes.io/batch-memory\" value:<value:34437364710 > >
> resources:<key:\"kubernetes.io/mid-cpu\" value:<value:3156 > >
> resources:<key:\"kubernetes.io/mid-memory\" value:<value:1442139746 > >
> resources:<key:\"memory\" value:<value:64713940992 > >
> resources:<key:\"pods\" value:<value:110 > > resources:<key:\"vcore\"
> value:<value:14000 > > "}
> 2024-06-18T15:15:32.396Z INFO cache/nodes.go:179 Node's ready status flag
> {"Node name": "10.10.10.66", "ready": true}
> 2024-06-18T15:15:32.396Z INFO cache/nodes.go:184 report updated nodes to
> scheduler {"request":
> {"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}}
> 2024-06-18T15:15:32.396Z DEBUG scheduler/scheduler.go:96 enqueued event
> {"eventType": "*rmevent.RMUpdateNodeEvent", "event":
> {"Request":{"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true","si/node-partition":"[mycluster]default"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}},
> "currentQueueSize": 0}
> 2024-06-18T15:15:32.396Z INFO objects/queue.go:1190 updating root queue max
> resources {"current max": "map[ephemeral-storage:1435958769315
> hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19610
> kubernetes.io/batch-memory:109317941106 kubernetes.io/mid-cpu:9574
> kubernetes.io/mid-memory:5209251314 memory:194141839360 pods:330
> vcore:42000]", "new max": "map[ephemeral-storage:1435958769315
> hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19309
> kubernetes.io/batch-memory:109530403859 kubernetes.io/mid-cpu:10574
> kubernetes.io/mid-memory:6651391060 memory:194141839360 pods:330
> vcore:42000]"}
> 2024-06-18T15:15:32.396Z DEBUG objects/node.go:182 skip updating
> occupiedResource, not changed {code}
> h2. Spark Pod status
> kubectl get pod -n spark|grep Running
> !image-2024-06-19-00-02-53-178.png!
> kubectl describe pod spark-pi-xxxxxxxx-driver -n spark
> !image-2024-06-19-00-03-09-703.png!
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]