[
https://issues.apache.org/jira/browse/YUNIKORN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856032#comment-17856032
]
Craig Condit commented on YUNIKORN-2682:
----------------------------------------
It looks like you're using the LimitRange controller with policies that force
specifying limits for cpu and memory. YuniKorn does not support that in 1.3, so
the placeholder creation fails. You will need to either: upgrade to YuniKorn
1.4.0 or later (which will require Kubernetes to be >= 1.24.0), or disable the
LimitRange controller.
> YuniKorn Gang Scheduling Issue: Executors Failing to Start When Running
> Multiple Applications
> ---------------------------------------------------------------------------------------------
>
> Key: YUNIKORN-2682
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2682
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.3.0
> Reporter: huangzhir
> Priority: Major
> Attachments: image-2024-06-19-00-02-53-178.png,
> image-2024-06-19-00-03-09-703.png
>
>
> h2. Description:
> While using YuniKorn's gang scheduling, we encountered a situation where the
> scheduling process appears to succeed, but in reality, there is a problem.
> When submitting two applications simultaneously, only the driver pods are
> successfully running, and the executor pods fail to start due to insufficient
> resources. The following error is observed in the scheduler logs:
> {code:java}
> 2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to
> create placeholder pod {"error": "pods
> \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed
> quota: spark-quota: must specify limits.cpu,limits.memory"}
> github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
> github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
> github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
> github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542 {code}
> h2. Environment:
> * YuniKorn version: 1.3.0
> * Kubernetes version: 1.21.3
> * Spark version: 3.2.2
> h2. *resource-quota.yaml*
> {code:java}
> apiVersion: v1
> kind: ResourceQuota
> metadata:
> name: spark-quota
> namespace: spark
> spec:
> hard:
> requests.cpu: "5"
> requests.memory: "5Gi"
> limits.cpu: "5"
> limits.memory: "5Gi" {code}
> h2. yunikorn-configs.yaml
> {code:java}
> apiVersion: v1
> kind: ConfigMap
> metadata:
> name: yunikorn-configs
> namespace: yunikorn
> data:
> log.level: "-1"
> log.admission.level: "-1"
> log.core.config.level: "-1"
> queues.yaml: |
> partitions:
> - name: default
> placementrules:
> - name: tag
> value: namespace
> create: true
> queues:
> - name: root
> submitacl: '*'
> properties:
> application.sort.policy: fifo
> placeholderTimeoutInSeconds: 60
> schedulingStyle: Hard
> queues:
> - name: spark
> properties:
> application.sort.policy: fifo
> placeholderTimeoutInSeconds: 60
> schedulingStyle: Hard
> resources:
> guaranteed:
> vcore: 5
> memory: 5Gi
> max:
> vcore: 5
> memory: 5Gi {code}
> h2. Spark-submit command
> {code:java}
> ./bin/spark-submit \
> --master k8s://https://10.10.10.10:6443 \
> --deploy-mode cluster \
> --name spark-pi \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=sparksa \
> --conf spark.kubernetes.namespace=spark \
> --class org.apache.spark.examples.SparkPi \
> --conf spark.executor.instances=1 \
> --conf spark.executor.cores=1 \
> --conf spark.executor.memory=1500m \
> --conf spark.driver.cores=1 \
> --conf spark.driver.memory=1500m \
> --conf spark.kubernetes.driver.limit.cores=1 \
> --conf spark.kubernetes.driver.limit.memory=2G \
> --conf spark.kubernetes.executor.limit.cores=1 \
> --conf spark.kubernetes.executor.limit.memory=2G \
> --conf spark.kubernetes.driver.label.app=spark \
> --conf spark.kubernetes.executor.label.app=spark \
> --conf spark.kubernetes.container.image=apache/spark:v3.3.2 \
> --conf spark.kubernetes.scheduler.name=yunikorn \
> --conf spark.kubernetes.driver.label.queue=root.spark \
> --conf spark.kubernetes.executor.label.queue=root.spark \
> --conf
> spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
> --conf
> spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
> --conf
> spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name=spark-driver
> \
> --conf
> spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups='[{"name":
> "spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory":
> "2Gi"},"nodeSelector": {"app": "spark"} }, {"name": "spark-executor",
> "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector":
> {"app": "spark"} }]' \
> --conf
> spark.kubernetes.driver.annotation.yunikorn.apache.org/schedulingPolicyParameters='placeholderTimeoutInSeconds=60
> gangSchedulingStyle=Hard' \
> --conf
> spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name=spark-executor
> \
> local:///opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \
> 3000 {code}
>
> h2. scheduler logs
> kubectl logs yunikorn-scheduler-56f599846b-8fl7d yunikorn-scheduler-k8s -n
> yunikorn
> {code:java}
> 2024-06-18T15:15:26.201Z DEBUG general/general.go:141 pod added {"appType":
> "general", "Name": "spark-pi-f4f19b902beac663-driver", "Namespace": "spark"}
> 2024-06-18T15:15:26.201Z DEBUG utils/utils.go:305 Unable to retrieve user
> name from pod labels. Empty user label {"userLabel":
> "yunikorn.apache.org/username"}
> 2024-06-18T15:15:26.201Z DEBUG cache/context.go:737 AddApplication
> {"Request":
> {"Metadata":{"ApplicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","QueueName":"root.spark","User":"nobody","Tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
> gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\":
> \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\":
> \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"}
> }]"},"Groups":null,"TaskGroups":[{"name":"spark-driver","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}},{"name":"spark-executor","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}}],"OwnerReferences":[{"apiVersion":"v1","kind":"Pod","name":"spark-pi-f4f19b902beac663-driver","uid":"4fb897dd-3af5-4799-a09d-640b5222ba3a","controller":false,"blockOwnerDeletion":true}],"SchedulingPolicyParameters":{},"CreationTime":0}}}
> 2024-06-18T15:15:26.201Z DEBUG cache/context.go:746 app namespace info
> {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "namespace": "spark"}
> 2024-06-18T15:15:26.201Z INFO cache/context.go:773 app added {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.201Z DEBUG cache/context.go:841 AddTask {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z DEBUG cache/context.go:233 adding pod to cache
> {"podName": "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:26.201Z INFO cache/context.go:863 task added {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskState": "New"}
> 2024-06-18T15:15:26.201Z INFO cache/context.go:873 app request originating
> pod added {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "original
> task": "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (AddPod.Pre) {"nodes": 3, "pods": 55, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases":
> {"Pending":2,"Running":53}}
> 2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (AddPod.Post) {"nodes": 3, "pods": 56, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:26.277Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:26.924Z DEBUG cache/application_state.go:508 shim app
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
> "New", "destination": "Submitted", "event": "SubmitApplication"}
> 2024-06-18T15:15:26.924Z INFO cache/application.go:424 handle app
> submission {"app": "applicationID: spark-8e410a4c5ce44da2aa85ba835257a1e9,
> queue: root.spark, partition: default, totalNumOfTasks: 1, currentState:
> Submitted", "clusterID": "mycluster"}
> 2024-06-18T15:15:26.924Z DEBUG scheduler/scheduler.go:96 enqueued event
> {"eventType": "*rmevent.RMUpdateApplicationEvent", "event":
> {"Request":{"new":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","queueName":"root.spark","partitionName":"[mycluster]default","ugi":{"user":"nobody"},"tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
> gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\":
> \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\":
> \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
> \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"}
> }]"},"executionTimeoutMilliSeconds":60000,"placeholderAsk":{"resources":{"memory":{"value":4294967296},"pods":{"value":2},"vcore":{"value":2000}}},"gangSchedulingStyle":"Hard"}],"rmID":"mycluster"}},
> "currentQueueSize": 0}
> 2024-06-18T15:15:26.924Z DEBUG placement/placement.go:145 Executing rule
> for placing application {"ruleName": "tag", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.924Z DEBUG placement/tag_rule.go:106 Tag rule
> intermediate result {"application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue": "root.spark"}
> 2024-06-18T15:15:26.924Z INFO placement/tag_rule.go:115 Tag rule
> application placed {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "queue": "root.spark"}
> 2024-06-18T15:15:26.924Z DEBUG placement/placement.go:204 Rule result for
> placing application {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "queueName": "root.spark"}
> 2024-06-18T15:15:26.925Z INFO scheduler/context.go:549 Added application
> to partition {"applicationID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "partitionName": "[mycluster]default", "requested queue": "root.spark",
> "placed queue": "root.spark"}
> 2024-06-18T15:15:26.925Z DEBUG rmproxy/rmproxy.go:59 enqueue event
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
> {"RmID":"mycluster","AcceptedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9"}],"RejectedApplications":[],"UpdatedApplications":null},
> "currentQueueSize": 0}
> 2024-06-18T15:15:26.925Z DEBUG callback/scheduler_callback.go:108
> UpdateApplication callback received {"UpdateApplicationResponse":
> "accepted:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > "}
> 2024-06-18T15:15:26.925Z DEBUG callback/scheduler_callback.go:114 callback:
> response to accepted application {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.925Z INFO callback/scheduler_callback.go:118 Accepting
> app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:26.925Z DEBUG cache/application_state.go:508 shim app
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
> "Submitted", "destination": "Accepted", "event": "AcceptApplication"}
> 2024-06-18T15:15:27.277Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:27.925Z DEBUG cache/application.go:516 postAppAccepted on
> cached app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "numTaskGroups": 2, "numAllocatedTasks": 0}
> 2024-06-18T15:15:27.925Z INFO cache/application.go:526 app has taskGroups
> defined, trying to reserve resources for gang members {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.925Z DEBUG cache/application_state.go:508 shim app
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
> "Accepted", "destination": "Reserving", "event": "TryReserve"}
> 2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to
> create placeholder pod {"error": "pods
> \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed
> quota: spark-quota: must specify limits.cpu,limits.memory"}
> github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
> github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
> github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
> github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542
> 2024-06-18T15:15:27.933Z INFO cache/placeholder_manager.go:115 start to
> clean up app placeholders {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.933Z INFO cache/placeholder_manager.go:128 finished
> cleaning up app placeholders {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
> 2024-06-18T15:15:27.933Z DEBUG cache/application_state.go:508 shim app
> state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
> "Reserving", "destination": "Running", "event": "RunApplication"}
> 2024-06-18T15:15:28.278Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:28.867Z DEBUG scheduler/partition_manager.go:83 time
> consumed for queue cleaner {"duration": "6.41µs"}
> 2024-06-18T15:15:28.925Z INFO cache/task_state.go:380 Task state transition
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
> "spark/spark-pi-f4f19b902beac663-driver", "source": "New", "destination":
> "Pending", "event": "InitTask"}
> 2024-06-18T15:15:28.926Z INFO cache/task_state.go:380 Task state transition
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Pending", "destination":
> "Scheduling", "event": "SubmitTask"}
> 2024-06-18T15:15:28.926Z DEBUG cache/task.go:275 scheduling pod {"podName":
> "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.926Z DEBUG cache/task.go:294 send update request
> {"request": "asks:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\"
> applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
> resourceAsk:<resources:<key:\"memory\" value:<value:1975517184 > >
> resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\"
> value:<value:1000 > > > maxAllocations:1
> tags:<key:\"kubernetes.io/label/app\" value:\"spark\" >
> tags:<key:\"kubernetes.io/label/queue\" value:\"root.spark\" >
> tags:<key:\"kubernetes.io/label/spark-app-name\" value:\"spark-pi\" >
> tags:<key:\"kubernetes.io/label/spark-app-selector\"
> value:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" >
> tags:<key:\"kubernetes.io/label/spark-role\" value:\"driver\" >
> tags:<key:\"kubernetes.io/label/spark-version\" value:\"3.3.2\" >
> tags:<key:\"kubernetes.io/meta/namespace\" value:\"spark\" >
> tags:<key:\"kubernetes.io/meta/podName\"
> value:\"spark-pi-f4f19b902beac663-driver\" > taskGroupName:\"spark-driver\"
> Originator:true preemptionPolicy:<allowPreemptSelf:true
> allowPreemptOther:true > > rmID:\"mycluster\" "}
> 2024-06-18T15:15:28.926Z DEBUG scheduler/scheduler.go:96 enqueued event
> {"eventType": "*rmevent.RMUpdateAllocationEvent", "event":
> {"Request":{"asks":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","partitionName":"[mycluster]default","resourceAsk":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"maxAllocations":1,"tags":{"kubernetes.io/label/app":"spark","kubernetes.io/label/queue":"root.spark","kubernetes.io/label/spark-app-name":"spark-pi","kubernetes.io/label/spark-app-selector":"spark-8e410a4c5ce44da2aa85ba835257a1e9","kubernetes.io/label/spark-role":"driver","kubernetes.io/label/spark-version":"3.3.2","kubernetes.io/meta/namespace":"spark","kubernetes.io/meta/podName":"spark-pi-f4f19b902beac663-driver"},"taskGroupName":"spark-driver","Originator":true,"preemptionPolicy":{"allowPreemptSelf":true,"allowPreemptOther":true}}],"rmID":"mycluster"}},
> "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z INFO objects/application_state.go:133 Application
> state transition {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "source": "New", "destination": "Accepted", "event": "runApplication"}
> 2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Accepted","stateTransitionTimestamp":1718723728926391633,"message":"runApplication"}]},
> "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z INFO objects/application.go:668 ask added
> successfully to application {"appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "ask":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "placeholder": false, "pendingDelta":
> "map[memory:1975517184 pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:108
> UpdateApplication callback received {"UpdateApplicationResponse":
> "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
> state:\"Accepted\" stateTransitionTimestamp:1718723728926391633
> message:\"runApplication\" > "}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:137 status
> update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "new status": "Accepted"}
> 2024-06-18T15:15:28.926Z DEBUG objects/application.go:339 Application state
> timer initiated {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "state":
> "Starting", "timeout": "5m0s"}
> 2024-06-18T15:15:28.926Z INFO objects/application_state.go:133 Application
> state transition {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "source": "Accepted", "destination": "Starting", "event": "runApplication"}
> 2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event
> {"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Starting","stateTransitionTimestamp":1718723728926635454,"message":"runApplication"}]},
> "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z DEBUG ugm/manager.go:63 Increasing resource usage
> {"user": "nobody", "queue path": "root.spark", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue
> tracker object for queue {"queue": "root"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
> usage {"queue path": "root.spark", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue
> tracker object for queue {"queue": "spark"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
> usage {"queue path": "spark", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/manager.go:257 Group tracker does not
> exist. Creating group tracker object and linking the same with application
> {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue path":
> "root.spark", "user": "nobody", "group": "nobody"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue
> tracker object for queue {"queue": "root"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
> usage {"queue path": "root.spark", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue
> tracker object for queue {"queue": "spark"}
> 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
> usage {"queue path": "spark", "application":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
> pods:1 vcore:1000]"}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:108
> UpdateApplication callback received {"UpdateApplicationResponse":
> "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
> state:\"Starting\" stateTransitionTimestamp:1718723728926635454
> message:\"runApplication\" > "}
> 2024-06-18T15:15:28.926Z DEBUG objects/queue.go:1239 allocation found on
> queue {"queueName": "root.spark", "appID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "allocation":
> "applicationID=spark-8e410a4c5ce44da2aa85ba835257a1e9,
> uuid=3e42c4e0-0fc3-466c-a524-444b2eec700b,
> allocationKey=4fb897dd-3af5-4799-a09d-640b5222ba3a, Node=10.10.10.66,
> result=Allocated"}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:137 status
> update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "new status": "Starting"}
> 2024-06-18T15:15:28.926Z INFO scheduler/partition.go:888 scheduler
> allocation processed {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
> "allocationKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "uuid":
> "3e42c4e0-0fc3-466c-a524-444b2eec700b", "allocatedResource":
> "map[memory:1975517184 pods:1 vcore:1000]", "placeholder": false,
> "targetNode": "10.10.10.66"}
> 2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event
> {"eventType": "*rmevent.RMNewAllocationsEvent", "event":
> {"RmID":"mycluster","Allocations":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","UUID":"3e42c4e0-0fc3-466c-a524-444b2eec700b","resourcePerAlloc":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"nodeID":"10.10.10.66","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","taskGroupName":"spark-driver"}]},
> "currentQueueSize": 0}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:48
> UpdateAllocation callback received {"UpdateAllocationResponse":
> "new:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\"
> UUID:\"3e42c4e0-0fc3-466c-a524-444b2eec700b\"
> resourcePerAlloc:<resources:<key:\"memory\" value:<value:1975517184 > >
> resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\"
> value:<value:1000 > > > nodeID:\"10.10.10.66\"
> applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
> taskGroupName:\"spark-driver\" > "}
> 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:53 callback:
> response to new allocation {"allocationKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "UUID":
> "3e42c4e0-0fc3-466c-a524-444b2eec700b", "applicationID":
> "spark-8e410a4c5ce44da2aa85ba835257a1e9", "nodeID": "10.10.10.66"}
> 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (AssumePod.Pre) {"nodes": 3, "pods": 56, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:476 Adding
> assumed pod to cache {"podName": "spark-pi-f4f19b902beac663-driver",
> "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "node": "10.10.10.66",
> "allBound": true}
> 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (AssumePod.Post) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.926Z DEBUG scheduler/context.go:853 Successfully synced
> shim on new allocation. response: no. of allocations: 1
> 2024-06-18T15:15:28.926Z INFO cache/task_state.go:380 Task state transition
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Scheduling",
> "destination": "Allocated", "event": "TaskAllocated"}
> 2024-06-18T15:15:28.926Z DEBUG cache/task.go:349 bind pod volumes
> {"podName": "spark-pi-f4f19b902beac663-driver", "podUID":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z INFO cache/context.go:499 Binding Pod Volumes
> skipped: all volumes already bound {"podName":
> "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.926Z DEBUG cache/task.go:362 bind pod {"podName":
> "spark-pi-f4f19b902beac663-driver", "podUID":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.926Z INFO client/kubeclient.go:112 bind pod to node
> {"podName": "spark-pi-f4f19b902beac663-driver", "podUID":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "nodeID": "10.10.10.66"}
> 2024-06-18T15:15:28.933Z INFO cache/task.go:375 successfully bound pod
> {"podName": "spark-pi-f4f19b902beac663-driver"}
> 2024-06-18T15:15:28.934Z INFO cache/task_state.go:380 Task state transition
> {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
> "spark/spark-pi-f4f19b902beac663-driver", "source": "Allocated",
> "destination": "Bound", "event": "TaskBound"}
> 2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:29.278Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":3,"Running":53}}
> 2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:411 Putting pod
> in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
> "4fb897dd-3af5-4799-a09d-640b5222ba3a"}
> 2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:30.279Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:31.279Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:32.280Z DEBUG scheduler/scheduler.go:157 inspect
> outstanding requests
> 2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdateNode.Pre) {"nodes": 3, "pods": 56, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:179 Updating node
> in cache {"nodeName": "10.10.10.66"}
> 2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:558 Scheduler
> cache state (UpdateNode.Post) {"nodes": 3, "pods": 56, "assumed": 0,
> "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
> {"Pending":2,"Running":54}}
> 2024-06-18T15:15:32.395Z DEBUG cache/node.go:109 set node capacity
> {"nodeID": "10.10.10.66", "capacity": "resources:<key:\"ephemeral-storage\"
> value:<value:478652923105 > > resources:<key:\"hugepages-1Gi\" value:<> >
> resources:<key:\"hugepages-2Mi\" value:<> >
> resources:<key:\"kubernetes.io/batch-cpu\" value:<value:5167 > >
> resources:<key:\"kubernetes.io/batch-memory\" value:<value:34437364710 > >
> resources:<key:\"kubernetes.io/mid-cpu\" value:<value:3156 > >
> resources:<key:\"kubernetes.io/mid-memory\" value:<value:1442139746 > >
> resources:<key:\"memory\" value:<value:64713940992 > >
> resources:<key:\"pods\" value:<value:110 > > resources:<key:\"vcore\"
> value:<value:14000 > > "}
> 2024-06-18T15:15:32.396Z INFO cache/nodes.go:179 Node's ready status flag
> {"Node name": "10.10.10.66", "ready": true}
> 2024-06-18T15:15:32.396Z INFO cache/nodes.go:184 report updated nodes to
> scheduler {"request":
> {"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}}
> 2024-06-18T15:15:32.396Z DEBUG scheduler/scheduler.go:96 enqueued event
> {"eventType": "*rmevent.RMUpdateNodeEvent", "event":
> {"Request":{"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true","si/node-partition":"[mycluster]default"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}},
> "currentQueueSize": 0}
> 2024-06-18T15:15:32.396Z INFO objects/queue.go:1190 updating root queue max
> resources {"current max": "map[ephemeral-storage:1435958769315
> hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19610
> kubernetes.io/batch-memory:109317941106 kubernetes.io/mid-cpu:9574
> kubernetes.io/mid-memory:5209251314 memory:194141839360 pods:330
> vcore:42000]", "new max": "map[ephemeral-storage:1435958769315
> hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19309
> kubernetes.io/batch-memory:109530403859 kubernetes.io/mid-cpu:10574
> kubernetes.io/mid-memory:6651391060 memory:194141839360 pods:330
> vcore:42000]"}
> 2024-06-18T15:15:32.396Z DEBUG objects/node.go:182 skip updating
> occupiedResource, not changed {code}
> h2. Spark Pod status
> kubectl get pod -n spark|grep Running
> !image-2024-06-19-00-02-53-178.png!
> kubectl describe pod spark-pi-xxxxxxxx-driver -n spark
> !image-2024-06-19-00-03-09-703.png!
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]