huangzhir created YUNIKORN-2682:
-----------------------------------
Summary: YuniKorn Gang Scheduling Issue: Executors Failing to
Start When Running Multiple Applications
Key: YUNIKORN-2682
URL: https://issues.apache.org/jira/browse/YUNIKORN-2682
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Affects Versions: 1.3.0
Reporter: huangzhir
Attachments: image-2024-06-19-00-02-53-178.png,
image-2024-06-19-00-03-09-703.png
h3. Description:
While using YuniKorn's gang scheduling, we encountered a situation where the
scheduling process appears to succeed, but in reality, there is a problem. When
submitting two applications simultaneously, only the driver pods are
successfully running, and the executor pods fail to start due to insufficient
resources. The following error is observed in the scheduler logs:
{code:java}
2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to create
placeholder pod {"error": "pods
\"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed quota:
spark-quota: must specify limits.cpu,limits.memory"}
github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542 {code}
h2. Environment:
* YuniKorn version: 1.3.0
* Kubernetes version: 1.21.3
* Spark version: 3.2.2
h2. *resource-quota.yaml*
{code:java}
apiVersion: v1
kind: ResourceQuota
metadata:
name: spark-quota
namespace: spark
spec:
hard:
requests.cpu: "5"
requests.memory: "5Gi"
limits.cpu: "5"
limits.memory: "5Gi" {code}
h2. yunikorn-configs.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: yunikorn-configs
namespace: yunikorn
data:
log.level: "-1"
log.admission.level: "-1"
log.core.config.level: "-1"
queues.yaml: |
partitions:
- name: default
placementrules:
- name: tag
value: namespace
create: true
queues:
- name: root
submitacl: '*'
properties:
application.sort.policy: fifo
placeholderTimeoutInSeconds: 60
schedulingStyle: Hard
queues:
- name: spark
properties:
application.sort.policy: fifo
placeholderTimeoutInSeconds: 60
schedulingStyle: Hard
resources:
guaranteed:
vcore: 5
memory: 5Gi
max:
vcore: 5
memory: 5Gi
h2. Spark-submit command
{code:java}
./bin/spark-submit \
--master k8s://https://10.10.10.10:6443 \
--deploy-mode cluster \
--name spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=sparksa \
--conf spark.kubernetes.namespace=spark \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=1 \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=1500m \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=1500m \
--conf spark.kubernetes.driver.limit.cores=1 \
--conf spark.kubernetes.driver.limit.memory=2G \
--conf spark.kubernetes.executor.limit.cores=1 \
--conf spark.kubernetes.executor.limit.memory=2G \
--conf spark.kubernetes.driver.label.app=spark \
--conf spark.kubernetes.executor.label.app=spark \
--conf spark.kubernetes.container.image=apache/spark:v3.3.2 \
--conf spark.kubernetes.scheduler.name=yunikorn \
--conf spark.kubernetes.driver.label.queue=root.spark \
--conf spark.kubernetes.executor.label.queue=root.spark \
--conf
spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
--conf
spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
--conf
spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name=spark-driver
\
--conf
spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups='[{"name":
"spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory":
"2Gi"},"nodeSelector": {"app": "spark"} }, {"name": "spark-executor",
"minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector":
{"app": "spark"} }]' \
--conf
spark.kubernetes.driver.annotation.yunikorn.apache.org/schedulingPolicyParameters='placeholderTimeoutInSeconds=60
gangSchedulingStyle=Hard' \
--conf
spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name=spark-executor
\
local:///opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \
3000 {code}
h2. scheduler logs
{code:java}
2024-06-18T15:15:26.201Z DEBUG general/general.go:141 pod added {"appType":
"general", "Name": "spark-pi-f4f19b902beac663-driver", "Namespace": "spark"}
2024-06-18T15:15:26.201Z DEBUG utils/utils.go:305 Unable to retrieve user
name from pod labels. Empty user label {"userLabel":
"yunikorn.apache.org/username"}
2024-06-18T15:15:26.201Z DEBUG cache/context.go:737 AddApplication
{"Request":
{"Metadata":{"ApplicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","QueueName":"root.spark","User":"nobody","Tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\":
\"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
\"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\":
\"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
\"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"}
}]"},"Groups":null,"TaskGroups":[{"name":"spark-driver","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}},{"name":"spark-executor","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}}],"OwnerReferences":[{"apiVersion":"v1","kind":"Pod","name":"spark-pi-f4f19b902beac663-driver","uid":"4fb897dd-3af5-4799-a09d-640b5222ba3a","controller":false,"blockOwnerDeletion":true}],"SchedulingPolicyParameters":{},"CreationTime":0}}}
2024-06-18T15:15:26.201Z DEBUG cache/context.go:746 app namespace info
{"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "namespace": "spark"}
2024-06-18T15:15:26.201Z INFO cache/context.go:773 app added {"appID":
"spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:26.201Z DEBUG cache/context.go:841 AddTask {"appID":
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID":
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:26.201Z DEBUG cache/context.go:233 adding pod to cache
{"podName": "spark-pi-f4f19b902beac663-driver"}
2024-06-18T15:15:26.201Z INFO cache/context.go:863 task added {"appID":
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID":
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskState": "New"}
2024-06-18T15:15:26.201Z INFO cache/context.go:873 app request originating
pod added {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "original task":
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (AddPod.Pre) {"nodes": 3, "pods": 55, "assumed": 0, "pendingAllocs": 0,
"inProgressAllocs": 0, "podsAssigned": 55, "phases": {"Pending":2,"Running":53}}
2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:411 Putting pod in
cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (AddPod.Post) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0,
"inProgressAllocs": 0, "podsAssigned": 55, "phases": {"Pending":3,"Running":53}}
2024-06-18T15:15:26.277Z DEBUG scheduler/scheduler.go:157 inspect outstanding
requests
2024-06-18T15:15:26.924Z DEBUG cache/application_state.go:508 shim app state
transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "New",
"destination": "Submitted", "event": "SubmitApplication"}
2024-06-18T15:15:26.924Z INFO cache/application.go:424 handle app submission
{"app": "applicationID: spark-8e410a4c5ce44da2aa85ba835257a1e9, queue:
root.spark, partition: default, totalNumOfTasks: 1, currentState: Submitted",
"clusterID": "mycluster"}
2024-06-18T15:15:26.924Z DEBUG scheduler/scheduler.go:96 enqueued event
{"eventType": "*rmevent.RMUpdateApplicationEvent", "event":
{"Request":{"new":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","queueName":"root.spark","partitionName":"[mycluster]default","ugi":{"user":"nobody"},"tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\":
\"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
\"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\":
\"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\",
\"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"}
}]"},"executionTimeoutMilliSeconds":60000,"placeholderAsk":{"resources":{"memory":{"value":4294967296},"pods":{"value":2},"vcore":{"value":2000}}},"gangSchedulingStyle":"Hard"}],"rmID":"mycluster"}},
"currentQueueSize": 0}
2024-06-18T15:15:26.924Z DEBUG placement/placement.go:145 Executing rule for
placing application {"ruleName": "tag", "application":
"spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:26.924Z DEBUG placement/tag_rule.go:106 Tag rule intermediate
result {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue":
"root.spark"}
2024-06-18T15:15:26.924Z INFO placement/tag_rule.go:115 Tag rule application
placed {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue":
"root.spark"}
2024-06-18T15:15:26.924Z DEBUG placement/placement.go:204 Rule result for
placing application {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
"queueName": "root.spark"}
2024-06-18T15:15:26.925Z INFO scheduler/context.go:549 Added application to
partition {"applicationID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
"partitionName": "[mycluster]default", "requested queue": "root.spark", "placed
queue": "root.spark"}
2024-06-18T15:15:26.925Z DEBUG rmproxy/rmproxy.go:59 enqueue event
{"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
{"RmID":"mycluster","AcceptedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9"}],"RejectedApplications":[],"UpdatedApplications":null},
"currentQueueSize": 0}
2024-06-18T15:15:26.925Z DEBUG callback/scheduler_callback.go:108
UpdateApplication callback received {"UpdateApplicationResponse":
"accepted:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > "}
2024-06-18T15:15:26.925Z DEBUG callback/scheduler_callback.go:114 callback:
response to accepted application {"appID":
"spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:26.925Z INFO callback/scheduler_callback.go:118 Accepting
app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:26.925Z DEBUG cache/application_state.go:508 shim app state
transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
"Submitted", "destination": "Accepted", "event": "AcceptApplication"}
2024-06-18T15:15:27.277Z DEBUG scheduler/scheduler.go:157 inspect outstanding
requests
2024-06-18T15:15:27.925Z DEBUG cache/application.go:516 postAppAccepted on
cached app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "numTaskGroups":
2, "numAllocatedTasks": 0}
2024-06-18T15:15:27.925Z INFO cache/application.go:526 app has taskGroups
defined, trying to reserve resources for gang members {"appID":
"spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:27.925Z DEBUG cache/application_state.go:508 shim app state
transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
"Accepted", "destination": "Reserving", "event": "TryReserve"}
2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to
create placeholder pod {"error": "pods
\"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed quota:
spark-quota: must specify limits.cpu,limits.memory"}
github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542
2024-06-18T15:15:27.933Z INFO cache/placeholder_manager.go:115 start to
clean up app placeholders {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:27.933Z INFO cache/placeholder_manager.go:128 finished
cleaning up app placeholders {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:27.933Z DEBUG cache/application_state.go:508 shim app state
transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
"Reserving", "destination": "Running", "event": "RunApplication"}
2024-06-18T15:15:28.278Z DEBUG scheduler/scheduler.go:157 inspect outstanding
requests
2024-06-18T15:15:28.867Z DEBUG scheduler/partition_manager.go:83 time consumed
for queue cleaner {"duration": "6.41µs"}
2024-06-18T15:15:28.925Z INFO cache/task_state.go:380 Task state transition
{"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
"spark/spark-pi-f4f19b902beac663-driver", "source": "New", "destination":
"Pending", "event": "InitTask"}
2024-06-18T15:15:28.926Z INFO cache/task_state.go:380 Task state transition
{"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
"spark/spark-pi-f4f19b902beac663-driver", "source": "Pending", "destination":
"Scheduling", "event": "SubmitTask"}
2024-06-18T15:15:28.926Z DEBUG cache/task.go:275 scheduling pod {"podName":
"spark-pi-f4f19b902beac663-driver"}
2024-06-18T15:15:28.926Z DEBUG cache/task.go:294 send update request
{"request": "asks:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\"
applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
resourceAsk:<resources:<key:\"memory\" value:<value:1975517184 > >
resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\"
value:<value:1000 > > > maxAllocations:1 tags:<key:\"kubernetes.io/label/app\"
value:\"spark\" > tags:<key:\"kubernetes.io/label/queue\" value:\"root.spark\"
> tags:<key:\"kubernetes.io/label/spark-app-name\" value:\"spark-pi\" >
tags:<key:\"kubernetes.io/label/spark-app-selector\"
value:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" >
tags:<key:\"kubernetes.io/label/spark-role\" value:\"driver\" >
tags:<key:\"kubernetes.io/label/spark-version\" value:\"3.3.2\" >
tags:<key:\"kubernetes.io/meta/namespace\" value:\"spark\" >
tags:<key:\"kubernetes.io/meta/podName\"
value:\"spark-pi-f4f19b902beac663-driver\" > taskGroupName:\"spark-driver\"
Originator:true preemptionPolicy:<allowPreemptSelf:true allowPreemptOther:true
> > rmID:\"mycluster\" "}
2024-06-18T15:15:28.926Z DEBUG scheduler/scheduler.go:96 enqueued event
{"eventType": "*rmevent.RMUpdateAllocationEvent", "event":
{"Request":{"asks":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","partitionName":"[mycluster]default","resourceAsk":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"maxAllocations":1,"tags":{"kubernetes.io/label/app":"spark","kubernetes.io/label/queue":"root.spark","kubernetes.io/label/spark-app-name":"spark-pi","kubernetes.io/label/spark-app-selector":"spark-8e410a4c5ce44da2aa85ba835257a1e9","kubernetes.io/label/spark-role":"driver","kubernetes.io/label/spark-version":"3.3.2","kubernetes.io/meta/namespace":"spark","kubernetes.io/meta/podName":"spark-pi-f4f19b902beac663-driver"},"taskGroupName":"spark-driver","Originator":true,"preemptionPolicy":{"allowPreemptSelf":true,"allowPreemptOther":true}}],"rmID":"mycluster"}},
"currentQueueSize": 0}
2024-06-18T15:15:28.926Z INFO objects/application_state.go:133 Application
state transition {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
"New", "destination": "Accepted", "event": "runApplication"}
2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event
{"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
{"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Accepted","stateTransitionTimestamp":1718723728926391633,"message":"runApplication"}]},
"currentQueueSize": 0}
2024-06-18T15:15:28.926Z INFO objects/application.go:668 ask added
successfully to application {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
"ask": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "placeholder": false,
"pendingDelta": "map[memory:1975517184 pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:108
UpdateApplication callback received {"UpdateApplicationResponse":
"updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
state:\"Accepted\" stateTransitionTimestamp:1718723728926391633
message:\"runApplication\" > "}
2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:137 status
update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
"new status": "Accepted"}
2024-06-18T15:15:28.926Z DEBUG objects/application.go:339 Application state
timer initiated {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "state":
"Starting", "timeout": "5m0s"}
2024-06-18T15:15:28.926Z INFO objects/application_state.go:133 Application
state transition {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source":
"Accepted", "destination": "Starting", "event": "runApplication"}
2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event
{"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
{"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Starting","stateTransitionTimestamp":1718723728926635454,"message":"runApplication"}]},
"currentQueueSize": 0}
2024-06-18T15:15:28.926Z DEBUG ugm/manager.go:63 Increasing resource usage
{"user": "nobody", "queue path": "root.spark", "application":
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue tracker
object for queue {"queue": "root"}
2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
usage {"queue path": "root.spark", "application":
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue tracker
object for queue {"queue": "spark"}
2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
usage {"queue path": "spark", "application":
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z DEBUG ugm/manager.go:257 Group tracker does not
exist. Creating group tracker object and linking the same with application
{"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue path":
"root.spark", "user": "nobody", "group": "nobody"}
2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue tracker
object for queue {"queue": "root"}
2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
usage {"queue path": "root.spark", "application":
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue tracker
object for queue {"queue": "spark"}
2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource
usage {"queue path": "spark", "application":
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184
pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:108
UpdateApplication callback received {"UpdateApplicationResponse":
"updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
state:\"Starting\" stateTransitionTimestamp:1718723728926635454
message:\"runApplication\" > "}
2024-06-18T15:15:28.926Z DEBUG objects/queue.go:1239 allocation found on queue
{"queueName": "root.spark", "appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
"allocation": "applicationID=spark-8e410a4c5ce44da2aa85ba835257a1e9,
uuid=3e42c4e0-0fc3-466c-a524-444b2eec700b,
allocationKey=4fb897dd-3af5-4799-a09d-640b5222ba3a, Node=10.10.10.66,
result=Allocated"}
2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:137 status
update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
"new status": "Starting"}
2024-06-18T15:15:28.926Z INFO scheduler/partition.go:888 scheduler
allocation processed {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9",
"allocationKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "uuid":
"3e42c4e0-0fc3-466c-a524-444b2eec700b", "allocatedResource":
"map[memory:1975517184 pods:1 vcore:1000]", "placeholder": false, "targetNode":
"10.10.10.66"}
2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event
{"eventType": "*rmevent.RMNewAllocationsEvent", "event":
{"RmID":"mycluster","Allocations":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","UUID":"3e42c4e0-0fc3-466c-a524-444b2eec700b","resourcePerAlloc":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"nodeID":"10.10.10.66","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","taskGroupName":"spark-driver"}]},
"currentQueueSize": 0}
2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:48
UpdateAllocation callback received {"UpdateAllocationResponse":
"new:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\"
UUID:\"3e42c4e0-0fc3-466c-a524-444b2eec700b\"
resourcePerAlloc:<resources:<key:\"memory\" value:<value:1975517184 > >
resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\"
value:<value:1000 > > > nodeID:\"10.10.10.66\"
applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\"
taskGroupName:\"spark-driver\" > "}
2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:53 callback:
response to new allocation {"allocationKey":
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "UUID":
"3e42c4e0-0fc3-466c-a524-444b2eec700b", "applicationID":
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "nodeID": "10.10.10.66"}
2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (AssumePod.Pre) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 55, "phases":
{"Pending":3,"Running":53}}
2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:476 Adding assumed
pod to cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "node": "10.10.10.66", "allBound": true}
2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:411 Putting pod in
cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (AssumePod.Post) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
{"Pending":3,"Running":53}}
2024-06-18T15:15:28.926Z DEBUG scheduler/context.go:853 Successfully synced
shim on new allocation. response: no. of allocations: 1
2024-06-18T15:15:28.926Z INFO cache/task_state.go:380 Task state transition
{"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
"spark/spark-pi-f4f19b902beac663-driver", "source": "Scheduling",
"destination": "Allocated", "event": "TaskAllocated"}
2024-06-18T15:15:28.926Z DEBUG cache/task.go:349 bind pod volumes {"podName":
"spark-pi-f4f19b902beac663-driver", "podUID":
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:28.926Z INFO cache/context.go:499 Binding Pod Volumes
skipped: all volumes already bound {"podName":
"spark-pi-f4f19b902beac663-driver"}
2024-06-18T15:15:28.926Z DEBUG cache/task.go:362 bind pod {"podName":
"spark-pi-f4f19b902beac663-driver", "podUID":
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:28.926Z INFO client/kubeclient.go:112 bind pod to node
{"podName": "spark-pi-f4f19b902beac663-driver", "podUID":
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "nodeID": "10.10.10.66"}
2024-06-18T15:15:28.933Z INFO cache/task.go:375 successfully bound pod
{"podName": "spark-pi-f4f19b902beac663-driver"}
2024-06-18T15:15:28.934Z INFO cache/task_state.go:380 Task state transition
{"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task":
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias":
"spark/spark-pi-f4f19b902beac663-driver", "source": "Allocated", "destination":
"Bound", "event": "TaskBound"}
2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
{"Pending":3,"Running":53}}
2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:411 Putting pod in
cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
{"Pending":3,"Running":53}}
2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
{"Pending":3,"Running":53}}
2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:411 Putting pod in
cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
{"Pending":3,"Running":53}}
2024-06-18T15:15:29.278Z DEBUG scheduler/scheduler.go:157 inspect outstanding
requests
2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
{"Pending":3,"Running":53}}
2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:411 Putting pod in
cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
{"Pending":3,"Running":53}}
2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
{"Pending":3,"Running":53}}
2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:411 Putting pod in
cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey":
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
{"Pending":2,"Running":54}}
2024-06-18T15:15:30.279Z DEBUG scheduler/scheduler.go:157 inspect outstanding
requests
2024-06-18T15:15:31.279Z DEBUG scheduler/scheduler.go:157 inspect outstanding
requests
2024-06-18T15:15:32.280Z DEBUG scheduler/scheduler.go:157 inspect outstanding
requests
2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (UpdateNode.Pre) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
{"Pending":2,"Running":54}}
2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:179 Updating node
in cache {"nodeName": "10.10.10.66"}
2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:558 Scheduler cache
state (UpdateNode.Post) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs":
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases":
{"Pending":2,"Running":54}}
2024-06-18T15:15:32.395Z DEBUG cache/node.go:109 set node capacity {"nodeID":
"10.10.10.66", "capacity": "resources:<key:\"ephemeral-storage\"
value:<value:478652923105 > > resources:<key:\"hugepages-1Gi\" value:<> >
resources:<key:\"hugepages-2Mi\" value:<> >
resources:<key:\"kubernetes.io/batch-cpu\" value:<value:5167 > >
resources:<key:\"kubernetes.io/batch-memory\" value:<value:34437364710 > >
resources:<key:\"kubernetes.io/mid-cpu\" value:<value:3156 > >
resources:<key:\"kubernetes.io/mid-memory\" value:<value:1442139746 > >
resources:<key:\"memory\" value:<value:64713940992 > > resources:<key:\"pods\"
value:<value:110 > > resources:<key:\"vcore\" value:<value:14000 > > "}
2024-06-18T15:15:32.396Z INFO cache/nodes.go:179 Node's ready status flag
{"Node name": "10.10.10.66", "ready": true}
2024-06-18T15:15:32.396Z INFO cache/nodes.go:184 report updated nodes to
scheduler {"request":
{"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}}
2024-06-18T15:15:32.396Z DEBUG scheduler/scheduler.go:96 enqueued event
{"eventType": "*rmevent.RMUpdateNodeEvent", "event":
{"Request":{"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true","si/node-partition":"[mycluster]default"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}},
"currentQueueSize": 0}
2024-06-18T15:15:32.396Z INFO objects/queue.go:1190 updating root queue max
resources {"current max": "map[ephemeral-storage:1435958769315 hugepages-1Gi:0
hugepages-2Mi:0 kubernetes.io/batch-cpu:19610
kubernetes.io/batch-memory:109317941106 kubernetes.io/mid-cpu:9574
kubernetes.io/mid-memory:5209251314 memory:194141839360 pods:330 vcore:42000]",
"new max": "map[ephemeral-storage:1435958769315 hugepages-1Gi:0 hugepages-2Mi:0
kubernetes.io/batch-cpu:19309 kubernetes.io/batch-memory:109530403859
kubernetes.io/mid-cpu:10574 kubernetes.io/mid-memory:6651391060
memory:194141839360 pods:330 vcore:42000]"}
2024-06-18T15:15:32.396Z DEBUG objects/node.go:182 skip updating
occupiedResource, not changed {code}
h2. Spark Pod status
kubectl get pod -n spark|grep Running
!image-2024-06-19-00-02-53-178.png!
kubectl describe pod spark-pi-xxxxxxxx-driver -n spark
!image-2024-06-19-00-03-09-703.png!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]