[
https://issues.apache.org/jira/browse/YUNIKORN-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiaobao Wu updated YUNIKORN-2940:
---------------------------------
Description:
*Environment information*
* namespace resourceQuota: 4 CPU / 4Gi memory
* driver / executor Pod: 1 CPU / 1Gi memory
* driver / executor placeholder (ph) Pod (defined in task-groups): 1 CPU / 1Gi memory
*Issue description*
In the above environment, I submitted a Spark job with the following spark-submit command:
{code:bash}
/opt/spark/bin/spark-submit --master k8s://https://127.0.0.1:6443 --deploy-mode cluster --name spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=spark-my-test \
--class org.apache.spark.examples.SparkPi \
--conf spark.dynamicAllocation.shuffleTracking.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=10 \
--conf spark.dynamicAllocation.minExecutors=10 \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=600m \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=600m \
--conf spark.app.id={{APP_ID}} \
--conf spark.ui.port=14040 \
--conf spark.kubernetes.driver.limit.cores=1 \
--conf spark.kubernetes.executor.limit.cores=1 \
--conf spark.kubernetes.container.image=apache/spark:v3.3.0 \
--conf spark.kubernetes.scheduler.name=yunikorn \
--conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name=spark-driver \
--conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups='[{"name": "spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory": "1Gi"} }, {"name": "spark-executor", "minMember": 10, "minResource": {"cpu": "1", "memory": "1Gi"} }]' \
--conf spark.kubernetes.driver.annotation.yunikorn.apache.org/schedulingPolicyParameters='placeholderTimeoutInSeconds=30 gangSchedulingStyle=Hard' \
--conf spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name=spark-executor \
local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar 10000
{code}
After the job ran, I found that a ph Pod (i.e. tg-spark-96aae620780e4b40a59893-spark-executor-90qrmfkytv in the following picture) still existed on K8s.
!http://www.kdocs.cn/api/v3/office/copy/NjNBZFlyNDdCMXRSZEp0cEdTYVVocE94MkY3OVZzTHNMM2oyWFQ0ZVY2K0x6eE9qNTNDMDFzN3Z3QzA1ZCtrdEdwbU9FNm5xUHI4cTZKTDV6dnYvWHFjZUZlYjJMZS9UYXBrMERSVWYxNkhhd0pycnNkVEtxblh0d212K3dQZ0o0eXB6VWVEanlJbTRnSGlYVG12YUNrbS9tRnZzMkNneU82aGNWZzNIYmNVQmlnbmlVZ0VNS1lJZ0NNQzBSKzYwbGJ5SVd5MXFwSjhZUFllb2Rwc0Q1UCtwMlh4WkljSWxQN2FEczVBODhRdk5pSlVOcVllZVNjaklWVGFhQ0paaC9DZUpXS1hDRldrPQ==/attach/object/4BQON4Y3AAADQ?|width=664!
It is very strange that the ph Pod still exists even though the job has already completed.
*Issue analysis*
Looking at the logs, I found the following key entries:
{code:java}
2024-10-21T20:50:19.868+0800 INFO shim.cache.placeholder cache/placeholder_manager.go:99 placeholder created {"placeholder": "appID: spark-96aae620780e4b40a59893d850e8aad3, taskGroup: spark-executor, podName: spark-my-test/tg-spark-96aae620780e4b40a59893-spark-executor-90qrmfkytv"}
2024-10-21T20:50:19.880+0800 ERROR shim.cache.placeholder cache/placeholder_manager.go:95 failed to create placeholder pod {"error": "pods \"tg-spark-96aae620780e4b40a59893-spark-executor-zkpqzmw308\" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=1,limits.memory=1Gi,requests.cpu=1,requests.memory=1Gi, used: limits.cpu=4,limits.memory=4056Mi,requests.cpu=4,requests.memory=4056Mi, limited: limits.cpu=4,limits.memory=4Gi,requests.cpu=4,requests.memory=4Gi"}
github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
	/opt/src/pkg/cache/placeholder_manager.go:95
github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
	/opt/src/pkg/cache/application.go:537
2024-10-21T20:50:19.880+0800 INFO shim.cache.placeholder cache/placeholder_manager.go:111 start to clean up app placeholders {"appID": "spark-96aae620780e4b40a59893d850e8aad3"}
2024-10-21T20:50:19.973+0800 DEBUG shim.context cache/context.go:1109 AddTask {"appID": "spark-96aae620780e4b40a59893d850e8aad3", "taskID": "08490d01-bb3b-490b-a9b0-b9bd183cccd6"}
2024-10-21T20:50:19.973+0800 INFO shim.context cache/context.go:1131 task added {"appID": "spark-96aae620780e4b40a59893d850e8aad3", "taskID": "08490d01-bb3b-490b-a9b0-b9bd183cccd6", "taskState": "New"}
2024-10-21T20:50:20.058+0800 INFO shim.cache.placeholder cache/placeholder_manager.go:124 finished cleaning up app placeholders {"appID": "spark-96aae620780e4b40a59893d850e8aad3"}
2024-10-21T20:50:20.058+0800 DEBUG shim.fsm cache/application_state.go:500 shim app state transition {"app": "spark-96aae620780e4b40a59893d850e8aad3", "source": "Reserving", "destination": "Running", "event": "UpdateReservation"}
{code}
According to the above log, the sequence of events is as follows:
1. The ph Pod (tg-spark-96aae620780e4b40a59893-spark-executor-90qrmfkytv) is created.
2. The createAppPlaceholders method fails because the namespace resourceQuota is exceeded (the task groups request 11 placeholders at 1 CPU / 1Gi each, which cannot fit within the 4 CPU / 4Gi quota).
3. Placeholder cleanup starts (*start to clean up app placeholders*).
4. A task is added; this task corresponds to tg-spark-96aae620780e4b40a59893-spark-executor-90qrmfkytv.
5. Placeholder cleanup finishes (finished cleaning up app placeholders).
6. The shim app state transitions to Running.
*Stage conclusion*
# When an exception (e.g., a resourceQuota limit) is encountered while creating a ph Pod, the ph task cleanup mechanism is triggered.
# The ph task cleanup mechanism deletes all tasks under the current app.
# If a task is added while the cleanup is in progress, it will not be cleaned up.
# After the shim app state has transitioned to Running, the shim no longer handles the ph task (here "handle" means pushing the InitTask event to the ph task).
In short, in this scenario the ph Pod can stay Pending for a long time; the sketch below illustrates the ordering.
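To make the ordering easier to follow, here is a minimal, self-contained Go sketch of the race between the placeholder cleanup and a late AddTask. All names in it (App, Task, cleanupPlaceholders, addTask) are hypothetical stand-ins used only for illustration; they are not the actual yunikorn-k8shim types or APIs.
{code:go}
package main

import (
	"fmt"
	"sync"
)

// Task stands in for a shim task; Placeholder marks placeholder (ph) tasks.
type Task struct {
	ID          string
	Placeholder bool
	Deleted     bool
}

// App stands in for the shim-side application state.
type App struct {
	mu    sync.Mutex
	state string // "Reserving" or "Running"
	tasks map[string]*Task
}

// cleanupPlaceholders mirrors "start to clean up app placeholders": it only
// works on the set of tasks known at the moment cleanup starts.
func (a *App) cleanupPlaceholders() {
	a.mu.Lock()
	snapshot := make([]*Task, 0, len(a.tasks))
	for _, t := range a.tasks {
		if t.Placeholder {
			snapshot = append(snapshot, t)
		}
	}
	a.mu.Unlock()

	for _, t := range snapshot {
		t.Deleted = true // stands in for deleting the placeholder pod
	}
}

// addTask mirrors the AddTask call seen in the log: a placeholder pod that was
// already created on the API server is registered with the shim as a task.
func (a *App) addTask(t *Task) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.tasks[t.ID] = t
}

func main() {
	app := &App{state: "Reserving", tasks: map[string]*Task{}}

	// 1. One placeholder pod was created before the quota error occurred.
	created := &Task{ID: "tg-spark-96aae620780e4b40a59893-spark-executor-90qrmfkytv", Placeholder: true}

	// 2-3. The quota error triggers cleanup, which only sees the tasks
	// registered so far (none in this sketch).
	app.cleanupPlaceholders()

	// 4. The already-created placeholder is registered only after cleanup has
	// taken its snapshot, so it is never deleted.
	app.addTask(created)

	// 5-6. Cleanup finishes and the app moves to Running; nothing will ever
	// initialise or delete the late task, so its pod stays Pending.
	app.state = "Running"

	fmt.Printf("task %s deleted=%v, app state=%s\n", created.ID, created.Deleted, app.state)
}
{code}
Running the sketch prints deleted=false with the app already in Running, which matches the observed outcome: a placeholder registered after the cleanup snapshot is never initialised or deleted, so its pod stays Pending.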
*Follow-up improvement ideas*
1. Could we clean up ph tasks after the core app has completed?
2. Could we clean up ph tasks after the shim app transitions to Running?
3. Once the shim app state has transitioned to Running, stop accepting addTask requests for ph Pods, and report the rejection back to the submitter via logs or events (a rough sketch of this idea follows).
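As a concrete illustration of idea 3, here is a rough Go sketch, again using hypothetical names rather than the real shim API, of rejecting a placeholder addTask that arrives after the app has reached Running and surfacing the reason to the submitter:
{code:go}
package main

import (
	"errors"
	"fmt"
)

// AppState is a hypothetical stand-in for the shim application states.
type AppState string

const (
	Reserving AppState = "Reserving"
	Running   AppState = "Running"
)

// TaskRequest is a hypothetical AddTask request; Placeholder marks ph Pods.
type TaskRequest struct {
	TaskID      string
	Placeholder bool
}

type ShimApp struct {
	State AppState
	Tasks map[string]TaskRequest
}

var errLatePlaceholder = errors.New("placeholder task rejected: app is already Running")

// AddTask accepts normal tasks in any state, but refuses a placeholder task
// that arrives after the app has reached Running, so its pod can be released
// immediately instead of staying Pending.
func (a *ShimApp) AddTask(req TaskRequest) error {
	if req.Placeholder && a.State == Running {
		// A real implementation would also delete the placeholder pod here and
		// publish a log line / Kubernetes event so the submitter sees why.
		return errLatePlaceholder
	}
	a.Tasks[req.TaskID] = req
	return nil
}

func main() {
	app := &ShimApp{State: Running, Tasks: map[string]TaskRequest{}}
	err := app.AddTask(TaskRequest{
		TaskID:      "tg-spark-96aae620780e4b40a59893-spark-executor-90qrmfkytv",
		Placeholder: true,
	})
	fmt.Println(err) // placeholder task rejected: app is already Running
}
{code}
Whether the rejection should also delete the pod immediately or only emit an event is an open design question; the sketch only shows the state check itself.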
> ph Pod is in a pending state for a long time
> --------------------------------------------
>
> Key: YUNIKORN-2940
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2940
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.5.2
> Reporter: Xiaobao Wu
> Priority: Critical
>