[
https://issues.apache.org/jira/browse/YUNIKORN-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei Huang updated YUNIKORN-1706:
--------------------------------
Description:
I'm running a local dev environment via *make run_plugin* based on 1.2.0; no admission
controller is configured. Additionally, I created a ConfigMap in the default
namespace:
{code:yaml}
apiVersion: v1
data:
  queues.yaml: |
    partitions:
      - name: default
        nodesortpolicy:
          type: binpacking
        queues:
          - name: root
            submitacl: '*'
            queues:
              - name: app1
                submitacl: '*'
                properties:
                  application.sort.policy: fifo
                resources:
                  max:
                    {memory: 200G, vcore: 1000}
kind: ConfigMap
metadata:
  name: yunikorn-configs
{code}
Then I created a Pod with the following config:
{code:yaml}
kind: Pod
apiVersion: v1
metadata:
  name: pod-1
  labels:
    applicationId: "app1"
spec:
  schedulerName: yunikorn
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        requests:
          cpu: 1
{code}
The pod cannot be scheduled and ends up with status {*}ApplicationRejected{*}, and I
observed the following log in the shim:
{code:bash}
2023-04-21T16:34:42.354-0700 INFO cache/context.go:741 app added
{"appID": "app1"}
2023-04-21T16:34:42.354-0700 INFO cache/context.go:831 task added
{"appID": "app1", "taskID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0",
"taskState": "New"}
2023-04-21T16:34:42.355-0700 INFO cache/context.go:841 app request
originating pod added {"appID": "app1", "original task":
"d643a5ad-c93b-4d99-8eac-9418fbac18b0"}
I0421 16:34:42.355111 46423 factory.go:344] "Unable to schedule pod; no fit;
waiting" pod="default/pod-1" err="0/1 nodes are available: 1 Pod is not ready
for scheduling."
2023-04-21T16:34:42.689-0700 INFO cache/application.go:413 handle
app submission {"app": "applicationID: app1, queue: root.sandbox, partition:
default, totalNumOfTasks: 1, currentState: Submitted", "clusterID": "mycluster"}
2023-04-21T16:34:42.692-0700 INFO objects/application_state.go:132
Application state transition {"appID": "app1", "source": "New",
"destination": "Rejected", "event": "rejectApplication"}
2023-04-21T16:34:42.692-0700 ERROR scheduler/context.go:540 Failed
to add application to partition (placement rejected) {"applicationID":
"app1", "partitionName": "[mycluster]default", "error": "application 'app1'
rejected, cannot create queue 'root.sandbox' without placement rules"}
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateApplicationEvent
/Users/weih/go/src/github.pie.apple.com/apache/yunikorn-k8shim/vendor/github.com/apache/yunikorn-core/pkg/scheduler/context.go:540
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent
/Users/weih/go/src/github.pie.apple.com/apache/yunikorn-k8shim/vendor/github.com/apache/yunikorn-core/pkg/scheduler/scheduler.go:113
2023-04-21T16:34:42.693-0700 INFO cache/application.go:565 app is
rejected by scheduler {"appID": "app1"}
2023-04-21T16:34:42.693-0700 INFO cache/application.go:598
failApplication reason {"applicationID": "app1", "errMsg":
"ApplicationRejected: application 'app1' rejected, cannot create queue
'root.sandbox' without placement rules"}
2023-04-21T16:34:42.694-0700 INFO cache/application.go:585 setting
pod to failed {"podName": "pod-1"}
2023-04-21T16:34:42.712-0700 INFO general/general.go:179 task completes
{"appType": "general", "namespace": "default", "podName": "pod-1", "podUID":
"d643a5ad-c93b-4d99-8eac-9418fbac18b0", "podStatus": "Failed"}
2023-04-21T16:34:42.714-0700 INFO client/kubeclient.go:246
Successfully updated pod status {"namespace": "default", "podName": "pod-1",
"newStatus": "&PodStatus{Phase:Failed,Conditions:[]PodCondition{},Message:
application 'app1' rejected, cannot create queue 'root.sandbox' without
placement
rules,Reason:ApplicationRejected,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{},QOSClass:,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},}"}
2023-04-21T16:34:42.714-0700 INFO cache/application.go:590 new pod
status {"status": "Failed"}
2023-04-21T16:34:42.714-0700 INFO cache/task.go:543 releasing
allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 0}
2023-04-21T16:34:42.714-0700 INFO cache/placeholder_manager.go:115
start to clean up app placeholders {"appID": "app1"}
2023-04-21T16:34:42.714-0700 INFO cache/placeholder_manager.go:128
finished cleaning up app placeholders {"appID": "app1"}
2023-04-21T16:34:42.714-0700 INFO scheduler/partition.go:1343 Invalid
ask release requested by shim {"appID": "app1", "ask":
"d643a5ad-c93b-4d99-8eac-9418fbac18b0", "terminationType":
"UNKNOWN_TERMINATION_TYPE"}
2023-04-21T16:34:42.714-0700 INFO cache/task_state.go:372 object
transition {"object": {}, "source": "New", "destination": "Completed",
"event": "CompleteTask"}
{code}
Then I deleted the pod and noticed the log showed:
{code:bash}
2023-04-21T16:35:09.598-0700 INFO general/general.go:213 delete pod
{"appType": "general", "namespace": "default", "podName": "pod-1", "podUID":
"d643a5ad-c93b-4d99-8eac-9418fbac18b0"}
2023-04-21T16:35:09.598-0700 WARN cache/task.go:528 task allocation
UUID is empty, sending this release request to yunikorn-core could cause all
allocations of this app get released. skip this request, this may cause some
resource leak. check the logs for more info! {"applicationID": "app1",
"taskID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "taskAlias": "default/pod-1",
"allocationUUID": "", "task": "Completed"}
{code}
Then I recreated the same pod, adding only the queue label:
{code:bash}
queue: root.app1
{code}
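For reference, the metadata of the recreated pod then looked roughly like this (a sketch; only the added queue label differs from the original manifest):
{code:yaml}
metadata:
  name: pod-1
  labels:
    applicationId: "app1"
    queue: "root.app1"
{code}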
The pod is still unschedulable and remains in that status forever. The only way
to make it schedulable again is to restart the shim.
Is this a bug?
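As a side note on the rejection itself: the error says queue 'root.sandbox' cannot be created without placement rules. A possible workaround, assuming the standard YuniKorn placement-rule syntax in queues.yaml, would be to add a rule such as 'provided' with create enabled, so the core can create the target queue for apps that specify one. This is a sketch, not a confirmed fix for the stuck-pod symptom:
{code:yaml}
partitions:
  - name: default
    placementrules:
      # 'provided' places the app into the queue named by the pod;
      # create: true lets the core create that queue if it is missing
      - name: provided
        create: true
    queues:
      - name: root
        submitacl: '*'
{code}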
> weird symptom when scheduling pod without specifying 'queue' label
> ------------------------------------------------------------------
>
> Key: YUNIKORN-1706
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1706
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Wei Huang
> Priority: Major
>
> I'm running a local dev env *make run_plugin* based on 1.2.0, no admission
> controller is configured. Additionally, I configured a configmap in the
> default namespace:
> {code:bash}
> apiVersion: v1
> data:
> queues.yaml: |
> partitions:
> - name: default
> nodesortpolicy:
> type: binpacking
> queues:
> - name: root
> submitacl: '*'
> queues:
> - name: app1
> submitacl: '*'
> properties:
> application.sort.policy: fifo
> resources:
> max:
> {memory: 200G, vcore: 1000}
> kind: ConfigMap
> metadata:
> name: yunikorn-configs
> {code}
> Then I create a Pod with the following config:
> {code:bash}
> kind: Pod
> apiVersion: v1
> metadata:
> name: pod-1
> labels:
> applicationId: "app1"
> spec:
> schedulerName: yunikorn
> containers:
> - name: pause
> image: registry.k8s.io/pause:3.6
> resources:
> requests:
> cpu: 1
> {code}
> The pod cannot be scheduled with a status {*}ApplicationRejected{*}, and I
> observed log in the shim as:
> {code:bash}
> 2023-04-21T16:34:42.354-0700 INFO cache/context.go:741 app added
> {"appID": "app1"}
> 2023-04-21T16:34:42.354-0700 INFO cache/context.go:831 task added
> {"appID": "app1", "taskID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0",
> "taskState": "New"}
> 2023-04-21T16:34:42.355-0700 INFO cache/context.go:841 app request
> originating pod added {"appID": "app1", "original task":
> "d643a5ad-c93b-4d99-8eac-9418fbac18b0"}
> I0421 16:34:42.355111 46423 factory.go:344] "Unable to schedule pod; no
> fit; waiting" pod="default/pod-1" err="0/1 nodes are available: 1 Pod is not
> ready for scheduling."
> 2023-04-21T16:34:42.689-0700 INFO cache/application.go:413 handle
> app submission {"app": "applicationID: app1, queue: root.sandbox,
> partition: default, totalNumOfTasks: 1, currentState: Submitted",
> "clusterID": "mycluster"}
> 2023-04-21T16:34:42.692-0700 INFO objects/application_state.go:132
> Application state transition {"appID": "app1", "source": "New",
> "destination": "Rejected", "event": "rejectApplication"}
> 2023-04-21T16:34:42.692-0700 ERROR scheduler/context.go:540 Failed
> to add application to partition (placement rejected) {"applicationID":
> "app1", "partitionName": "[mycluster]default", "error": "application 'app1'
> rejected, cannot create queue 'root.sandbox' without placement rules"}
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateApplicationEvent
>
> /Users/weih/go/src/github.pie.apple.com/apache/yunikorn-k8shim/vendor/github.com/apache/yunikorn-core/pkg/scheduler/context.go:540
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent
>
> /Users/weih/go/src/github.pie.apple.com/apache/yunikorn-k8shim/vendor/github.com/apache/yunikorn-core/pkg/scheduler/scheduler.go:113
> 2023-04-21T16:34:42.693-0700 INFO cache/application.go:565 app is
> rejected by scheduler {"appID": "app1"}
> 2023-04-21T16:34:42.693-0700 INFO cache/application.go:598
> failApplication reason {"applicationID": "app1", "errMsg":
> "ApplicationRejected: application 'app1' rejected, cannot create queue
> 'root.sandbox' without placement rules"}
> 2023-04-21T16:34:42.694-0700 INFO cache/application.go:585 setting
> pod to failed {"podName": "pod-1"}
> 2023-04-21T16:34:42.712-0700 INFO general/general.go:179 task completes
> {"appType": "general", "namespace": "default", "podName": "pod-1", "podUID":
> "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "podStatus": "Failed"}
> 2023-04-21T16:34:42.714-0700 INFO client/kubeclient.go:246
> Successfully updated pod status {"namespace": "default", "podName": "pod-1",
> "newStatus": "&PodStatus{Phase:Failed,Conditions:[]PodCondition{},Message:
> application 'app1' rejected, cannot create queue 'root.sandbox' without
> placement
> rules,Reason:ApplicationRejected,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{},QOSClass:,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},}"}
> 2023-04-21T16:34:42.714-0700 INFO cache/application.go:590 new pod
> status {"status": "Failed"}
> 2023-04-21T16:34:42.714-0700 INFO cache/task.go:543 releasing
> allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 0}
> 2023-04-21T16:34:42.714-0700 INFO cache/placeholder_manager.go:115
> start to clean up app placeholders {"appID": "app1"}
> 2023-04-21T16:34:42.714-0700 INFO cache/placeholder_manager.go:128
> finished cleaning up app placeholders {"appID": "app1"}
> 2023-04-21T16:34:42.714-0700 INFO scheduler/partition.go:1343 Invalid
> ask release requested by shim {"appID": "app1", "ask":
> "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "terminationType":
> "UNKNOWN_TERMINATION_TYPE"}
> 2023-04-21T16:34:42.714-0700 INFO cache/task_state.go:372 object
> transition {"object": {}, "source": "New", "destination": "Completed",
> "event": "CompleteTask"}
> {code}
> Then I deleted the pod, and noticed the log shows:
> {code:bash}
> 2023-04-21T16:35:09.598-0700 INFO general/general.go:213 delete pod
> {"appType": "general", "namespace": "default", "podName": "pod-1", "podUID":
> "d643a5ad-c93b-4d99-8eac-9418fbac18b0"}
> 2023-04-21T16:35:09.598-0700 WARN cache/task.go:528 task allocation
> UUID is empty, sending this release request to yunikorn-core could cause all
> allocations of this app get released. skip this request, this may cause some
> resource leak. check the logs for more info! {"applicationID": "app1",
> "taskID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "taskAlias":
> "default/pod-1", "allocationUUID": "", "task": "Completed"}
> {code}
> Then if I recreated the same pod by just appending the queue label:
> {code:bash}
> queue: root.app1
> {code}
> The pod is still unschedulable and remains the status forever. And the only
> solution to make it schedulable is to restart shim.
> Is it a bug?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)