[
https://issues.apache.org/jira/browse/YUNIKORN-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei Huang updated YUNIKORN-1706:
--------------------------------
Description:
I'm running a local dev environment via *make run_plugin* based on 1.2.0; no admission
controller is configured. Additionally, I created a ConfigMap in the default
namespace:
{code:yaml}
apiVersion: v1
data:
  queues.yaml: |
    partitions:
      - name: default
        nodesortpolicy:
          type: binpacking
        queues:
          - name: root
            submitacl: '*'
            queues:
              - name: app1
                submitacl: '*'
                properties:
                  application.sort.policy: fifo
                resources:
                  max:
                    {memory: 200G, vcore: 1000}
kind: ConfigMap
metadata:
  name: yunikorn-configs
{code}
Then I created a Pod with the following config:
{code:yaml}
kind: Pod
apiVersion: v1
metadata:
  name: pod-1
  labels:
    applicationId: "app1"
spec:
  schedulerName: yunikorn
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        requests:
          cpu: 1
{code}
The pod cannot be scheduled and ends up with status {*}ApplicationRejected{*}, and I
observed the following log in the shim:
{code:bash}
2023-04-21T16:34:42.354-0700 INFO cache/context.go:741 app added
{"appID": "app1"}
2023-04-21T16:34:42.354-0700 INFO cache/context.go:831 task added
{"appID": "app1", "taskID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0",
"taskState": "New"}
2023-04-21T16:34:42.355-0700 INFO cache/context.go:841 app request
originating pod added {"appID": "app1", "original task":
"d643a5ad-c93b-4d99-8eac-9418fbac18b0"}
I0421 16:34:42.355111 46423 factory.go:344] "Unable to schedule pod; no fit;
waiting" pod="default/pod-1" err="0/1 nodes are available: 1 Pod is not ready
for scheduling."
2023-04-21T16:34:42.689-0700 INFO cache/application.go:413 handle
app submission {"app": "applicationID: app1, queue: root.sandbox, partition:
default, totalNumOfTasks: 1, currentState: Submitted", "clusterID": "mycluster"}
2023-04-21T16:34:42.692-0700 INFO objects/application_state.go:132
Application state transition {"appID": "app1", "source": "New",
"destination": "Rejected", "event": "rejectApplication"}
2023-04-21T16:34:42.692-0700 ERROR scheduler/context.go:540 Failed
to add application to partition (placement rejected) {"applicationID":
"app1", "partitionName": "[mycluster]default", "error": "application 'app1'
rejected, cannot create queue 'root.sandbox' without placement rules"}
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateApplicationEvent
/Users/weih/go/src/github.pie.apple.com/apache/yunikorn-k8shim/vendor/github.com/apache/yunikorn-core/pkg/scheduler/context.go:540
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent
/Users/weih/go/src/github.pie.apple.com/apache/yunikorn-k8shim/vendor/github.com/apache/yunikorn-core/pkg/scheduler/scheduler.go:113
2023-04-21T16:34:42.693-0700 INFO cache/application.go:565 app is
rejected by scheduler {"appID": "app1"}
2023-04-21T16:34:42.693-0700 INFO cache/application.go:598
failApplication reason {"applicationID": "app1", "errMsg":
"ApplicationRejected: application 'app1' rejected, cannot create queue
'root.sandbox' without placement rules"}
2023-04-21T16:34:42.694-0700 INFO cache/application.go:585 setting
pod to failed {"podName": "pod-1"}
2023-04-21T16:34:42.712-0700 INFO general/general.go:179 task completes
{"appType": "general", "namespace": "default", "podName": "pod-1", "podUID":
"d643a5ad-c93b-4d99-8eac-9418fbac18b0", "podStatus": "Failed"}
2023-04-21T16:34:42.714-0700 INFO client/kubeclient.go:246
Successfully updated pod status {"namespace": "default", "podName": "pod-1",
"newStatus": "&PodStatus{Phase:Failed,Conditions:[]PodCondition{},Message:
application 'app1' rejected, cannot create queue 'root.sandbox' without
placement
rules,Reason:ApplicationRejected,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{},QOSClass:,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},}"}
2023-04-21T16:34:42.714-0700 INFO cache/application.go:590 new pod
status {"status": "Failed"}
2023-04-21T16:34:42.714-0700 INFO cache/task.go:543 releasing
allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 0}
2023-04-21T16:34:42.714-0700 INFO cache/placeholder_manager.go:115
start to clean up app placeholders {"appID": "app1"}
2023-04-21T16:34:42.714-0700 INFO cache/placeholder_manager.go:128
finished cleaning up app placeholders {"appID": "app1"}
2023-04-21T16:34:42.714-0700 INFO scheduler/partition.go:1343 Invalid
ask release requested by shim {"appID": "app1", "ask":
"d643a5ad-c93b-4d99-8eac-9418fbac18b0", "terminationType":
"UNKNOWN_TERMINATION_TYPE"}
2023-04-21T16:34:42.714-0700 INFO cache/task_state.go:372 object
transition {"object": {}, "source": "New", "destination": "Completed",
"event": "CompleteTask"}
{code}
Then I deleted the pod and noticed the log showed:
{code:bash}
2023-04-21T16:35:09.598-0700 INFO general/general.go:213 delete pod
{"appType": "general", "namespace": "default", "podName": "pod-1", "podUID":
"d643a5ad-c93b-4d99-8eac-9418fbac18b0"}
2023-04-21T16:35:09.598-0700 WARN cache/task.go:528 task allocation
UUID is empty, sending this release request to yunikorn-core could cause all
allocations of this app get released. skip this request, this may cause some
resource leak. check the logs for more info! {"applicationID": "app1",
"taskID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "taskAlias": "default/pod-1",
"allocationUUID": "", "task": "Completed"}
{code}
Then I recreated the same pod, adding only the queue label:
{code:bash}
queue: root.app1
{code}
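For reference, the metadata of the recreated pod then looked roughly like this (a sketch; only the added queue label differs from the original manifest):
{code:yaml}
metadata:
  name: pod-1
  labels:
    applicationId: "app1"
    queue: "root.app1"
{code}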
The pod is still unschedulable and remains in that status forever. The only way
to make it schedulable again is to restart the shim.
Is this a bug?
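As a side note on the rejection itself: the error says queue 'root.sandbox' cannot be created without placement rules. A possible workaround, assuming the standard YuniKorn placement-rule syntax in queues.yaml, would be to add a rule such as 'provided' with create enabled, so the core can create the target queue for apps that specify one. This is a sketch, not a confirmed fix for the stuck-pod symptom:
{code:yaml}
partitions:
  - name: default
    placementrules:
      # 'provided' places the app into the queue named by the pod;
      # create: true lets the core create that queue if it is missing
      - name: provided
        create: true
    queues:
      - name: root
        submitacl: '*'
{code}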
> weird symptom when scheduling pod without specifying 'queue' label
> ------------------------------------------------------------------
>
> Key: YUNIKORN-1706
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1706
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Wei Huang
> Priority: Major
>
> I'm running a local dev env *make run_plugin* based on 1.2.0, no admission
> controller is configured. Additionally, I configured a configmap in the
> default namespace:
> {code:bash}
> apiVersion: v1
> data:
> queues.yaml: |
> partitions:
> - name: default
> nodesortpolicy:
> type: binpacking
> queues:
> - name: root
> submitacl: '*'
> queues:
> - name: app1
> submitacl: '*'
> properties:
> application.sort.policy: fifo
> resources:
> max:
> {memory: 200G, vcore: 1000}
> kind: ConfigMap
> metadata:
> name: yunikorn-configs
> {code}
> Then I create a Pod with the following config:
> {code:bash}
> kind: Pod
> apiVersion: v1
> metadata:
> name: pod-1
> labels:
> applicationId: "app1"
> spec:
> schedulerName: yunikorn
> containers:
> - name: pause
> image: registry.k8s.io/pause:3.6
> resources:
> requests:
> cpu: 1
> {code}
> The pod cannot be scheduled with a status {*}ApplicationRejected{*}, and I
> observed log in the shim as:
> {code:bash}
> 2023-04-21T16:34:42.354-0700 INFO cache/context.go:741 app added
> {"appID": "app1"}
> 2023-04-21T16:34:42.354-0700 INFO cache/context.go:831 task added
> {"appID": "app1", "taskID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0",
> "taskState": "New"}
> 2023-04-21T16:34:42.355-0700 INFO cache/context.go:841 app request
> originating pod added {"appID": "app1", "original task":
> "d643a5ad-c93b-4d99-8eac-9418fbac18b0"}
> I0421 16:34:42.355111 46423 factory.go:344] "Unable to schedule pod; no
> fit; waiting" pod="default/pod-1" err="0/1 nodes are available: 1 Pod is not
> ready for scheduling."
> 2023-04-21T16:34:42.689-0700 INFO cache/application.go:413 handle
> app submission {"app": "applicationID: app1, queue: root.sandbox,
> partition: default, totalNumOfTasks: 1, currentState: Submitted",
> "clusterID": "mycluster"}
> 2023-04-21T16:34:42.692-0700 INFO objects/application_state.go:132
> Application state transition {"appID": "app1", "source": "New",
> "destination": "Rejected", "event": "rejectApplication"}
> 2023-04-21T16:34:42.692-0700 ERROR scheduler/context.go:540 Failed
> to add application to partition (placement rejected) {"applicationID":
> "app1", "partitionName": "[mycluster]default", "error": "application 'app1'
> rejected, cannot create queue 'root.sandbox' without placement rules"}
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateApplicationEvent
>
> /Users/weih/go/src/github.pie.apple.com/apache/yunikorn-k8shim/vendor/github.com/apache/yunikorn-core/pkg/scheduler/context.go:540
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent
>
> /Users/weih/go/src/github.pie.apple.com/apache/yunikorn-k8shim/vendor/github.com/apache/yunikorn-core/pkg/scheduler/scheduler.go:113
> 2023-04-21T16:34:42.693-0700 INFO cache/application.go:565 app is
> rejected by scheduler {"appID": "app1"}
> 2023-04-21T16:34:42.693-0700 INFO cache/application.go:598
> failApplication reason {"applicationID": "app1", "errMsg":
> "ApplicationRejected: application 'app1' rejected, cannot create queue
> 'root.sandbox' without placement rules"}
> 2023-04-21T16:34:42.694-0700 INFO cache/application.go:585 setting
> pod to failed {"podName": "pod-1"}
> 2023-04-21T16:34:42.712-0700 INFO general/general.go:179 task completes
> {"appType": "general", "namespace": "default", "podName": "pod-1", "podUID":
> "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "podStatus": "Failed"}
> 2023-04-21T16:34:42.714-0700 INFO client/kubeclient.go:246
> Successfully updated pod status {"namespace": "default", "podName": "pod-1",
> "newStatus": "&PodStatus{Phase:Failed,Conditions:[]PodCondition{},Message:
> application 'app1' rejected, cannot create queue 'root.sandbox' without
> placement
> rules,Reason:ApplicationRejected,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{},QOSClass:,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},}"}
> 2023-04-21T16:34:42.714-0700 INFO cache/application.go:590 new pod
> status {"status": "Failed"}
> 2023-04-21T16:34:42.714-0700 INFO cache/task.go:543 releasing
> allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 0}
> 2023-04-21T16:34:42.714-0700 INFO cache/placeholder_manager.go:115
> start to clean up app placeholders {"appID": "app1"}
> 2023-04-21T16:34:42.714-0700 INFO cache/placeholder_manager.go:128
> finished cleaning up app placeholders {"appID": "app1"}
> 2023-04-21T16:34:42.714-0700 INFO scheduler/partition.go:1343 Invalid
> ask release requested by shim {"appID": "app1", "ask":
> "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "terminationType":
> "UNKNOWN_TERMINATION_TYPE"}
> 2023-04-21T16:34:42.714-0700 INFO cache/task_state.go:372 object
> transition {"object": {}, "source": "New", "destination": "Completed",
> "event": "CompleteTask"}
> {code}
> Then I deleted the pod, and noticed the log shows:
> {code:bash}
> 2023-04-21T16:35:09.598-0700 INFO general/general.go:213 delete pod
> {"appType": "general", "namespace": "default", "podName": "pod-1", "podUID":
> "d643a5ad-c93b-4d99-8eac-9418fbac18b0"}
> 2023-04-21T16:35:09.598-0700 WARN cache/task.go:528 task allocation
> UUID is empty, sending this release request to yunikorn-core could cause all
> allocations of this app get released. skip this request, this may cause some
> resource leak. check the logs for more info! {"applicationID": "app1",
> "taskID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "taskAlias":
> "default/pod-1", "allocationUUID": "", "task": "Completed"}
> {code}
> Then if I recreated the same pod by just appending the queue label:
> {code:bash}
> queue: root.app1
> {code}
> The pod is still unschedulable and remains the status forever. And the only
> solution to make it schedulable is to restart shim.
> Is it a bug?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)