[ https://issues.apache.org/jira/browse/YUNIKORN-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850562#comment-17850562 ]

Wilfred Spiegelenburg commented on YUNIKORN-2645:
-------------------------------------------------

The side effect of that broken node is that every single pod we allocate 
selects that broken node. Based on the node sorting, that node stays the first 
node in the list to try. Every single pod gets placed but then fails to start. 
The node usage does not change, and thus the node does not get pushed back in 
the list of available nodes. Because of that, the scheduler does not make any 
real progress.
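
As a rough illustration of the loop described above, here is a toy model in 
Python. It is a sketch only, not YuniKorn code: the Node class, pick_node and 
schedule are hypothetical names, and it assumes a least-utilised-first sort in 
which a pod that fails to start leaves the node usage unchanged.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: int
    allocated: int = 0
    broken: bool = False  # assumption: pods placed here fail right after placement


def pick_node(nodes):
    # Least-utilised node first; ties broken by name so the result is deterministic.
    return min(nodes, key=lambda n: (n.allocated / n.capacity, n.name))


def schedule(nodes, pod_count):
    """Place pod_count single-unit pods and return the node name chosen for each."""
    choices = []
    for _ in range(pod_count):
        node = pick_node(nodes)
        choices.append(node.name)
        if not node.broken:
            node.allocated += 1  # only pods that actually start count as usage
        # On the broken node the pod fails, so usage stays where it was
        # and the node keeps its place at the front of the sort order.
    return choices


nodes = [Node("broken", capacity=8, broken=True), Node("healthy", capacity=8, allocated=1)]
choices = schedule(nodes, 5)
print(choices)
```

In this model every pod lands on the broken node and the healthy node never 
receives any work.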

I would consider that a hung scheduler, but I do not think there is anything we 
can do about it without some major changes.

A possible solution would be, for instance, to rate limit the number of pods we 
put on a node: never schedule more than 10 pods per second on a node, including 
or ignoring failures, and when that limit is hit we skip the node. That would 
make sure we try a couple of times and then move on to the next node. It could 
cause a slight delay when a cluster is almost full. It will also cause some 
delay in an auto scaling cluster, as the scheduler skips a node while the auto 
scaler does not...
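
The rate limiting idea above could look roughly like the following Python 
sketch. It is not an existing YuniKorn feature or API: NodeRateLimiter and 
place are hypothetical names, the limit of 10 placements per one-second window 
comes from the example in the text, and this variant counts every placement 
whether or not the pod later fails.

```python
from collections import defaultdict, deque


class NodeRateLimiter:
    """Sliding-window cap on placements per node (hypothetical helper)."""

    def __init__(self, max_per_window=10, window_seconds=1.0):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._placements = defaultdict(deque)  # node name -> placement timestamps

    def allow(self, node, now):
        """Record a placement on node at time now, or return False to skip the node."""
        stamps = self._placements[node]
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_per_window:
            return False
        stamps.append(now)
        return True


def place(sorted_nodes, limiter, now):
    """Return the first node in sort order that is under its limit, or None."""
    for node in sorted_nodes:
        if limiter.allow(node, now):
            return node
    return None


limiter = NodeRateLimiter(max_per_window=10, window_seconds=1.0)
order = ["broken", "healthy"]  # the broken node always sorts first
placed = [place(order, limiter, now=0.0) for _ in range(12)]
later = place(order, limiter, now=1.0)
print(placed, later)
```

With the broken node pinned to the front of the list, the first ten placements 
in the window still go to it, the next ones fall through to the healthy node, 
and the broken node is tried again once the window has passed.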

> parent queue exceeds maximum resource
> -------------------------------------
>
>                 Key: YUNIKORN-2645
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2645
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.5.1
>            Reporter: Dmitry
>            Priority: Major
>         Attachments: yunikorn-logs.txt.gz
>
>
> We had a node broken in the cluster - kubernetes was creating pods which were 
> immediately failing with "OutOfGPU" state. The node had 1000+ pods on it.
> The scheduler panicked with the log attached and was not scheduling any other 
> pods.
> The config:
> {code:yaml}
> apiVersion: v1
> data:
>   admissionController.filtering.bypassNamespaces: ^kube-system$,^rook$,^rook-east$,^rook-central$,^rook-pacific$,^rook-south-east$,^rook-system$
>   queues.yaml: |
>     partitions:
>       - name: default
>         placementrules:
>           - name: fixed
>             value: root.scavenging.osg
>             create: true
>             filter:
>               type: allow
>               users:
>               - system:serviceaccount:osg-ligo:prp-htcondor-provisioner
>               - system:serviceaccount:osg-opportunistic:prp-htcondor-provisioner
>               - system:serviceaccount:osg-icecube:prp-htcondor-provisioner
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>                name: tag
>                value: namespace.parentqueue
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>                name: fixed
>                value: general
>         nodesortpolicy:
>           type: fair
>           resourceweights:
>             vcore: 1.0
>             memory: 1.0
>             nvidia.com/gpu: 4.0
>         queues:
>           - name: root
>             submitacl: '*'
>             properties:
>               application.sort.policy: fair
>             queues:
>             - name: system
>               parent: true
>               properties:
>                 preemption.policy: disabled
>             - name: general
>               parent: true
>               childtemplate:
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 100
>                     memory: 1Ti
>                     nvidia.com/gpu: 8
>                   max:
>                     vcore: 4000
>                     memory: 15Ti
>                     nvidia.com/gpu: 200
>             - name: scavenging
>               parent: true
>               childtemplate:
>                 resources:
>                   guaranteed:
>                     vcore: 1
>                     memory: 1G
>                     nvidia.com/gpu: 1
>                 properties:
>                   priority.offset: "-10"
>             - name: interactive
>               parent: true
>               childtemplate:
>                 resources:
>                   guaranteed:
>                     vcore: 1000
>                     memory: 10T
>                     nvidia.com/gpu: 48
>                     nvidia.com/a100: 4
>                 properties:
>                   priority.offset: "10"
>                   preemption.policy: disabled
>             - name: clemson
>               parent: true
>               properties:
>                 application.sort.policy: fair
>               resources:
>                 guaranteed:
>                   vcore: 256
>                   memory: 2T
>                   nvidia.com/gpu: 24
>             - name: nysernet
>               parent: true
>               properties:
>                 application.sort.policy: fair
>               resources:
>                 guaranteed:
>                   vcore: 1000
>                   memory: 5T
>                   nvidia.com/gpu: 16
>             - name: gpn
>               parent: true
>               properties:
>                 application.sort.policy: fair
>               resources:
>                 guaranteed:
>                   vcore: 5000
>                   memory: 50T
>                   nvidia.com/gpu: 256
>                   nvidia.com/a100: 16
>             - name: sdsu
>               parent: true
>               properties:
>                 application.sort.policy: fair
>               resources:
>                 guaranteed:
>                   vcore: 1000
>                   memory: 15T
>                   nvidia.com/gpu: 112
>                   nvidia.com/a100: 64
>               queues:
>               - name: sdsu-jupyterhub
>                 parent: false
>                 properties:
>                   preemption.policy: disabled
>                   priority.offset: "10"
>                 resources:
>                   guaranteed:
>                     vcore: 700
>                     memory: 5T
>                     nvidia.com/gpu: 100
>             - name: tide
>               parent: true
>               properties:
>                 application.sort.policy: fair
>               resources:
>                 guaranteed:
>                   vcore: 592
>                   memory: 15T
>                   nvidia.com/gpu: 72
>               queues:
>               - name: rook-tide
>                 parent: false
>                 properties:
>                   preemption.policy: disabled
>                   priority.offset: "10"
>                 resources:
>                   guaranteed:
>                     vcore: 500
>                     memory: 1T
>             - name: ucsc
>               parent: true
>               properties:
>                 application.sort.policy: fair
>               resources:
>                 guaranteed:
>                   vcore: 500
>                   memory: 4T
>                   nvidia.com/gpu: 256
>             - name: ucsd
>               parent: true
>               properties:
>                 application.sort.policy: fair
>               resources:
>                 guaranteed:
>                   vcore: 40000
>                   memory: 40T
>                   nvidia.com/gpu: 512
>                   nvidia.com/a100: 100
>               queues:
>               - name: ry
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 512
>                     memory: 8T
>                     nvidia.com/gpu: 144
>               - name: suncave
>                 parent: false
>                 properties:
>                   preemption.policy: disabled
>                   priority.offset: "10"
>                 resources:
>                   guaranteed:
>                     vcore: 1000
>                     memory: 1T
>               - name: dimm
>                 parent: false
>                 properties:
>                   preemption.policy: disabled
>                   priority.offset: "1000"
>                 resources:
>                   guaranteed:
>                     vcore: 1000
>                     memory: 1T
>               - name: haosu
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 5000
>                     memory: 10T
>                     nvidia.com/gpu: 120
>                 queues:
>                 - name: rook-haosu
>                   parent: false
>                   properties:
>                     preemption.policy: disabled
>                     priority.offset: "10"
>                   resources:
>                     guaranteed:
>                       vcore: 1000
>                       memory: 1T
> kind: ConfigMap
> metadata:
>   creationTimestamp: "2023-12-21T06:09:12Z"
>   name: yunikorn-configs
>   namespace: yunikorn
>   resourceVersion: "7764804169"
>   uid: 5b9b2c04-57af-4cab-84f8-b5f018952f9c
> {code}
>  


