Dmitry created YUNIKORN-2645:
--------------------------------
Summary: parent queue exceeds maximum resource
Key: YUNIKORN-2645
URL: https://issues.apache.org/jira/browse/YUNIKORN-2645
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Affects Versions: 1.5.1
Reporter: Dmitry
Attachments: yunikorn-logs.txt.gz
We had a broken node in the cluster: Kubernetes kept creating pods on it, and
they immediately failed with the "OutOfGPU" status. The node had 1000+ pods on
it. The scheduler panicked (log attached) and stopped scheduling any other
pods.
The config:
```
apiVersion: v1
data:
admissionController.filtering.bypassNamespaces:
^kube-system$,^rook$,^rook-east$,^rook-central$,^rook-pacific$,^rook-south-east$,^rook-system$
queues.yaml: |
partitions:
- name: default
placementrules:
- name: fixed
value: root.scavenging.osg
create: true
filter:
type: allow
users:
- system:serviceaccount:osg-ligo:prp-htcondor-provisioner
- system:serviceaccount:osg-opportunistic:prp-htcondor-provisioner
- system:serviceaccount:osg-icecube:prp-htcondor-provisioner
- name: tag
value: namespace
create: true
parent:
name: tag
value: namespace.parentqueue
- name: tag
value: namespace
create: true
parent:
name: fixed
value: general
nodesortpolicy:
type: fair
resourceweights:
vcore: 1.0
memory: 1.0
nvidia.com/gpu: 4.0
queues:
- name: root
submitacl: '*'
properties:
application.sort.policy: fair
queues:
- name: system
parent: true
properties:
preemption.policy: disabled
- name: general
parent: true
childtemplate:
properties:
application.sort.policy: fair
resources:
guaranteed:
vcore: 100
memory: 1Ti
nvidia.com/gpu: 8
max:
vcore: 4000
memory: 15Ti
nvidia.com/gpu: 200
- name: scavenging
parent: true
childtemplate:
resources:
guaranteed:
vcore: 1
memory: 1G
nvidia.com/gpu: 1
properties:
priority.offset: "-10"
- name: interactive
parent: true
childtemplate:
resources:
guaranteed:
vcore: 1000
memory: 10T
nvidia.com/gpu: 48
nvidia.com/a100: 4
properties:
priority.offset: "10"
preemption.policy: disabled
- name: clemson
parent: true
properties:
application.sort.policy: fair
resources:
guaranteed:
vcore: 256
memory: 2T
nvidia.com/gpu: 24
- name: nysernet
parent: true
properties:
application.sort.policy: fair
resources:
guaranteed:
vcore: 1000
memory: 5T
nvidia.com/gpu: 16
- name: gpn
parent: true
properties:
application.sort.policy: fair
resources:
guaranteed:
vcore: 5000
memory: 50T
nvidia.com/gpu: 256
nvidia.com/a100: 16
- name: sdsu
parent: true
properties:
application.sort.policy: fair
resources:
guaranteed:
vcore: 1000
memory: 15T
nvidia.com/gpu: 112
nvidia.com/a100: 64
queues:
- name: sdsu-jupyterhub
parent: false
properties:
preemption.policy: disabled
priority.offset: "10"
resources:
guaranteed:
vcore: 700
memory: 5T
nvidia.com/gpu: 100
- name: tide
parent: true
properties:
application.sort.policy: fair
resources:
guaranteed:
vcore: 592
memory: 15T
nvidia.com/gpu: 72
queues:
- name: rook-tide
parent: false
properties:
preemption.policy: disabled
priority.offset: "10"
resources:
guaranteed:
vcore: 500
memory: 1T
- name: ucsc
parent: true
properties:
application.sort.policy: fair
resources:
guaranteed:
vcore: 500
memory: 4T
nvidia.com/gpu: 256
- name: ucsd
parent: true
properties:
application.sort.policy: fair
resources:
guaranteed:
vcore: 40000
memory: 40T
nvidia.com/gpu: 512
nvidia.com/a100: 100
queues:
- name: ry
parent: true
properties:
application.sort.policy: fair
resources:
guaranteed:
vcore: 512
memory: 8T
nvidia.com/gpu: 144
- name: suncave
parent: false
properties:
preemption.policy: disabled
priority.offset: "10"
resources:
guaranteed:
vcore: 1000
memory: 1T
- name: dimm
parent: false
properties:
preemption.policy: disabled
priority.offset: "1000"
resources:
guaranteed:
vcore: 1000
memory: 1T
- name: haosu
parent: true
properties:
application.sort.policy: fair
resources:
guaranteed:
vcore: 5000
memory: 10T
nvidia.com/gpu: 120
queues:
- name: rook-haosu
parent: false
properties:
preemption.policy: disabled
priority.offset: "10"
resources:
guaranteed:
vcore: 1000
memory: 1T
kind: ConfigMap
metadata:
creationTimestamp: "2023-12-21T06:09:12Z"
name: yunikorn-configs
namespace: yunikorn
resourceVersion: "7764804169"
uid: 5b9b2c04-57af-4cab-84f8-b5f018952f9c
```
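For readers unfamiliar with the invariant named in the summary, here is a minimal sketch of what "exceeds maximum resource" means for a queue. This is not YuniKorn's implementation. The helper name `exceeded` and the usage figures are hypothetical; only the two limits are taken from the max block listed under the general queue above. Memory is left out to avoid unit conversion.

```python
# Hypothetical helper, not YuniKorn code: reports which resources in a
# queue's usage are above the queue's configured maximum.
def exceeded(usage, maximum):
    """Return {resource: (used, limit)} for every resource above its limit.

    Resources without a configured limit are treated as unlimited.
    """
    return {
        name: (used, maximum[name])
        for name, used in usage.items()
        if name in maximum and used > maximum[name]
    }


# Limits copied from the max block listed under the "general" queue.
general_max = {"vcore": 4000, "nvidia.com/gpu": 200}

# Made-up usage figures, for illustration only.
usage = {"vcore": 4100, "nvidia.com/gpu": 150, "ephemeral-storage": 10}

print(exceeded(usage, general_max))  # {'vcore': (4100, 4000)}
```

With these assumed numbers only vcore is over its limit; ephemeral-storage is ignored because no maximum is configured for it.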