[ https://issues.apache.org/jira/browse/YUNIKORN-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko updated YUNIKORN-2645:
-----------------------------------
Target Version: 1.7.0 (was: 1.6.0)
> Rate limit pod allocations on nodes
> -----------------------------------
>
> Key: YUNIKORN-2645
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2645
> Project: Apache YuniKorn
> Issue Type: New Feature
> Components: core - scheduler
> Affects Versions: 1.5.1
> Reporter: Dmitry
> Assignee: Qi Zhu
> Priority: Major
> Attachments: yunikorn-logs.txt.gz
>
>
> We had a broken node in the cluster: Kubernetes kept creating pods on it,
> and they immediately failed with the "OutOfGPU" status. The node had
> accumulated more than 1000 pods.
> The scheduler panicked (see the attached log) and stopped scheduling any
> other pods.
> The config:
> {code:yaml}
> apiVersion: v1
> data:
> admissionController.filtering.bypassNamespaces: ^kube-system$,^rook$,^rook-east$,^rook-central$,^rook-pacific$,^rook-south-east$,^rook-system$
> queues.yaml: |
> partitions:
> - name: default
> placementrules:
> - name: fixed
> value: root.scavenging.osg
> create: true
> filter:
> type: allow
> users:
> - system:serviceaccount:osg-ligo:prp-htcondor-provisioner
> - system:serviceaccount:osg-opportunistic:prp-htcondor-provisioner
> - system:serviceaccount:osg-icecube:prp-htcondor-provisioner
> - name: tag
> value: namespace
> create: true
> parent:
> name: tag
> value: namespace.parentqueue
> - name: tag
> value: namespace
> create: true
> parent:
> name: fixed
> value: general
> nodesortpolicy:
> type: fair
> resourceweights:
> vcore: 1.0
> memory: 1.0
> nvidia.com/gpu: 4.0
> queues:
> - name: root
> submitacl: '*'
> properties:
> application.sort.policy: fair
> queues:
> - name: system
> parent: true
> properties:
> preemption.policy: disabled
> - name: general
> parent: true
> childtemplate:
> properties:
> application.sort.policy: fair
> resources:
> guaranteed:
> vcore: 100
> memory: 1Ti
> nvidia.com/gpu: 8
> max:
> vcore: 4000
> memory: 15Ti
> nvidia.com/gpu: 200
> - name: scavenging
> parent: true
> childtemplate:
> resources:
> guaranteed:
> vcore: 1
> memory: 1G
> nvidia.com/gpu: 1
> properties:
> priority.offset: "-10"
> - name: interactive
> parent: true
> childtemplate:
> resources:
> guaranteed:
> vcore: 1000
> memory: 10T
> nvidia.com/gpu: 48
> nvidia.com/a100: 4
> properties:
> priority.offset: "10"
> preemption.policy: disabled
> - name: clemson
> parent: true
> properties:
> application.sort.policy: fair
> resources:
> guaranteed:
> vcore: 256
> memory: 2T
> nvidia.com/gpu: 24
> - name: nysernet
> parent: true
> properties:
> application.sort.policy: fair
> resources:
> guaranteed:
> vcore: 1000
> memory: 5T
> nvidia.com/gpu: 16
> - name: gpn
> parent: true
> properties:
> application.sort.policy: fair
> resources:
> guaranteed:
> vcore: 5000
> memory: 50T
> nvidia.com/gpu: 256
> nvidia.com/a100: 16
> - name: sdsu
> parent: true
> properties:
> application.sort.policy: fair
> resources:
> guaranteed:
> vcore: 1000
> memory: 15T
> nvidia.com/gpu: 112
> nvidia.com/a100: 64
> queues:
> - name: sdsu-jupyterhub
> parent: false
> properties:
> preemption.policy: disabled
> priority.offset: "10"
> resources:
> guaranteed:
> vcore: 700
> memory: 5T
> nvidia.com/gpu: 100
> - name: tide
> parent: true
> properties:
> application.sort.policy: fair
> resources:
> guaranteed:
> vcore: 592
> memory: 15T
> nvidia.com/gpu: 72
> queues:
> - name: rook-tide
> parent: false
> properties:
> preemption.policy: disabled
> priority.offset: "10"
> resources:
> guaranteed:
> vcore: 500
> memory: 1T
> - name: ucsc
> parent: true
> properties:
> application.sort.policy: fair
> resources:
> guaranteed:
> vcore: 500
> memory: 4T
> nvidia.com/gpu: 256
> - name: ucsd
> parent: true
> properties:
> application.sort.policy: fair
> resources:
> guaranteed:
> vcore: 40000
> memory: 40T
> nvidia.com/gpu: 512
> nvidia.com/a100: 100
> queues:
> - name: ry
> parent: true
> properties:
> application.sort.policy: fair
> resources:
> guaranteed:
> vcore: 512
> memory: 8T
> nvidia.com/gpu: 144
> - name: suncave
> parent: false
> properties:
> preemption.policy: disabled
> priority.offset: "10"
> resources:
> guaranteed:
> vcore: 1000
> memory: 1T
> - name: dimm
> parent: false
> properties:
> preemption.policy: disabled
> priority.offset: "1000"
> resources:
> guaranteed:
> vcore: 1000
> memory: 1T
> - name: haosu
> parent: true
> properties:
> application.sort.policy: fair
> resources:
> guaranteed:
> vcore: 5000
> memory: 10T
> nvidia.com/gpu: 120
> queues:
> - name: rook-haosu
> parent: false
> properties:
> preemption.policy: disabled
> priority.offset: "10"
> resources:
> guaranteed:
> vcore: 1000
> memory: 1T
> kind: ConfigMap
> metadata:
> creationTimestamp: "2023-12-21T06:09:12Z"
> name: yunikorn-configs
> namespace: yunikorn
> resourceVersion: "7764804169"
> uid: 5b9b2c04-57af-4cab-84f8-b5f018952f9c
> {code}
>
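To make the request in the issue title concrete, here is a minimal sketch of one way allocations could be rate limited per node, using a sliding time window. It is written in Python for brevity and is not YuniKorn code: the class `NodeAllocationLimiter`, the parameters `max_allocations` and `window_seconds`, and the node names are all hypothetical, and the issue itself does not specify any design.

```python
from collections import defaultdict, deque


class NodeAllocationLimiter:
    """Hypothetical per-node limiter: at most max_allocations per window_seconds."""

    def __init__(self, max_allocations, window_seconds):
        self.max_allocations = max_allocations
        self.window_seconds = window_seconds
        # node name -> timestamps of recent allocations, oldest first
        self._recent = defaultdict(deque)

    def try_allocate(self, node, now):
        """Record an allocation on `node` at time `now` (seconds) and return True,
        or return False if the node has already reached its limit for the window."""
        recent = self._recent[node]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_allocations:
            return False
        recent.append(now)
        return True


limiter = NodeAllocationLimiter(max_allocations=3, window_seconds=10.0)
# Five allocation attempts on the same node within one second.
results = [limiter.try_allocate("gpu-node-1", now=t * 0.2) for t in range(5)]
print(results)
# A different node is tracked separately from the first one.
print(limiter.try_allocate("gpu-node-2", now=1.0))
```

With a limit like this, a node that fails every pod it receives could absorb only a bounded number of allocations per window instead of accumulating 1000+ pods. Passing `now` explicitly keeps the sketch deterministic; a real scheduler would read a clock instead.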
--
This message was sent by Atlassian Jira
(v8.20.10#820010)