[
https://issues.apache.org/jira/browse/YUNIKORN-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850226#comment-17850226
]
Wilfred Spiegelenburg commented on YUNIKORN-2645:
-------------------------------------------------
Thank you [~dimm] for the logs, they helped.
The scheduler did not panic: a panic would have shown up as a restart of the
scheduler. It did log a message that should get your attention. If this happens,
your cluster and the scheduler are in a really bad state. We can only detect
this and revert the changes, but we cannot fix it from the scheduler side. We
keep on scheduling.
A panic would be caused by the logger and is expected when the logger runs in
development mode. This is all linked to the DPANIC level. We use
[DPANIC|https://pkg.go.dev/go.uber.org/zap#pkg-constants] in a couple of
places. What that level does is log the error and then cause a panic if the
logger is running in development mode. If not running in development mode you
just see the message. The logger should never be running in development mode
unless running as part of unit tests etc.
If you see these messages with a DPANIC level in production you have a serious
issue.
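To make the DPANIC behaviour concrete, here is a minimal stdlib-only sketch that mirrors the documented zap semantics (the {{dpanicLogger}} type and its {{development}} flag are illustrative, not YuniKorn or zap code): the message is always logged, but the panic only fires in development mode.

```go
package main

import (
	"fmt"
	"log"
)

// dpanicLogger models zap's DPANIC semantics: the message is always
// logged, but a panic is only raised when the logger runs in
// development mode (as it does under unit tests).
type dpanicLogger struct {
	development bool
}

func (l dpanicLogger) DPanic(msg string) {
	log.Printf("DPANIC\t%s", msg)
	if l.development {
		panic(msg) // development mode: fail fast so tests catch the bug
	}
	// production mode: keep running, the logged message is the only signal
}

func main() {
	prod := dpanicLogger{development: false}
	prod.DPanic("queue resource tracking inconsistent") // logged, scheduling continues

	dev := dpanicLogger{development: true}
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered from development-mode panic:", r)
		}
	}()
	dev.DPanic("queue resource tracking inconsistent") // logged, then panics
}
```

In production the first call is exactly what you saw: an alarming log line, no restart.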
Some background on the {{OutOfCpu}} message from the node: there was a change
in the K8s 1.22 kubelet to fix some resource issues. That change increased the
possibility of a race condition in the kubelet when scheduling short-lived pods
or pods that do not pass the node admission checks. A mitigation for that race
condition was added in 1.22.4, but there are still complaints about it
[regularly happening|https://github.com/kubernetes/kubernetes/issues/115325],
even in the latest K8s versions with the default K8s scheduler. High pod churn
and node or deployment scaling all seem to be related triggers. The SIG Node
team has said that it is as good as it will get without causing the original
issue to come back; they assessed that the original issue was far worse than
this one.
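When this race hits, the {{OutOfCpu}}/{{OutOfGpu}} pods are left behind on the node in the {{Failed}} phase until something cleans them up. A quick way to inspect and clear them with standard kubectl (the namespace below is only an example, adjust to your cluster):

```shell
# List pods that failed node admission (reasons such as OutOfcpu, OutOfgpu)
kubectl get pods --all-namespaces --field-selector=status.phase=Failed

# After inspecting them, delete the failed pods in one namespace
kubectl delete pods --field-selector=status.phase=Failed -n osg-opportunistic
```

Clearing these pods does not fix the kubelet race, it only removes the accumulated debris so the node and scheduler state stay manageable.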
> parent queue exceeds maximum resource
> -------------------------------------
>
> Key: YUNIKORN-2645
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2645
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.5.1
> Reporter: Dmitry
> Priority: Major
> Attachments: yunikorn-logs.txt.gz
>
>
> We had a broken node in the cluster: Kubernetes was creating pods on it that
> immediately failed with the "OutOfGPU" state. The node had 1000+ pods on it.
> The scheduler panicked with the attached log and was not scheduling any other
> pods.
> The config:
> {code:yaml}
> apiVersion: v1
> data:
>   admissionController.filtering.bypassNamespaces: ^kube-system$,^rook$,^rook-east$,^rook-central$,^rook-pacific$,^rook-south-east$,^rook-system$
>   queues.yaml: |
>     partitions:
>       - name: default
>         placementrules:
>           - name: fixed
>             value: root.scavenging.osg
>             create: true
>             filter:
>               type: allow
>               users:
>                 - system:serviceaccount:osg-ligo:prp-htcondor-provisioner
>                 - system:serviceaccount:osg-opportunistic:prp-htcondor-provisioner
>                 - system:serviceaccount:osg-icecube:prp-htcondor-provisioner
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>               name: tag
>               value: namespace.parentqueue
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>               name: fixed
>               value: general
>         nodesortpolicy:
>           type: fair
>           resourceweights:
>             vcore: 1.0
>             memory: 1.0
>             nvidia.com/gpu: 4.0
>         queues:
>           - name: root
>             submitacl: '*'
>             properties:
>               application.sort.policy: fair
>             queues:
>               - name: system
>                 parent: true
>                 properties:
>                   preemption.policy: disabled
>               - name: general
>                 parent: true
>                 childtemplate:
>                   properties:
>                     application.sort.policy: fair
>                   resources:
>                     guaranteed:
>                       vcore: 100
>                       memory: 1Ti
>                       nvidia.com/gpu: 8
>                     max:
>                       vcore: 4000
>                       memory: 15Ti
>                       nvidia.com/gpu: 200
>               - name: scavenging
>                 parent: true
>                 childtemplate:
>                   resources:
>                     guaranteed:
>                       vcore: 1
>                       memory: 1G
>                       nvidia.com/gpu: 1
>                   properties:
>                     priority.offset: "-10"
>               - name: interactive
>                 parent: true
>                 childtemplate:
>                   resources:
>                     guaranteed:
>                       vcore: 1000
>                       memory: 10T
>                       nvidia.com/gpu: 48
>                       nvidia.com/a100: 4
>                   properties:
>                     priority.offset: "10"
>                     preemption.policy: disabled
>               - name: clemson
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 256
>                     memory: 2T
>                     nvidia.com/gpu: 24
>               - name: nysernet
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 1000
>                     memory: 5T
>                     nvidia.com/gpu: 16
>               - name: gpn
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 5000
>                     memory: 50T
>                     nvidia.com/gpu: 256
>                     nvidia.com/a100: 16
>               - name: sdsu
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 1000
>                     memory: 15T
>                     nvidia.com/gpu: 112
>                     nvidia.com/a100: 64
>                 queues:
>                   - name: sdsu-jupyterhub
>                     parent: false
>                     properties:
>                       preemption.policy: disabled
>                       priority.offset: "10"
>                     resources:
>                       guaranteed:
>                         vcore: 700
>                         memory: 5T
>                         nvidia.com/gpu: 100
>               - name: tide
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 592
>                     memory: 15T
>                     nvidia.com/gpu: 72
>                 queues:
>                   - name: rook-tide
>                     parent: false
>                     properties:
>                       preemption.policy: disabled
>                       priority.offset: "10"
>                     resources:
>                       guaranteed:
>                         vcore: 500
>                         memory: 1T
>               - name: ucsc
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 500
>                     memory: 4T
>                     nvidia.com/gpu: 256
>               - name: ucsd
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 40000
>                     memory: 40T
>                     nvidia.com/gpu: 512
>                     nvidia.com/a100: 100
>                 queues:
>                   - name: ry
>                     parent: true
>                     properties:
>                       application.sort.policy: fair
>                     resources:
>                       guaranteed:
>                         vcore: 512
>                         memory: 8T
>                         nvidia.com/gpu: 144
>                   - name: suncave
>                     parent: false
>                     properties:
>                       preemption.policy: disabled
>                       priority.offset: "10"
>                     resources:
>                       guaranteed:
>                         vcore: 1000
>                         memory: 1T
>                   - name: dimm
>                     parent: false
>                     properties:
>                       preemption.policy: disabled
>                       priority.offset: "1000"
>                     resources:
>                       guaranteed:
>                         vcore: 1000
>                         memory: 1T
>                   - name: haosu
>                     parent: true
>                     properties:
>                       application.sort.policy: fair
>                     resources:
>                       guaranteed:
>                         vcore: 5000
>                         memory: 10T
>                         nvidia.com/gpu: 120
>                     queues:
>                       - name: rook-haosu
>                         parent: false
>                         properties:
>                           preemption.policy: disabled
>                           priority.offset: "10"
>                         resources:
>                           guaranteed:
>                             vcore: 1000
>                             memory: 1T
> kind: ConfigMap
> metadata:
>   creationTimestamp: "2023-12-21T06:09:12Z"
>   name: yunikorn-configs
>   namespace: yunikorn
>   resourceVersion: "7764804169"
>   uid: 5b9b2c04-57af-4cab-84f8-b5f018952f9c
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)