[
https://issues.apache.org/jira/browse/YUNIKORN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807869#comment-17807869
]
Yongjun Zhang edited comment on YUNIKORN-2030 at 1/17/24 9:24 PM:
------------------------------------------------------------------
As I shared with Wilfred earlier, YUNIKORN-1993 was not the root cause of our
problem; we have had it included in our build for quite a while now.
Here is a stack trace I got:
{code:java}
WARN objects/application.go:1506 queue update failed unexpectedly
{"error": "allocation (map[memory:25521291264 pods:1 vcore:2000]) puts queue
'root.x.y.z' over maximum allocation (map[memory:3328725483520 vcore:393976]),
current usage (map[memory:3320344215552 pods:131 vcore:262000])"}{code}
{code:java}
created by
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/scheduler.go:66
+0x21c
goroutine 123 [running]:
runtime/debug.Stack()
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/.go/src/runtime/debug/stack.go:24
+0x64
runtime/debug.PrintStack()
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/.go/src/runtime/debug/stack.go:16
+0x1c
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNode(0x4058d6a960,
0x40449b3970, 0x40aa3e49a0)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/application.go:1508
+0x210
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNodesNoReserve(0x402342d080?,
0x40aa3e49a0, {0x19c78f0, 0x406a71fa40}, {0x4004b34060, 0x1e})
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/application.go:1382
+0xd4
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryReservedAllocate(0x4058d6a960,
0x40a349d7a8, 0x402342dd98)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/application.go:1305
+0x3a0
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).TryReservedAllocate(0x4024554d80,
0x4052ac3b00?)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/queue.go:1378
+0x304
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).TryReservedAllocate(0x4024554c00,
0x4052ac3b00?)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/queue.go:1397
+0xd0
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).TryReservedAllocate(0x400c92f200,
0x24820?)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/queue.go:1397
+0xd0
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).TryReservedAllocate(0x400c92ec00,
0x0?)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/queue.go:1397
+0xd0
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).tryReservedAllocate(0x402375a6c0)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/partition.go:814
+0x90
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).schedule(0x402342df88?)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/context.go:131
+0xf4
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).internalSchedule(0x40003bda70)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/scheduler.go:75
+0x7c
created by
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/scheduler.go:66
+0x21c {code}
> Headroom check doesn't prevent failure to allocate resource due to max
> resource limit exceeded
> ----------------------------------------------------------------------
>
> Key: YUNIKORN-2030
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2030
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
> Priority: Major
>
> As reported in YUNIKORN-1996, we are seeing many messages like below from
> time to time:
> {code:java}
> WARN objects/application.go:1504 queue update failed unexpectedly
> {"error": "allocation (map[memory:37580963840 pods:1 vcore:2000]) puts
> queue 'root.test-queue' over maximum allocation (map[memory:3300011278336
> vcore:390584]), current usage (map[memory:3291983380480 pods:91
> vcore:186000])"}{code}
> Restarting YuniKorn stops it. Creating this Jira to investigate why it
> happens, because it's not supposed to: we check that there is enough
> resource headroom before calling
>
> {code:java}
> func (sa *Application) tryNode(node *Node, ask *AllocationAsk) *Allocation
> {code}
> which printed the above message, and we only call it when there is enough
> headroom.
> There may be a bug in the headroom check.
>
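One way the description's symptom could arise (purely a hypothesis on my part, not confirmed against the actual yunikorn-core code) is a time-of-check/time-of-use gap: two asks both pass the headroom check against the same usage snapshot, and the second queue update then pushes usage over the maximum. A minimal Go sketch, where Queue and the single "memory" dimension are simplified stand-ins rather than the real scheduler types:

```go
package main

import "fmt"

// Queue is a simplified stand-in for a scheduler queue, tracking a
// single resource dimension ("memory") instead of a resource map.
type Queue struct {
	max  int64 // maximum allocation
	used int64 // current usage
}

// headroom is the remaining capacity as seen at check time.
func (q *Queue) headroom() int64 { return q.max - q.used }

// tryUpdate applies an allocation, failing if it would exceed max --
// analogous to the "over maximum allocation" warning in the report.
func (q *Queue) tryUpdate(alloc int64) error {
	if q.used+alloc > q.max {
		return fmt.Errorf("allocation (%d) puts queue over maximum allocation (%d), current usage (%d)",
			alloc, q.max, q.used)
	}
	q.used += alloc
	return nil
}

func main() {
	q := &Queue{max: 100, used: 90}
	askA, askB := int64(8), int64(8)

	// Both asks are checked against the SAME usage snapshot
	// (used=90, headroom=10), so each passes individually.
	fmt.Println(askA <= q.headroom(), askB <= q.headroom()) // true true

	// The first update succeeds; the second now exceeds max, even
	// though headroom "was" sufficient when it was checked.
	fmt.Println(q.tryUpdate(askA)) // <nil>
	fmt.Println(q.tryUpdate(askB)) // over maximum allocation error
}
```

If something like this is happening, the headroom check and the usage update would need to happen under the same lock (or the update would need to re-validate), but again, this is only an illustration of the failure shape, not a claim about where the real bug is.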
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]