[
https://issues.apache.org/jira/browse/YUNIKORN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807869#comment-17807869
]
Yongjun Zhang edited comment on YUNIKORN-2030 at 1/17/24 9:24 PM:
------------------------------------------------------------------
As I shared with Wilfred earlier, YUNIKORN-1993 was not the root cause of our
problem; we have had it included in our build for quite a while now.
Here is a stack trace I got:
{code:java}
WARN objects/application.go:1506 queue update failed unexpectedly
{"error": "allocation (map[memory:25521291264 pods:1 vcore:2000]) puts queue
'root.x.y.z' over maximum allocation (map[memory:3328725483520 vcore:393976]),
current usage (map[memory:3320344215552 pods:131 vcore:262000])"}{code}
{code:java}
created by
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/scheduler.go:66
+0x21c
goroutine 123 [running]:
runtime/debug.Stack()
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/.go/src/runtime/debug/stack.go:24
+0x64
runtime/debug.PrintStack()
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/.go/src/runtime/debug/stack.go:16
+0x1c
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNode(0x4058d6a960,
0x40449b3970, 0x40aa3e49a0)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/application.go:1508
+0x210
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNodesNoReserve(0x402342d080?,
0x40aa3e49a0, {0x19c78f0, 0x406a71fa40}, {0x4004b34060, 0x1e})
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/application.go:1382
+0xd4
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryReservedAllocate(0x4058d6a960,
0x40a349d7a8, 0x402342dd98)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/application.go:1305
+0x3a0
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).TryReservedAllocate(0x4024554d80,
0x4052ac3b00?)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/queue.go:1378
+0x304
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).TryReservedAllocate(0x4024554c00,
0x4052ac3b00?)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/queue.go:1397
+0xd0
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).TryReservedAllocate(0x400c92f200,
0x24820?)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/queue.go:1397
+0xd0
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).TryReservedAllocate(0x400c92ec00,
0x0?)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/objects/queue.go:1397
+0xd0
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).tryReservedAllocate(0x402375a6c0)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/partition.go:814
+0x90
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).schedule(0x402342df88?)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/context.go:131
+0xf4
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).internalSchedule(0x40003bda70)
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/scheduler.go:75
+0x7c
created by
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService
/mnt/jenkins-data/workspace/build-yunikorn-docker-image-worker/yunikorn-core/pkg/scheduler/scheduler.go:66
+0x21c {code}
> Headroom check doesn't prevent failure to allocate resource due to max
> resource limit exceeded
> ----------------------------------------------------------------------
>
> Key: YUNIKORN-2030
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2030
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
> Priority: Major
>
> As reported in YUNIKORN-1996, we are seeing many messages like below from
> time to time:
> {code:java}
> WARN objects/application.go:1504 queue update failed unexpectedly
> {"error": "allocation (map[memory:37580963840 pods:1 vcore:2000]) puts
> queue 'root.test-queue' over maximum allocation (map[memory:3300011278336
> vcore:390584]), current usage (map[memory:3291983380480 pods:91
> vcore:186000])"}{code}
> Restarting YuniKorn stops it. Creating this Jira to investigate why it
> happens, because it's not supposed to: we check that there is enough
> resource headroom before calling
>
> {code:java}
> func (sa *Application) tryNode(node *Node, ask *AllocationAsk) *Allocation
> {code}
> which printed the above message, and we only call it when there is enough
> headroom.
> There may be a bug in the headroom check.
>
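One way the description's symptom could arise (purely a hypothesis on my part, not confirmed against the actual yunikorn-core code) is a time-of-check/time-of-use gap: two asks both pass the headroom check against the same usage snapshot, and the second queue update then pushes usage over the maximum. A minimal Go sketch, where Queue and the single "memory" dimension are simplified stand-ins rather than the real scheduler types:

```go
package main

import "fmt"

// Queue is a simplified stand-in for a scheduler queue, tracking a
// single resource dimension ("memory") instead of a resource map.
type Queue struct {
	max  int64 // maximum allocation
	used int64 // current usage
}

// headroom is the remaining capacity as seen at check time.
func (q *Queue) headroom() int64 { return q.max - q.used }

// tryUpdate applies an allocation, failing if it would exceed max --
// analogous to the "over maximum allocation" warning in the report.
func (q *Queue) tryUpdate(alloc int64) error {
	if q.used+alloc > q.max {
		return fmt.Errorf("allocation (%d) puts queue over maximum allocation (%d), current usage (%d)",
			alloc, q.max, q.used)
	}
	q.used += alloc
	return nil
}

func main() {
	q := &Queue{max: 100, used: 90}
	askA, askB := int64(8), int64(8)

	// Both asks are checked against the SAME usage snapshot
	// (used=90, headroom=10), so each passes individually.
	fmt.Println(askA <= q.headroom(), askB <= q.headroom()) // true true

	// The first update succeeds; the second now exceeds max, even
	// though headroom "was" sufficient when it was checked.
	fmt.Println(q.tryUpdate(askA)) // <nil>
	fmt.Println(q.tryUpdate(askB)) // over maximum allocation error
}
```

If something like this is happening, the headroom check and the usage update would need to happen under the same lock (or the update would need to re-validate), but again, this is only an illustration of the failure shape, not a claim about where the real bug is.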
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]