[ https://issues.apache.org/jira/browse/YUNIKORN-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774733#comment-17774733 ]

Yongjun Zhang commented on YUNIKORN-2030:
-----------------------------------------

It looks to me like the problem might be that when the scheduler performs an 
allocation for an application, it acquires only the application lock, not the 
queue lock:
{code:java}
// tryAllocate will perform a regular allocation of a pending request, includes placeholders.
func (sa *Application) tryAllocate(headRoom *resources.Resource, preemptionDelay time.Duration, preemptAttemptsRemaining *int, nodeIterator func() NodeIterator, fullNodeIterator func() NodeIterator, getNodeFn func(string) *Node) *Allocation {
  sa.Lock()
  defer sa.Unlock()
  ...{code}
 

If the scheduler schedules for multiple applications concurrently, it could 
update the queue usage while doing the allocation for one app, and the 
allocation for another app could then fail with the symptom reported in this 
Jira.

The question is: does the scheduler perform allocation for multiple 
applications concurrently?

> Check Headroom checking doesn't prevent failure to allocate resource due to 
> max resource limit exceeded
> -------------------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2030
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2030
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>            Priority: Major
>
> As reported in YUNIKORN-1996, we are seeing many messages like below from 
> time to time:
> {code:java}
> WARN    objects/application.go:1504     queue update failed unexpectedly        {"error": "allocation (map[memory:37580963840 pods:1 vcore:2000]) puts queue 'root.test-queue' over maximum allocation (map[memory:3300011278336 vcore:390584]), current usage (map[memory:3291983380480 pods:91 vcore:186000])"}{code}
> Restarting YuniKorn stops it. Creating this Jira to investigate why it 
> happened, because it is not supposed to happen: we check whether there is 
> enough resource headroom before calling
>  
> {code:java}
> func (sa *Application) tryNode(node *Node, ask *AllocationAsk) *Allocation 
> {code}
> which printed the above message, and only call it when there is enough 
> headroom.
> There may be a bug in the headroom checking?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
