[ 
https://issues.apache.org/jira/browse/YUNIKORN-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272525#comment-17272525
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-516:
------------------------------------------------

The current comment in the code is this:
{code:java}
// Remove the allocation(s) from the app and nodes
// YUNIKORN-461 proposes to remove this, with a side note that this currently could lead to a deadlock
func (pc *PartitionContext) removeAllocation(release *si.AllocationRelease) ([]*objects.Allocation, *objects.Allocation) {
{code}
So yes that is a known issue which we covered under YUNIKORN-461.

Let me check whether we need to keep that list; closing this as a duplicate of 
YUNIKORN-461.

> Yunikorn scheduler seems to be in deadlock state
> ------------------------------------------------
>
>                 Key: YUNIKORN-516
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-516
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Ayub Pathan
>            Assignee: Wilfred Spiegelenburg
>            Priority: Blocker
>         Attachments: metrics, stack, yk.log
>
>
> Apply the job templates below to reproduce the issue.
>  # First application, with gang scheduling annotations
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-1
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-1"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "tg1",
>               "minMember": 2,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "500M"
>               },
>               "nodeSelector": {},
>               "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M" {noformat}
>  
> 2. Second application, submitted to the same task group
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-2
> spec:
>   completions: 4
>   parallelism: 4
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-2"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "tg1",
>               "minMember": 2,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "500M"
>               },
>               "nodeSelector": {},
>               "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M"{noformat}
>  
> 3. Third application, submitted to the same task group
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-3
> spec:
>   completions: 10
>   parallelism: 10
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-3"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "tg1",
>               "minMember": 3,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "500M"
>               },
>               "nodeSelector": {},
>               "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M" {noformat}
> Now it can be seen that the third application remains in the Pending state 
> even though its placeholder pods were created and terminated.
> {noformat}
> NAME                     READY  STATUS     RS  CPU  MEM  %CPU/R  %MEM/R  %CPU/L  %MEM/L  IP               NODE                                             QOS  AGE
> batch-sleep-job-1-7lrd5  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.208  ip-10-192-143-108.ca-central-1.compute.internal  BU   18m
> batch-sleep-job-1-lw4t9  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.213  ip-10-192-136-201.ca-central-1.compute.internal  BU   18m
> batch-sleep-job-2-c95dg  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.210  ip-10-192-143-108.ca-central-1.compute.internal  BU   17m
> batch-sleep-job-2-vnfjb  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.211  ip-10-192-143-108.ca-central-1.compute.internal  BU   17m
> batch-sleep-job-2-x4mcz  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.216  ip-10-192-136-201.ca-central-1.compute.internal  BU   17m
> batch-sleep-job-2-ztnfq  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.217  ip-10-192-136-201.ca-central-1.compute.internal  BU   17m
> batch-sleep-job-3-7tp5t  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-59mnj  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-bm4fd  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-c4mxg  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-cljfj  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-gcvnp  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-gwgnn  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-kj88t  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-p8c7w  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-td575  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m{noformat}
> Attaching the [^stack] trace, [^yk.log], and the [^metrics] API response for reference. 
> This was observed with a v0.10 build.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
