[ https://issues.apache.org/jira/browse/YUNIKORN-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799176#comment-17799176 ]

Timothy Potter commented on YUNIKORN-2280:
------------------------------------------

Here's the goroutine dump: [^goroutine-dump.out]. The full state dump has a lot 
of information about our jobs in it, so I need to clean that up a bit before 
posting it publicly. This cluster runs about 10,000 Spark pods on 27K CPU cores 
(420 nodes). It's fairly busy, but our jobs are long running, so there aren't 
that many allocations happening concurrently; the number of running pods is 
fairly flat (some new jobs coming in, others finishing).
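
In case it helps with reproducing the collection: below is a minimal sketch (not YuniKorn's actual wiring) of how a Go process can expose heap and goroutine dumps over HTTP via net/http/pprof. The port 6060 and the standalone main function are illustrative assumptions.

{code}
// Minimal sketch: expose pprof endpoints on a side port so heap and
// goroutine dumps can be pulled with curl or `go tool pprof`.
// The listen address is an illustrative assumption.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	go func() {
		// e.g. curl http://localhost:6060/debug/pprof/heap > heap-dump.out
		//      curl "http://localhost:6060/debug/pprof/goroutine?debug=2" > goroutine-dump.out
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the scheduler's real work
}
{code}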

> Possible memory leak in scheduler
> ---------------------------------
>
>                 Key: YUNIKORN-2280
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2280
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.3.0, 1.4.0
>         Environment: EKS 1.24, we observed same behavior with YK 1.3.0 & 1.4.0
>            Reporter: Timothy Potter
>            Priority: Major
>         Attachments: goroutine-dump.out, heap-dump-1001.out, 
> heap-dump-1255.out, yunikor-scheduler-process-memory.png, 
> yunikorn-process-memory-last9hours.png, yunikorn-scheduler-goroutines.png
>
>
> Memory usage for our scheduler pod slowly increases until the pod is killed by 
> the kubelet for exceeding its memory limit.
> I've attached two heap dumps collected about 3 hours apart; see the process 
> memory chart for the same period. I'm not really sure what to make of these 
> heap dumps, so I'm hoping someone who knows the code better might have some 
> insights.
> From heap-dump-1001.out:
> {code}
>       flat  flat%   sum%        cum   cum%
>     1.46GB 24.68% 24.68%     1.46GB 24.68%  reflect.unsafe_NewArray
>     1.30GB 21.94% 46.63%     1.32GB 22.35%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).literalStore
>     1.06GB 17.96% 64.58%     1.06GB 17.96%  k8s.io/apimachinery/pkg/apis/meta/v1.(*FieldsV1).UnmarshalJSON
>     0.88GB 14.87% 79.45%     0.88GB 14.87%  reflect.mapassign_faststr0
> {code}
> From heap-dump-1255.out:
> {code}
>       flat  flat%   sum%        cum   cum%
>  1756.18MB 23.53% 23.53%  1756.18MB 23.53%  reflect.unsafe_NewArray
>  1612.36MB 21.60% 45.13%  1645.86MB 22.05%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).literalStore
>  1359.86MB 18.22% 63.35%  1359.86MB 18.22%  k8s.io/apimachinery/pkg/apis/meta/v1.(*FieldsV1).UnmarshalJSON
>  1136.40MB 15.22% 78.57%  1136.40MB 15.22%  reflect.mapassign_faststr0
> {code}
> We also see odd spikes in the number of goroutines, but that doesn't seem 
> correlated with the increase in memory (mainly mentioning this in case 
> it's unexpected).
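
The heap profiles in the description are dominated by JSON decoding of metav1.FieldsV1 (managedFields), which usually points at decoded Kubernetes objects being retained in client-go informer caches. As a hedged sketch only (not YuniKorn's actual shim code), one possible direction is an informer transform that strips managedFields before objects are cached; the kubeconfig wiring, resync period, and function names below are illustrative assumptions.

{code}
// Hedged sketch: drop metadata.managedFields (the FieldsV1 blobs seen in
// the heap profiles) before Pods are stored in the informer cache.
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

// stripManagedFields clears managedFields so the decoded FieldsV1 data
// is not retained for every cached object.
func stripManagedFields(obj interface{}) (interface{}, error) {
	if accessor, err := meta.Accessor(obj); err == nil {
		accessor.SetManagedFields(nil)
	}
	return obj, nil
}

// newPodInformer wires a Pod informer with the transform installed.
// The 30s resync period and kubeconfig handling are assumptions.
func newPodInformer(kubeconfig string) (cache.SharedIndexInformer, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	// Must be set before the informer is started.
	if err := podInformer.SetTransform(stripManagedFields); err != nil {
		return nil, err
	}
	return podInformer, nil
}
{code}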


