[
https://issues.apache.org/jira/browse/YUNIKORN-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493801#comment-17493801
]
Anuraag Nalluri commented on YUNIKORN-946:
------------------------------------------
We can conclude that this bug is caused by the issue fixed in YUNIKORN-776.
To verify this, we built the scheduler at two commits: the ones immediately
preceding and following the merge of YUNIKORN-776. We ran spark-pi applications
on both schedulers and supplied custom applicationIds that conflict with the
default Spark job IDs. Before YUNIKORN-776, the application was initially
created under the Spark job ID, while the completion event surfaced on the
dashboard under the custom applicationId we provided. In other words, the
api-server's pod-delete event informed the wrong application, leaving a hanging
allocation under the application keyed by the Spark job ID.
In the commit following YUNIKORN-776, we started 3 spark-pi applications with
custom applicationIds. In _all_ cases the allocation was both issued for and
freed from the provided applicationId. This makes sense because the logic now
always checks for the applicationId label before falling back to the
spark-generated app ID:
[https://github.com/apache/incubator-yunikorn-k8shim/pull/288/files]
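The lookup order the fix establishes can be sketched roughly as follows (label keys and the function name are assumptions for illustration; see the linked PR for the shim's actual implementation):

```go
package main

import "fmt"

// appIDFromLabels resolves the application ID for a pod, preferring a
// user-supplied "applicationId" label over Spark's auto-generated
// "spark-app-selector" label, so both allocation and release resolve
// to the same application.
func appIDFromLabels(labels map[string]string) string {
	// A user-provided applicationId always wins.
	if id, ok := labels["applicationId"]; ok {
		return id
	}
	// Otherwise fall back to the ID Spark generates for the job.
	if id, ok := labels["spark-app-selector"]; ok {
		return id
	}
	return ""
}

func main() {
	pod := map[string]string{
		"applicationId":      "custom-app-001",
		"spark-app-selector": "spark-app-1234",
	}
	fmt.Println(appIDFromLabels(pod)) // prints custom-app-001
}
```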
Screenshots showing both scenarios are attached to this ticket. Thank you
[~ashutosh-pepper] for reporting and [~wilfreds] for providing additional
context.
> Accounting resources for deleted executor pods
> ----------------------------------------------
>
> Key: YUNIKORN-946
> URL: https://issues.apache.org/jira/browse/YUNIKORN-946
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 0.11
> Reporter: Ashutosh Singh
> Assignee: Anuraag Nalluri
> Priority: Critical
> Attachments: image-2021-11-16-23-17-42-819.png,
> image-2021-11-16-23-18-28-349.png
>
>
> Even after executors are deleted, the YK UI shows resources as still consumed
> by the deleted pod. _kubectl get pods_ no longer lists the executor, but the
> YK UI still shows the deleted pod consuming resources even after a few hours.
> This leaks cluster resources.
> Steps:
> # Run a spark application using k8s spark operator
> # Wait for the executors to reach the running state.
> # Delete the application using `kubectl delete sparkapplications <appName>`
> OR `kubectl delete -f <yaml-file>`
> # All driver and executor pods should be deleted; check `kubectl get pods`
> # However, the YK UI still shows some of the executors running and consuming
> resources. This leaks resources, as they are counted as used and cannot be
> allocated to pending pods.
> More details:
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1637126093006900]
> !image-2021-11-16-23-18-28-349.png|width=534,height=323!
>
> !image-2021-11-16-23-17-42-819.png|width=583,height=353!
--
This message was sent by Atlassian Jira
(v8.20.1#820001)