[
https://issues.apache.org/jira/browse/YUNIKORN-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276393#comment-17276393
]
Kinga Marton edited comment on YUNIKORN-460 at 2/1/21, 3:04 PM:
----------------------------------------------------------------
Today we had a sync with [~wilfreds] on this topic. I am summarising here what
we discussed:
When it comes to the timeout, we have 2 cases:
# The queue is full, so only some of the placeholders got allocated (for
example, the app asks for 100GB but the placeholders are using only 50GB).
# All the placeholders are allocated, but not all of them were replaced by
real pods (this can be due to a configuration issue, or because something
changed in the cluster).
If the timeout expires, we will kill the placeholder pods in both cases, but we
will not kill the whole application, so in the second case the real pods that
are already running will continue to do their job. We kill only the placeholders.
* We will start the timer when the first placeholder is allocated.
* When it times out, we kill all the remaining placeholders, if there are any.
[~wwei], regarding the new state you mentioned: I don't think that we can add
it, because when the first placeholder is replaced by the new pod, the
application is already transitioning into the Running state. I don't think
it is a good idea to treat a simple app differently from one with a gang
defined when it comes to deciding when it starts running.
[~wilfreds] please correct me if I am wrong, or if I missed something.
> Handle app reservation timeout
> ------------------------------
>
> Key: YUNIKORN-460
> URL: https://issues.apache.org/jira/browse/YUNIKORN-460
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: shim - kubernetes
> Reporter: Weiwei Yang
> Assignee: Kinga Marton
> Priority: Major
>
> When an app is configured with a timeout, that determines the maximum time
> permitted to stay in the Reserving phase. If that times out, then all the
> existing placeholders should be deleted and the application will be scheduled
> normally. This timeout is needed because otherwise an app’s partial
> placeholders may occupy cluster resources and they are wasted.
> See more in [this
> doc|https://docs.google.com/document/d/1P-g4plXIJ9Xybp-jyKySI18P3rkGQPuTutGYhv1LaQ8/edit#heading=h.ebk2htgnnrex]