[ 
https://issues.apache.org/jira/browse/YUNIKORN-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277760#comment-17277760
 ] 

Weiwei Yang edited comment on YUNIKORN-460 at 3/2/21, 8:20 AM:
---------------------------------------------------------------

Hi [~kmarton], your notes captured our discussion. Please review the following 
cases and make sure the draft PR handles all of them (except the retry part).

1. the scheduler tried, but none of the app's pending placeholders got 
allocated; all pods are pending
after the timeout, all pending placeholders should be deleted
the app is failed {color:red}[1]{color}

2. only some of the app's placeholders are allocated, the rest are still pending
this means there is no real pending ask yet
after the timeout, all allocated and pending placeholders should be deleted
the app is failed

3. all of the app's placeholders are allocated, but no real pods are submitted
after the timeout, all allocated placeholders should be deleted
the app transitions to the completed state

4. all of the app's placeholders are allocated, but only some of them get 
replaced (min gang member > actual task number)
after the timeout, the app transitions to the completed state
all remaining placeholders should be deleted

{color:red}[1]{color} Why fail the app? Because the scheduler tried to make the 
reservation for the app but could not complete it. When the app is failed, we 
can simply notify the shim about the app's state, and the shim can then 
release all placeholders. Note that the app's real pods will remain pending, 
so the client side needs to clean up the job. We can later build the "retry" 
logic on the shim side to re-submit the app after, e.g., a few minutes.
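
The four cases above reduce to one rule: on timeout, delete every remaining 
placeholder; fail the app if any placeholder is still pending, otherwise 
complete it. Below is a rough sketch of that rule only, not the code in the 
draft PR. All names (PlaceholderCounts, TimeoutAction, on_reservation_timeout) 
are made up for illustration and do not exist in YuniKorn.

```python
from dataclasses import dataclass
from enum import Enum


class AppOutcome(Enum):
    """Hypothetical terminal outcomes after the reservation timeout."""
    FAILED = "Failed"
    COMPLETED = "Completed"


@dataclass(frozen=True)
class PlaceholderCounts:
    """Hypothetical snapshot of an app's placeholders when the timeout fires."""
    pending: int       # placeholders still waiting for allocation
    allocated: int     # placeholders allocated and not yet replaced
    replaced: int = 0  # placeholders already replaced by real pods


@dataclass(frozen=True)
class TimeoutAction:
    """What the sketch decides to do on timeout."""
    outcome: AppOutcome
    delete_pending: int
    delete_allocated: int


def on_reservation_timeout(counts: PlaceholderCounts) -> TimeoutAction:
    # Cases 1 and 2: at least one placeholder never got allocated, so the
    # reservation did not go through and the app is failed.
    # Cases 3 and 4: every placeholder was allocated, so the app completes.
    if counts.pending > 0:
        outcome = AppOutcome.FAILED
    else:
        outcome = AppOutcome.COMPLETED
    # In all four cases, every remaining placeholder is deleted.
    return TimeoutAction(outcome, counts.pending, counts.allocated)
```

For example, case 2 with 2 allocated and 3 pending placeholders gives a failed 
app with all 5 placeholders deleted, while case 4 with 2 allocated and 3 
replaced gives a completed app with the 2 leftover placeholders deleted.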


> Handle app reservation timeout
> ------------------------------
>
>                 Key: YUNIKORN-460
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-460
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>            Reporter: Weiwei Yang
>            Assignee: Kinga Marton
>            Priority: Major
>              Labels: pull-request-available
>
> When an app is configured with a timeout, it determines the maximum time 
> the app may stay in the Reserving phase. If it times out, all existing 
> placeholders should be deleted and the application will be scheduled 
> normally. This timeout is needed because otherwise an app’s partial 
> placeholders may occupy cluster resources that are then wasted.
> See more in [this 
> doc|https://docs.google.com/document/d/1P-g4plXIJ9Xybp-jyKySI18P3rkGQPuTutGYhv1LaQ8/edit#heading=h.ebk2htgnnrex]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
