[ 
https://issues.apache.org/jira/browse/YUNIKORN-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manikandan R updated YUNIKORN-2323:
-----------------------------------
    Description: 
When something goes wrong, users find it a bit difficult to understand what is 
going on with the gang app. 

Issue 1:

"driver pod is getting stuck"

At times, when the driver pod is not able to run successfully for some reason, 
users get the impression that the pod is stuck and the app is hung, not moving 
any further. Users wait for some time without getting a clear picture of what 
is happening. How do we close this gap quickly and communicate the situation 
through events?

Issue 2:

ResumeApplication is fired when all placeholders (ph's) have timed out. Do we 
need to inform users about this event, since they may otherwise have no clue 
about this significant change?

Issue 3: 

When the gang app's placeholders are in progress (and allocated), and a request 
for real asks arrives while there is a resource crunch, do we need to trigger 
autoscaling?

  was:
For a gang app, ResumeApplicationEvent is set as part of the 
timeoutPlaceholderProcessing process if needed. This event moves the app to 
running only when the source state (src) is either new or accepted. A gang app 
moves to the starting state once all placeholder asks have been added to the 
application. So a situation in which the resume event triggers and does the 
expected thing never arises. In addition, the app might also have transitioned 
into the running state because the app start timer expired (default is 5 mins). 
The timer moves the state to running without being aware of the current 
situation, which is not the right thing to do.
 
Ideally, in the worst case, a gang app should continue to run as a normal app, 
but given the above scenarios, that does not happen.
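
The state problem described above can be illustrated with a toy model. This is 
only a sketch under stated assumptions: the State, App, AddAllPlaceholders and 
Resume names are hypothetical and are not YuniKorn's actual types or API. It 
only encodes the two rules stated in the description (resume is honoured from 
new or accepted; a gang app is in starting once all placeholder asks are added).

```go
package main

import "fmt"

// State is a simplified, hypothetical application state. The names mirror the
// states mentioned in the description, not YuniKorn's real state machine.
type State string

const (
	New      State = "New"
	Accepted State = "Accepted"
	Starting State = "Starting"
	Running  State = "Running"
)

// resumeAllowedFrom lists the source states from which a resume event may
// move the app to Running, as stated in the description.
var resumeAllowedFrom = map[State]bool{New: true, Accepted: true}

// App is a toy application that tracks only its state.
type App struct {
	State State
}

// AddAllPlaceholders models "a gang app moves to the starting state once all
// placeholder asks have been added". Simplified: no other checks are made.
func (a *App) AddAllPlaceholders() {
	a.State = Starting
}

// Resume models the resume event. It reports whether the transition to
// Running actually happened; from any other source state it is a no-op.
func (a *App) Resume() bool {
	if !resumeAllowedFrom[a.State] {
		return false
	}
	a.State = Running
	return true
}

func main() {
	app := &App{State: Accepted}
	app.AddAllPlaceholders()
	// By the time placeholders can time out, the app is already Starting,
	// so the resume event has no effect.
	fmt.Println(app.State, app.Resume())
}
```

Running this prints "Starting false": the resume event is dropped because the 
app has already left the only states from which it would be honoured.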
 


> Gang scheduling user experience issues
> --------------------------------------
>
>                 Key: YUNIKORN-2323
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2323
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.4.0
>            Reporter: Manikandan R
>            Assignee: Manikandan R
>            Priority: Major


