[jira] [Commented] (YUNIKORN-2323) Gang scheduling user experience issues

Wilfred Spiegelenburg (Jira) Mon, 22 Jan 2024 01:53:08 -0800


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809334#comment-17809334
 ]


Wilfred Spiegelenburg commented on YUNIKORN-2323:
-------------------------------------------------

I completely agree: we need events and web UI support for all of these.
 # We need to clarify the event that is already generated and added to the 
originating pod. It should clearly show that gang scheduling is in use and 
what is happening.
 # The fix for this issue should be as simple as an event on the originating 
pod with clear text, showing that the app is no longer being gang scheduled. 
All timed out placeholders should also get clear events.
 # This should be fixed as part of the existing 
[PR|https://github.com/apache/yunikorn-core/pull/745]; that optimisation is 
the correct place to add it.

We should use {{GangScheduling}} as the event reason for all these messages, 
with the specific cause as the detail in the event message. That makes it 
easy for anyone to look at the pod events and see the status of the gang 
scheduling cycle.
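
A minimal sketch of that idea, not YuniKorn code: {{PodEvent}}, {{gang_event}} 
and {{gang_status}} are hypothetical names used only for illustration, and the 
message texts are made up. It assumes one fixed reason string with the 
specifics carried in the message, so a single filter on the reason recovers 
the gang scheduling history of a pod.

```python
from dataclasses import dataclass

# Assumption: one shared reason string, as proposed in the comment above.
GANG_SCHEDULING_REASON = "GangScheduling"


@dataclass(frozen=True)
class PodEvent:
    """Hypothetical stand-in for a pod event; not a YuniKorn or Kubernetes type."""
    pod: str
    reason: str
    message: str


def gang_event(pod: str, detail: str) -> PodEvent:
    """Build an event with the fixed reason and the specifics in the message."""
    return PodEvent(pod=pod, reason=GANG_SCHEDULING_REASON, message=detail)


def gang_status(events: list[PodEvent], pod: str) -> list[str]:
    """Return the gang scheduling messages recorded for one pod, in order."""
    return [
        e.message
        for e in events
        if e.pod == pod and e.reason == GANG_SCHEDULING_REASON
    ]


# Illustrative event stream for a driver pod (messages are invented).
events = [
    gang_event("driver-pod", "waiting for placeholders to be allocated"),
    PodEvent("driver-pod", "Scheduled", "assigned to a node"),
    gang_event("driver-pod", "placeholders timed out, app is no longer gang scheduled"),
]

for message in gang_status(events, "driver-pod"):
    print(message)
```

Because every gang related event shares the same reason, unrelated events on 
the pod are filtered out without any parsing of the message text.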

We already have an old Jira, YUNIKORN-570, to show pending asks for 
applications in the web UI: not just the resources, which we already have in 
1.4, but the real asks. Having that detail easily viewable will help a lot.

> Gang scheduling user experience issues
> --------------------------------------
>
>                 Key: YUNIKORN-2323
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2323
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.4.0
>            Reporter: Manikandan R
>            Assignee: Manikandan R
>            Priority: Major
>
> When something goes wrong, users find it a bit difficult to understand what 
> is going on with the gang app. 
> Issue 1:
> "driver pod is getting stuck"
> At times, when the driver pod is not able to run successfully for some 
> reason, users get the impression that the pod is stuck and the app is hung, 
> not moving further. Users wait for some time and do not get a clear 
> picture. How do we close the gap quickly and communicate accordingly 
> through events?
> Issue 2:
> ResumeApplication is fired when all placeholders (ph's) have timed out. Do 
> we need to inform the users about this event, as they may not have any clue 
> about this significant change?
> Issue 3: 
> When the gang app's placeholders are in progress (and allocated), there is 
> a request for real asks and there is a resource crunch, do we need to 
> trigger auto scaling?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
