[
https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071340#comment-17071340
]
Wangda Tan commented on YUNIKORN-42:
------------------------------------
[~wilfreds], thanks for sharing your feedback.
I agree that pushing all events to the POD/namespace description is not
comprehensive. We may need something like YARN-9050 to get all the necessary
information, customized to scheduler hierarchies, etc.
However, I think we need to push more information to the POD/namespace if
possible. From early YuniKorn users, we found that they are used to running
{{describe pod}} to understand why an allocation failed. At least with the
default scheduler, users can get pretty good insight into why an allocation
failed. I think we should try to make the experience more native for users
running YuniKorn on K8s; the lack of it could be a show-stopper for many users.
bq. The queue is out of resources, the user limit has been reached or even the
maximum number of applications that can be run for a user is reached. That kind
of information does not translate or help on the K8s side.
I think it is possible: we just need to go into the queue/user and populate
this information into the event recorder. It is not straightforward, but it
should be doable.
bq. Every event that we push back will be stored in etcd. I am worried about
the amount of data we will push back to etcd if we are not careful.
We definitely need to test this more. It should not be significantly more than
what the existing K8s event system does, since we have a throttling mechanism
built in.
bq. Can we also guarantee that the YuniKorn admin can always describe all the
pods and namespaces? That would require the admin to have high level access to
the K8s cluster which he might not have. Something we need to keep in mind.
A lot of things should be done by the end user of the K8s cluster rather than
the admin. Ideally, the admin should use our UI to get more details if pod
describe doesn't have enough information. (Of course, that needs additional UI
work.)
> Better to support POD events for YuniKorn to troubleshoot allocation failures
> -----------------------------------------------------------------------------
>
> Key: YUNIKORN-42
> URL: https://issues.apache.org/jira/browse/YUNIKORN-42
> Project: Apache YuniKorn
> Issue Type: Task
> Reporter: Wangda Tan
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Troubleshooting pod allocation is currently tricky; we need to better expose
> this information in the POD description.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)