[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures

Weiwei Yang (Jira) Mon, 27 Apr 2020 21:36:55 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094145#comment-17094145
 ]


Weiwei Yang commented on YUNIKORN-42:
-------------------------------------

Hi [~adam.antal]

Thanks for working on this!!

I went through the previous discussions and the design doc, and I think there 
are some key areas we need to sort out.

1) expose via the rest server or flush to k8s event system

The design addresses both, which makes sense. But which one is more important, 
and which one will come first? In my opinion it is the k8s event system, 
because that is how users consume this information. The key purpose of this 
Jira is not to make our life easier; we need to make users' lives easier. That 
is, by describing pods/nodes, users should be able to understand, e.g., 90% of 
the reasons why a pod is not allocated.

2) the cache

The key problem is the cache: how do we build an efficient one? The scheduler 
can push events (or records) to this cache, and the cache can be queried (via 
REST) or periodically flushed (to the k8s event system).

3) aggregate records

When the scheduler pushes events/records to the cache, duplicate records should 
be aggregated. It is therefore important to design the schema of each record 
so that we can aggregate them properly. For example, when we try to assign a 
pod, it may fail again and again in the scheduler loop; in that case, we would 
say "pod is unable to be allocated due to xxx reason, N times in past X 
seconds".
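
To make points 2) and 3) concrete, here is a rough sketch in Go of what such a 
cache could look like. It is only an illustration under assumptions, not the 
design from the doc: the names RecordKey, Record, EventCache, Push, Query and 
Flush are all hypothetical, the record schema (object ID plus reason) is a 
guess, and "past X seconds" is approximated as the time between the first and 
last occurrence.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// RecordKey is a hypothetical record schema: two records with the same
// object and reason are treated as duplicates and aggregated.
type RecordKey struct {
	ObjectID string // e.g. "default/pod-1"
	Reason   string // e.g. "InsufficientResources"
}

// Record is one aggregated entry in the cache.
type Record struct {
	Key       RecordKey
	Count     int
	FirstSeen time.Time
	LastSeen  time.Time
}

// Message renders the aggregated form "N times in past X seconds".
func (r Record) Message() string {
	secs := int(r.LastSeen.Sub(r.FirstSeen).Seconds())
	return fmt.Sprintf("%s is unable to be allocated due to %s, %d times in past %d seconds",
		r.Key.ObjectID, r.Key.Reason, r.Count, secs)
}

// EventCache aggregates records pushed by the scheduler (assumed design).
type EventCache struct {
	mu      sync.Mutex
	now     func() time.Time // clock, replaceable for demos and tests
	records map[RecordKey]*Record
}

func NewEventCache() *EventCache {
	return &EventCache{now: time.Now, records: make(map[RecordKey]*Record)}
}

// Push adds a record, or bumps the counter if the same key already exists.
func (c *EventCache) Push(key RecordKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	ts := c.now()
	if r, ok := c.records[key]; ok {
		r.Count++
		r.LastSeen = ts
		return
	}
	c.records[key] = &Record{Key: key, Count: 1, FirstSeen: ts, LastSeen: ts}
}

// Query returns copies of the records for one object (the REST path).
// The order of the result is not defined.
func (c *EventCache) Query(objectID string) []Record {
	c.mu.Lock()
	defer c.mu.Unlock()
	var out []Record
	for k, r := range c.records {
		if k.ObjectID == objectID {
			out = append(out, *r)
		}
	}
	return out
}

// Flush returns all records and empties the cache (the k8s event path).
func (c *EventCache) Flush() []Record {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]Record, 0, len(c.records))
	for _, r := range c.records {
		out = append(out, *r)
	}
	c.records = make(map[RecordKey]*Record)
	return out
}

func main() {
	cache := NewEventCache()
	// Fake clock that advances 10 seconds on every call, for the demo only.
	tick := time.Unix(0, 0)
	cache.now = func() time.Time { tick = tick.Add(10 * time.Second); return tick }

	key := RecordKey{ObjectID: "default/pod-1", Reason: "InsufficientResources"}
	for i := 0; i < 3; i++ {
		cache.Push(key) // same failure repeated in the scheduler loop
	}
	for _, r := range cache.Flush() {
		fmt.Println(r.Message())
	}
}
```

In a real implementation, Flush would be called from a periodic goroutine that 
turns each record into a k8s event on the pod, and Query would back the REST 
endpoint. A plain mutex-guarded map is the simplest choice here; whether it is 
efficient enough for the scheduler loop is exactly the open question above.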

> Better to support POD events for YuniKorn to troubleshoot allocation failures
> -----------------------------------------------------------------------------
>
>                 Key: YUNIKORN-42
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-42
>             Project: Apache YuniKorn
>          Issue Type: Task
>            Reporter: Wangda Tan
>            Assignee: Adam Antal
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now it is tricky to do troubleshoot for pod allocation, we need better expose 
> this information to POD description.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
