[ 
https://issues.apache.org/jira/browse/YUNIKORN-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397066#comment-17397066
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-704:
------------------------------------------------

{quote} # Support to schedule daemonset pods onto "unschedulable" nodes, since 
this is a behavior being documented by the default-scheduler, it's better to be 
consistent with this so it will be easier for the users to replace the default 
scheduler with yunikorn. The doc was: "The default scheduler ignores 
unschedulable Nodes when scheduling DaemonSet Pods."{quote}
That is only part of the change. A daemon set pod belongs on one specific node, 
not just on any unschedulable node.
{quote} # we can have 2 sub-tasks: 1) for the core side, to support schedule 
containers onto unschedulable node when the container has certain attribute 
attached; 2) discover if a pod belongs to a daemonset and make sure that info 
is passed to the core through the scheduler interface.{quote}
As part of the changes committed for this functionality, an extra annotation 
was added so the default scheduler can do the placement work. The daemon set 
controller adds a special {{nodeSelectorTerms}} entry to the pod which defines 
exactly which node should be chosen (1).
Simply ignoring the fact that the node is marked as unschedulable is therefore 
not going to do the correct thing for a daemon set. We should also leverage the 
node information that is set on the pod.
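For illustration, here is a minimal, self-contained Go sketch of how a scheduler could read that pinning information from the pod. The struct names mirror the upstream k8s.io/api/core/v1 fields but are trimmed-down stand-ins, and {{requiredNode}} is a hypothetical helper, not existing YuniKorn code:

```go
package main

import "fmt"

// Trimmed stand-ins for the Kubernetes API types (the real ones live in
// k8s.io/api/core/v1); only the fields relevant here are kept.
type NodeSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

type NodeSelectorTerm struct {
	MatchFields []NodeSelectorRequirement
}

type Pod struct {
	Name              string
	NodeSelectorTerms []NodeSelectorTerm
}

// requiredNode returns the node name the daemon set controller pinned the pod
// to, if any. The controller sets a matchFields requirement on metadata.name
// with exactly one value (see kubernetes/kubernetes#59194).
func requiredNode(p *Pod) (string, bool) {
	for _, term := range p.NodeSelectorTerms {
		for _, req := range term.MatchFields {
			if req.Key == "metadata.name" && req.Operator == "In" && len(req.Values) == 1 {
				return req.Values[0], true
			}
		}
	}
	return "", false
}

func main() {
	pod := &Pod{
		Name: "fluent-bit-abc12",
		NodeSelectorTerms: []NodeSelectorTerm{{
			MatchFields: []NodeSelectorRequirement{{
				Key:      "metadata.name",
				Operator: "In",
				Values:   []string{"node-7"},
			}},
		}},
	}
	if node, ok := requiredNode(pod); ok {
		fmt.Printf("daemon set pod %s is pinned to node %s\n", pod.Name, node)
	}
}
```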

At this point we have a choice to make:

The first approach is to generically consider every node for every pod and 
ignore the unschedulable flag in the core (a really simple change). We then 
rely on the taints and tolerations set on the pod to prevent placing it on an 
unschedulable node.

The other approach is to leverage all the information set on these daemon set 
pods and short circuit their placement: schedule them only on the specific 
node, independent of the unschedulable flag, and leave the normal scheduling 
cycle as is, still excluding unschedulable nodes.
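A minimal sketch of what that short circuit could look like in the core's node filtering, under the assumption that the pinned node has already been extracted from the pod's {{nodeSelectorTerms}} (the {{candidateNodes}} function and field names are hypothetical, not the actual YuniKorn API):

```go
package main

import "fmt"

type Node struct {
	Name          string
	Unschedulable bool
}

type Pod struct {
	Name string
	// RequiredNode is non-empty when the daemon set controller pinned the
	// pod to a specific node via nodeSelectorTerms.
	RequiredNode string
}

// candidateNodes sketches the second approach: a pinned daemon set pod is
// matched only against its required node, ignoring the unschedulable flag;
// all other pods go through the normal cycle, which still excludes
// unschedulable nodes.
func candidateNodes(p *Pod, nodes []*Node) []*Node {
	var out []*Node
	for _, n := range nodes {
		if p.RequiredNode != "" {
			if n.Name == p.RequiredNode {
				out = append(out, n) // unschedulable flag deliberately ignored
			}
			continue
		}
		if !n.Unschedulable {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []*Node{{"node-1", false}, {"node-2", true}}
	ds := &Pod{Name: "fluent-bit-x", RequiredNode: "node-2"}
	normal := &Pod{Name: "web-1"}
	// The pinned pod lands on node-2 even though it is unschedulable;
	// the normal pod only sees node-1.
	fmt.Println(len(candidateNodes(ds, nodes)), len(candidateNodes(normal, nodes)))
}
```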

I think the current approach is a half-and-half solution: it adds a lot of 
change yet delivers little more than the first approach, without the added 
functionality we would get from the second approach.

(1) https://github.com/kubernetes/kubernetes/issues/59194

> [Umbrella] Use the same mechanism to schedule daemon set pods as the default 
> scheduler
> --------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-704
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-704
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: shim - kubernetes
>            Reporter: Chaoran Yu
>            Assignee: Ting Yao,Huang
>            Priority: Blocker
>             Fix For: 1.0.0
>
>         Attachments: fluent-bit-describe.yaml, fluent-bit.yaml
>
>
> We sometimes see DaemonSet pods fail to be scheduled. Please see attached 
> files for the YAML and _kubectl describe_ output of one such pod. We 
> originally suspected [node 
> reservation|https://github.com/apache/incubator-yunikorn-core/blob/v0.10.0/pkg/scheduler/context.go#L41]
>  was to blame. But even after setting the DISABLE_RESERVATION environment 
> variable to true, we still see such scheduling failures. The issue is 
> especially severe when K8s nodes have disk pressure that causes lots of pods 
> to be evicted. Newly created pods will stay in pending forever. We have to 
> temporarily uninstall YuniKorn and let the default scheduler do the 
> scheduling for these pods. 
> This issue is critical because lots of important pods belong to a DaemonSet, 
> such as Fluent Bit, a common logging solution. This is probably the last 
> remaining roadblock for us to have the confidence to have YuniKorn entirely 
> replace the default scheduler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
