[
https://issues.apache.org/jira/browse/YUNIKORN-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397066#comment-17397066
]
Wilfred Spiegelenburg commented on YUNIKORN-704:
------------------------------------------------
{quote} # Support to schedule daemonset pods onto "unschedulable" nodes, since
this is a behavior being documented by the default-scheduler, it's better to be
consistent with this so it will be easier for the users to replace the default
scheduler with yunikorn. The doc was: "The default scheduler ignores
unschedulable Nodes when scheduling DaemonSet Pods."{quote}
That is only part of the change. A daemon set pod belongs on a specific node,
not just on any unschedulable node.
{quote} # we can have 2 sub-tasks: 1) for the core side, to support schedule
containers onto unschedulable node when the container has certain attribute
attached; 2) discover if a pod belongs to a daemonset and make sure that info
is passed to the core through the scheduler interface.{quote}
As part of the changes committed to have the default scheduler do the
placement work, extra scheduling information was added to the pod: the daemon
set controller adds a special {{nodeSelectorTerms}} to the pod which defines
exactly which node should be chosen (1).
So just ignoring the fact that the node is marked as unschedulable is not going
to do the correct thing for a daemon set. We should also leverage the node
information that is set on the pod.
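To illustrate the node information carried by the pod, here is a minimal sketch in Go. It assumes the daemon set controller pins the pod with a {{matchFields}} requirement on {{metadata.name}} inside the {{nodeSelectorTerms}}. The types are simplified, hypothetical stand-ins for the Kubernetes ones, and {{requiredNodeName}} is an illustrative helper, not an existing YuniKorn function:

```go
package main

import "fmt"

// NodeSelectorRequirement is a simplified stand-in for the Kubernetes type of
// the same name; only the fields needed for this sketch are modelled.
type NodeSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

// NodeSelectorTerm is a simplified stand-in holding only field matches.
type NodeSelectorTerm struct {
	MatchFields []NodeSelectorRequirement
}

// requiredNodeName (hypothetical helper) returns the node a pod is pinned to
// when a term contains a single-value "metadata.name In [...]" requirement.
func requiredNodeName(terms []NodeSelectorTerm) (string, bool) {
	for _, term := range terms {
		for _, req := range term.MatchFields {
			if req.Key == "metadata.name" && req.Operator == "In" && len(req.Values) == 1 {
				return req.Values[0], true
			}
		}
	}
	return "", false
}

func main() {
	// Example terms as they might appear on a daemon set pod (assumed shape).
	terms := []NodeSelectorTerm{{
		MatchFields: []NodeSelectorRequirement{{
			Key:      "metadata.name",
			Operator: "In",
			Values:   []string{"node-1"},
		}},
	}}
	name, ok := requiredNodeName(terms)
	fmt.Println(name, ok)
}
```

A pod without such a term yields no pinned node and would go through the normal scheduling cycle.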
At this point we have a choice to make:
Approach one is to generically consider every node for every pod and drop the
unschedulable flag check for nodes in the core (a really simple change). We
then rely on what is set in the taints and tolerations to prevent placing pods
on an unschedulable node.
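A minimal sketch in Go of what this first approach relies on. It assumes Kubernetes taints an unschedulable node with {{node.kubernetes.io/unschedulable:NoSchedule}} and that daemon set pods carry a matching toleration. The types and the matching rules are simplified, hypothetical stand-ins, not the Kubernetes or YuniKorn API:

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the Kubernetes types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether one toleration covers one taint.
// Simplified rules: an empty effect or key on the toleration matches any.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Key != "" && tol.Value == taint.Value
	}
	return false
}

// podFitsNode (hypothetical helper) checks that every NoSchedule or
// NoExecute taint on the node is tolerated by the pod.
func podFitsNode(taints []Taint, tolerations []Toleration) bool {
	for _, taint := range taints {
		if taint.Effect != "NoSchedule" && taint.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, tol := range tolerations {
			if tolerates(tol, taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	cordoned := []Taint{{Key: "node.kubernetes.io/unschedulable", Effect: "NoSchedule"}}
	daemonPod := []Toleration{{
		Key: "node.kubernetes.io/unschedulable", Operator: "Exists", Effect: "NoSchedule",
	}}
	fmt.Println(podFitsNode(cordoned, daemonPod)) // daemon set pod
	fmt.Println(podFitsNode(cordoned, nil))       // regular pod without tolerations
}
```

With this check in place the core would not need its own unschedulable flag: a regular pod is kept off the node by the taint, while a daemon set pod passes.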
The other approach is to leverage all pieces of information set for these
daemon set pods and short circuit the daemon set pod placement: only schedule
on the specific node, independent of the unschedulable flag, and leave the
normal cycle as is, still excluding unschedulable nodes.
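The second approach could be sketched in Go roughly as follows. This is an illustration under assumptions, not the YuniKorn core code: {{Node}} and {{candidateNodes}} are hypothetical names, and the pinned node name is assumed to have been extracted from the pod already:

```go
package main

import "fmt"

// Node is a hypothetical, minimal view of a node as seen by the scheduler.
type Node struct {
	Name          string
	Unschedulable bool
}

// candidateNodes (hypothetical helper) returns the nodes a pod may be placed on.
// A pinned pod (daemon set) only gets its specific node, independent of the
// unschedulable flag. Any other pod goes through the normal cycle, which
// excludes unschedulable nodes.
func candidateNodes(nodes []Node, pinnedNode string) []Node {
	var out []Node
	for _, n := range nodes {
		if pinnedNode != "" {
			if n.Name == pinnedNode {
				out = append(out, n)
			}
			continue
		}
		if !n.Unschedulable {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "node-1"},
		{Name: "node-2", Unschedulable: true},
	}
	fmt.Println(candidateNodes(nodes, "node-2")) // daemon set pod pinned to node-2
	fmt.Println(candidateNodes(nodes, ""))       // regular pod, normal cycle
}
```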
I think the current approach is a half & half solution: it adds a lot of
change for really nothing more than what the first approach gives us, without
the added functionality we would get from the second approach.
(1) https://github.com/kubernetes/kubernetes/issues/59194
> [Umbrella] Use the same mechanism to schedule daemon set pods as the default
> scheduler
> --------------------------------------------------------------------------------------
>
> Key: YUNIKORN-704
> URL: https://issues.apache.org/jira/browse/YUNIKORN-704
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: shim - kubernetes
> Reporter: Chaoran Yu
> Assignee: Ting Yao,Huang
> Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: fluent-bit-describe.yaml, fluent-bit.yaml
>
>
> We sometimes see DaemonSet pods fail to be scheduled. Please see attached
> files for the YAML and _kubectl describe_ output of one such pod. We
> originally suspected [node
> reservation|https://github.com/apache/incubator-yunikorn-core/blob/v0.10.0/pkg/scheduler/context.go#L41]
> was to blame. But even after setting the DISABLE_RESERVATION environment
> variable to true, we still see such scheduling failures. The issue is
> especially severe when K8s nodes have disk pressure that causes lots of pods
> to be evicted. Newly created pods will stay in pending forever. We have to
> temporarily uninstall YuniKorn and let the default scheduler do the
> scheduling for these pods.
> This issue is critical because lots of important pods belong to a DaemonSet,
> such as Fluent Bit, a common logging solution. This is probably the last
> remaining roadblock for us to have the confidence to have YuniKorn entirely
> replace the default scheduler.