craigcondit commented on code in PR #471:
URL: https://github.com/apache/yunikorn-site/pull/471#discussion_r1774080537


##########
docs/design/foreign_pod_tracking.md:
##########
@@ -0,0 +1,326 @@
+---
+id: foreign_pod_tracking
+title: Tracking non-Yunikorn pods
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Introduction
+
+Currently, we only know about pods in the scheduler core which have been 
scheduled by Yunikorn. Resource calculation which decides whether a pod fits a 
node or not depends on a field called `occupiedResources`. This is first 
calculated in the shim then sent to the core.
+
+This is not ideal for a couple of reasons:
+1. Pod count is not maintained
+2. The state dump and UI only contain Yunikorn pods
+3. The occupied field is not relevant inside the shim, only in the core
+
+If we have issues regarding available resources, it’s very difficult to figure 
out what happened. With adequate logging and proper pod tracking, 
troubleshooting is easier.
+
+## Proposed changes
+
+### K8s shim
+
+**High level overview**
+
+In the shim, we stop tracking the total amount of resources occupied by 
non-Yunikorn pods
+(aka. "occupied resources"). Instead, whenever a new foreign pod is allocated 
or deleted, we
+forward this information to the scheduler core. 
+The allocations not scheduled by YuniKorn will be accounted for on the node 
but not as part
+of a queue or application. Allocations will be marked by the shim as scheduled 
by a different
+scheduler by adding a new entry to the `AllocationTags` map inside the 
`si.Allocation` type.
+
+**State initialization**
+
+Registering existing pods got significantly simpler after YUNIKORN-2180 and 
YUNIKORN-2779.
+`NodeInfo.ExistingAllocations` no longer exists.
+
+No change is needed. Pods that are already assigned to a node are registered 
via the
+`Context.AddPod()` callback method. When Yunikorn is initializing the state, 
existing pods are
+seen as new ones. From the perspective of the shim, it doesn’t matter if it 
got created after
+the startup or it had already existed before.
+
+**Node updates**
+
+Current: we examine the old and new available resources based on the 
`Node.NodeStatus.Allocatable` field.
+If they are different, we update the scheduler cache with 
`SchedulerCache.UpdateCapacity()` and send the
+new value to the core.
+
+Proposed change: we continue to send NodeInfo events, but we will stop using 
the `OccupiedResource` field.
+
+**Foreign pod node assignment**
+
+Current: the shim calculates the occupied resources for the given node. This 
happens in the
+`Context.updateNodeOccupiedResources()` method. First, the scheduler cache is 
updated with
+`SchedulerCache.UpdateOccupiedResource()`. Then we send the new available 
amount to the core
+with `SchedulerAPI.UpdateNode()` which also sends a `NodeInfo` event.
+
+Proposed change: instead of tracking just the resources occupied by all pods 
not scheduled
+via YuniKorn on a node, send all pods to the core as allocations.
+
+**Foreign pod deletion**
+
+Current: same as the update case.
+
+Proposed change: send an `AllocationReleasesRequest` (embedded in a 
`AllocationRequest`) to the core.
+
+**Preemption considerations**
+
+If we want to support preemption of foreign pods, we also have to know if it 
is a static pod. A
+static pod is managed directly by the kubelet and cannot be controlled from 
the API server.
+Therefore they cannot become preemption targets.
+The owner reference of a static pod is "Node". When an allocation is added, 
this has to be
+indicated explicitly in the SI object. The `si.Allocation` type has an 
`AllocationTags` map
+which stores metadata about the pod. In the helper function 
`CreateTagsForTask()`, the
+owner reference is checked if this is a static pod, then an appropriate entry 
to the tags is
+added.
+
+**Overview of code changes**
+- Delete `Context.updateNodeOccupiedResources()`
+- `Context.updateForeignPod()` - remove reference to
+`ctx.updateNodeOccupiedResources()` and unify it with 
`Context.updateYunikornPod()`
+- `Context.deleteForeignPod()` - remove reference to
+`ctx.updateNodeOccupiedResources()` and
+unify it with `Context.deleteYunikornPod()`
+- Delete `SchedulerCache.UpdateCapacity()`,
+`SchedulerCache.UpdateOccupiedResource()` and the related fields: 
`nodeOccupied`,
+`nodeCapacity`
+- `Context.updateNodeInternal()` - remove call to 
`schedulerCache.UpdateCapacity()`
+
+### Scheduler interface
+
+**Node object**
+
+The following type is sent to the core when a node update occurs:
+```
+type NodeInfo struct {
+  NodeID               string
+  Action               NodeInfo_ActionFromRM
+  Attributes           map[string]string
+  SchedulableResource *Resource
+  OccupiedResource    *Resource
+}
+```
+
+`SchedulableResource` represents the capacity of the node, while 
`OccupiedResource` tells
+how much is available for pods.
+
+Proposal: remove the `OccupiedResource` field because it won’t be used anymore.
+
+**Event system**
+
+With the introduction of tracking all pods in the cluster, now we can generate 
events about
+each individual allocations. Events for Yunikorn pods already exist: 
`APP_ALLOC` application
+event and `NODE_ALLOC` node event. We won’t use the `APP_ALLOC` because we 
can’t tie the
+non-Yunikorn allocation to an existing application.
+Proposal: still generate `NODE_OCCUPIED` in the next release (Yunkorn 1.7), 
but announce the
+deprecation of this event. It will be removed in Yunikorn 1.8.
+We will also generate `NODE_ALLOC` for non-Yunikorn allocations, marking it 
foreign in the
+"message" field.
+
+|Event type|Event ID|State|When is it generated|
+|--|--|--|--|
+|Node|NODE_OCCUPIED|Existing - to be removed in 1.8|Occupied resource changes 
on a node|
+|Node|NODE_ALLOC|Already exists for Yunikorn allocations|Allocation (node 
assignment) occurs|
+
+**Constants**
+
+One constans will be introduced as a key for application tags. All pods 
scheduled by a
+different scheduler than YuniKorn will have the tag `foreign`.
+Values for the foreign tag are:
+- `static`: indicates a static pod, hard skip for preemption
+- `default`: preemptable pod
+
+### Scheduler core
+
+**Scheduling context**
+
+Processing updates happens in `ClusterContext.updateNode()`. There is a branch 
for
+`NodeInfo_UPDATE`:
+
+```
+case si.NodeInfo_UPDATE:
+  if sr := nodeInfo.SchedulableResource; sr != nil {
+    
partition.updatePartitionResource(node.SetCapacity(resources.NewResourceFromProto(sr)))
+  }
+
+  if or := nodeInfo.OccupiedResource; or != nil {
+    node.SetOccupiedResource(resources.NewResourceFromProto(or))
+  }
+```
+
+Proposed changes: remove the path `or != nil`, because that field won’t be 
used.
+
+**Allocation object**
+
+The `objects.Allocations` type needs to be extended. The scheduler core needs 
to know if
+the allocation was scheduled by Yunikorn or not and if we want to support 
preemption, we
+have to mark allocations which represent static pods. These will be determined 
from the
+allocation tags explained in the K8s shim part.
+
+Two flags (boolean) to be added to the objects.Allocations that can only be 
set on
+creation. The flags can be read but never modified.
+
+**Partition**
+
+After YUNIKORN-2457, adding foreign allocations became much simpler. Some minor
+changes inside `PartitionContext.UpdateAllocation()` are necessary to ignore 
the
+application lookup for foreign pods.
+
+We might track foreign pods as part of an application in the future but that 
is out of scope of
+this change.
+
+
+**REST Interface**
+
+The way how allocations are counted gives the user more detailed information 
about the
+resource usage of all pods. Now we can show individual foreign allocations, 
not just a single
+counter.
+
+In Yunikorn 1.7, the data in the REST output will not change. We can still 
calculate
+OccupiedResources information per node. We also announce that 
AllocatedResources is
+about to change, because the value will be based on all pods on the node.
+
+From Yunikorn 1.8 onwards, we remove OccupiedResources and update the 
calculation of
+`AllocatedResources`.
+
+To support easier integration with the web UI, a separate list will be 
maintained for the two
+types of allocations. Allocations made by Yunikorn are still available under 
`allocations`.
+Non-Yunikorn allocations will be listed under `foreign_allocations`.

Review Comment:
   Nit: Let's not use underscores in fields: `foreignAllocations` vs. 
`foreign_allocations`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to