Hi Patrick

I assume you are using the YuniKorn 0.9 build; we have a limitation in this
version. When a job finishes, the scheduler does not mark the application as
completed. This is because in K8s there is nowhere to track app state, so
YuniKorn tracks it purely by grouping pods on the applicationID label. That
means if you submit pods with appID "A", those pods finish, and some time
later you submit pods with appID "A" again, the scheduler still considers
them part of the old app "A". The FIFO order then honors the old app "A"'s
position. This is probably why you see a different result after a restart.
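For example, these two pods end up in the same app "A" even if the first one
finished long before the second was submitted (a minimal sketch; the pod
names and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: sleep-old              # finished some time ago
  labels:
    applicationId: "A"         # the label the scheduler groups on
spec:
  schedulerName: yunikorn
  restartPolicy: Never
  containers:
    - name: main
      image: alpine:latest
      command: ["sleep", "10"]
---
apiVersion: v1
kind: Pod
metadata:
  name: sleep-new              # submitted later with the same label,
  labels:                      # so it joins the old app "A"
    applicationId: "A"
spec:
  schedulerName: yunikorn
  restartPolicy: Never
  containers:
    - name: main
      image: alpine:latest
      command: ["sleep", "10"]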
In the upcoming release, we are trying to fix this in
https://issues.apache.org/jira/browse/YUNIKORN-483.
We haven't tried K3s before, but personally I don't think it matters.
Hope this helps

Thanks
--
Weiwei

On Feb 5, 2021, 6:56 PM -0800, Patrick, Alton (US) <[email protected]>, wrote:
> Thanks for your answer. You are right -- this is a single-node cluster, and I
> was just relying on the limits on the node. I was not setting a queue quota.
>
> You are also right that when I use annotations on the namespace to set queue
> quotas, it works as expected: A-3 gets scheduled before B-1.
>
> While trying this some more, I also noticed that even when I do *not* set a
> queue quota, the order is as expected (A-3, then B-1) after restarting
> YuniKorn. But on subsequent runs, the order is always B-1, then A-3.
>
> Is it required to have queue quotas to make this work in the expected way? Or
> should it also work if resources are only constrained at the node or cluster
> level?
>
> I should probably mention: I am doing this in a cluster created with Rancher
> 2.5.5 and running k3s v1.18.8+k3s1. I don't know if that matters or not.
>
> > -----Original Message-----
> > From: Weiwei Yang <[email protected]>
> > Sent: Friday, February 5, 2021 4:47 PM
> > To: [email protected]
> > Subject: Re: Unexpected scheduler behavior with K8s
> >
> > Hi Patrick
> >
> > Thanks for reaching out.
> > I have one question about how you limit it to only 2 pods running at a
> > time. Are you running this on a single-node cluster and limiting that at
> > the node resource level?
> > Based on the use case you described, my guess is the scheduler creates
> > reservations for B-1 and A-3, and when we unreserve these reservations,
> > the FIFO ordering is not strictly honored.
> > Have you set any queue quota? Here is the doc about how to set the queue
> > mapping with a quota:
> > http://yunikorn.apache.org/docs/next/user_guide/resource_quota_management#namespace-to-queue-mapping
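> > The annotations look roughly like this (a sketch based on that doc;
> > please check the doc for the exact keys and values your version
> > supports, the namespace name and quota values here are placeholders):
> >
> > apiVersion: v1
> > kind: Namespace
> > metadata:
> >   name: test
> >   annotations:
> >     yunikorn.apache.org/namespace.max.cpu: "4"       # queue quota: 4 vcores
> >     yunikorn.apache.org/namespace.max.memory: "4Gi"  # queue quota: 4Gi memory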
> > If we limit the running pods by quota, then I think we will see the
> > expected behavior.
> > The document about the application sorting policy is here:
> > http://yunikorn.apache.org/docs/next/user_guide/sorting_policies#application-sorting
> >
> > Weiwei
> >
> > On Fri, Feb 5, 2021 at 1:04 PM Patrick, Alton (US)
> > <[email protected]> wrote:
> >
> > > Hi. I have just started using YuniKorn for scheduling with K8s. I have
> > > been doing some simple experiments to make sure I understand how it
> > > works. Most of them are working as expected, but there is one I don't
> > > understand.
> > >
> > > I used Helm to deploy YuniKorn and I did not modify anything, so I have
> > > the default setup: one queue per namespace, and I assume the application
> > > sort policy is the default of "FifoSortPolicy."
> > >
> > > I create four pods. All of them have the same resource requests (2
> > > cores, 1Gi mem), sized so that only two can run at a time. The pods are
> > > created in this order, with a 1-second gap between each:
> > >
> > > 1. A-1, applicationId = A, sleeps for 10s
> > > 2. A-2, applicationId = A, sleeps for 5s
> > > 3. B-1, applicationId = B, sleeps for 5s
> > > 4. A-3, applicationId = A, sleeps for 5s
> > >
> > > What I expect to see is:
> > > * A-1 is scheduled
> > > * A-2 is scheduled
> > > * A-2 finishes
> > > * A-3 is scheduled (because A is the first application created; as long
> > >   as there are pods in the queue for application A, I understand that
> > >   they should have priority over pods for application B)
> > >
> > > What I see instead is that after A-2 finishes, B-1 gets scheduled to run.
> > >
> > > Is this the expected behavior, and if so, can someone explain what is
> > > wrong with my understanding?
> > >
> > > Additionally, in the logs for the scheduler pod, right after pod B-1
> > > gets scheduled, I see the following messages repeated thousands of times
> > > very fast (over 2000 instances in about 0.25s according to timestamps in
> > > the log). Is this normal?
> > >
> > > 2021-02-05T20:25:34.479Z DEBUG scheduler/scheduling_application.go:641
> > > skipping node for allocation: basic condition not satisfied {"node":
> > > "local-node", "allocationKey": "258d9947-e92b-4967-9758-08eee62f4d1b",
> > > "error": "pre alloc check: requested resource map[memory:1074 vcore:2000]
> > > is larger than currently available map[ephemeral-storage:50977832921
> > > hugepages-2Mi:0 memory:13250 pods:110 vcore:1600] resource on local-node"}
> > > 2021-02-05T20:25:34.479Z DEBUG scheduler/scheduling_node.go:271 requested
> > > resource is larger than currently available node resources {"nodeID":
> > > "local-node", "requested": "map[memory:1074 vcore:2000]", "available":
> > > "map[ephemeral-storage:50977832921 hugepages-2Mi:0 memory:13250 pods:110
> > > vcore:1600]"}
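> > > In case it helps, each test pod looks roughly like this (a trimmed
> > > sketch; the name and image are placeholders, and the requests and
> > > sleep times follow the description above):
> > >
> > > apiVersion: v1
> > > kind: Pod
> > > metadata:
> > >   name: a-1
> > >   labels:
> > >     applicationId: "A"          # "B" for pod B-1
> > > spec:
> > >   schedulerName: yunikorn       # hand scheduling to YuniKorn
> > >   restartPolicy: Never
> > >   containers:
> > >     - name: main
> > >       image: alpine:latest
> > >       command: ["sleep", "10"]  # 5s for A-2, B-1, and A-3
> > >       resources:
> > >         requests:
> > >           cpu: "2"              # vcore:2000 in the log
> > >           memory: "1Gi"         # memory:1074 (MB) in the log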
