I'm sorry, I don't understand this. The behavior you are describing ("The FIFO
order honors the old app “A” order") is the behavior I was expecting, but not
what I'm seeing. I would think that means that any pod with appID A should
always be scheduled before a pod with appID B, since application A was created
first. But I'm seeing that sometimes pods with appID A get scheduled after pods
with appID B.
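To make sure I'm reading this right, here is a minimal toy model (my own sketch, not YuniKorn code) of FIFO application sorting with applicationID reuse as described in the quoted reply: a pod submitted later under an existing appID attaches to the old application and inherits its original creation-time position.

```python
# Toy model of FIFO application ordering with applicationID reuse.
# This is an illustrative sketch, NOT YuniKorn's actual implementation.
import itertools

_clock = itertools.count()

class App:
    def __init__(self, app_id):
        self.app_id = app_id
        self.created = next(_clock)   # FIFO sort key: creation order
        self.pending = []             # pods waiting to be scheduled

class Scheduler:
    def __init__(self):
        self.apps = {}  # appID -> App; apps are never removed on completion

    def submit(self, app_id, pod):
        # Reusing an appID attaches the pod to the OLD App object,
        # so it inherits that app's original FIFO position.
        if app_id not in self.apps:
            self.apps[app_id] = App(app_id)
        self.apps[app_id].pending.append(pod)

    def next_pod(self):
        # Schedule from the earliest-created app that has pending pods.
        candidates = [a for a in self.apps.values() if a.pending]
        if not candidates:
            return None
        return min(candidates, key=lambda a: a.created).pending.pop(0)

s = Scheduler()
s.submit("A", "A-1")   # app "A" created first
s.submit("B", "B-1")   # app "B" created second
print(s.next_pod())    # A-1 (app A sorts first)
print(s.next_pod())    # B-1
s.submit("B", "B-2")
s.submit("A", "A-2")   # appID "A" reused much later...
print(s.next_pod())    # A-2 (...but still sorts ahead of B-2)
```

Under this model, app A's pods always win as long as A has pods pending, which is exactly what I expected; my question is why the real scheduler doesn't behave this way.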
> -----Original Message-----
> From: Weiwei Yang <[email protected]>
> Sent: Saturday, February 6, 2021 7:36 PM
> To: [email protected]; [email protected]
> Subject: RE: Unexpected scheduler behavior with K8s
>
> Hi Patrick
>
> I assume you are using a YuniKorn 0.9 build; we have a limitation in this
> version. When a job finishes, the scheduler does not mark the application as
> completed. This is because in K8s we have nowhere to track app state;
> YuniKorn tracks it purely by grouping pods by the applicationID label. That
> means if you submit pods with appID “A”, the pods finish, and some time
> later you submit pods with appID “A” again, the scheduler still considers
> them part of the old app “A”. The FIFO order honors the old app “A”’s
> position. This is probably why you see a different result after a restart.
>
> In the upcoming release, we are trying to fix this
> in https://issues.apache.org/jira/browse/YUNIKORN-483. We haven’t tried K3s
> before, but personally I don’t think that matters.
>
> Hope this helps
> Thanks
>
> --
> Weiwei
> On Feb 5, 2021, 6:56 PM -0800, Patrick, Alton (US)
> <[email protected]>, wrote:
> > Thanks for your answer. You are right -- this is a single-node cluster,
> > and I was just relying on the limits on the node. I was not setting a
> > queue quota.
> >
> > You are also right that when I use annotations on the namespace to set
> > queue quotas, it works as expected: A-3 gets scheduled before B-1.
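In case it helps others, the annotations I set look roughly like this (a sketch; the annotation keys are the ones described in the resource quota doc linked earlier in this thread, and they may differ in other YuniKorn versions, so verify against your release):

```yaml
# Sketch of a namespace with a YuniKorn queue quota set via annotations.
# Annotation keys follow the 0.x-era resource_quota_management doc;
# check them against your YuniKorn version before relying on this.
apiVersion: v1
kind: Namespace
metadata:
  name: sandbox
  annotations:
    yunikorn.apache.org/namespace.max.cpu: "4"
    yunikorn.apache.org/namespace.max.memory: "2G"
```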
> >
> > While trying this some more, I also noticed that even when I do *not* set
> > a queue quota, the order is as expected (A-3, then B-1) after restarting
> > YuniKorn. But on subsequent runs, the order is always B-1, then A-3.
> >
> > Is it required to have queue quotas to make this work in the expected way?
> > Or should it also work if resources are only constrained at the node or
> > cluster level?
> >
> > I should probably mention: I am doing this in a cluster created with
> > Rancher 2.5.5 and running k3s v1.18.8+k3s1. I don’t know if that matters
> > or not.
> >
> > > -----Original Message-----
> > > From: Weiwei Yang <[email protected]>
> > > Sent: Friday, February 5, 2021 4:47 PM
> > > To: [email protected]
> > > Subject: Re: Unexpected scheduler behavior with K8s
> > >
> > > Hi Patrick
> > >
> > > Thanks for reaching out.
> > > I have one question about how you limit it to only 2 pods running at a
> > > time: are you running this on a single-node cluster and limiting at the
> > > node resource level?
> > > Based on the use case you described, my guess is that the scheduler
> > > creates reservations for B-1 and A-3, and when these reservations are
> > > unreserved, the FIFO ordering is not strictly honored.
> > > Have you set any queue quota? Here is the doc on how to set up the
> > > namespace-to-queue mapping with a quota:
> > > http://yunikorn.apache.org/docs/next/user_guide/resource_quota_management#namespace-to-queue-mapping
> > > If we limit the running pods by quota, then I think we will see the
> > > expected behavior.
> > > The document about the application sorting policy is here:
> > > http://yunikorn.apache.org/docs/next/user_guide/sorting_policies#application-sorting
> > >
> > > Weiwei
> > >
> > > On Fri, Feb 5, 2021 at 1:04 PM Patrick, Alton (US)
> > > <[email protected]> wrote:
> > >
> > > > Hi. I have just started using YuniKorn for scheduling with K8s. I have
> > > > been doing some simple experiments to make sure I understand how it
> > > > works. Most of them are working as expected, but there is one I don't
> > > > understand.
> > > >
> > > > I used Helm to deploy Yunikorn and I did not modify anything, so I have
> > > > the default setup: one queue per namespace, and I assume the application
> > > > sort policy is the default of "FifoSortPolicy."
> > > >
> > > > I create four pods. All of them have the same resource requests (2
> > > > cores, 1Gi mem), sized so that only two can run at a time. The pods
> > > > are created in this order, with a 1-second gap between each:
> > > >
> > > > 1. A-1, applicationId = A, sleeps for 10s
> > > > 2. A-2, applicationId = A, sleeps for 5s
> > > > 3. B-1, applicationId = B, sleeps for 5s
> > > > 4. A-3, applicationId = A, sleeps for 5s
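For reference, each pod manifest looks roughly like the sketch below (names and values are from my experiment; the `applicationId` label is what YuniKorn groups pods by, and `schedulerName` routes the pod to YuniKorn):

```yaml
# Sketch of pod A-1; the other pods differ only in name, applicationId
# label, and sleep duration.
apiVersion: v1
kind: Pod
metadata:
  name: a-1
  labels:
    applicationId: "A"
spec:
  schedulerName: yunikorn
  restartPolicy: Never
  containers:
    - name: sleep
      image: alpine:latest
      command: ["sleep", "10"]
      resources:
        requests:
          cpu: "2"
          memory: 1Gi
```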
> > > >
> > > > What I expect to see is:
> > > > * A-1 is scheduled
> > > > * A-2 is scheduled
> > > > * A-2 finishes
> > > > * A-3 is scheduled (because A is the first application created, and my
> > > > understanding is that as long as there are pods in the queue for
> > > > application A, they should have priority over pods for application B)
> > > >
> > > > What I see instead is that after A-2 finishes, B-1 gets scheduled to
> > > > run.
> > > >
> > > > Is this the expected behavior? If so, can someone explain what is
> > > > wrong with my understanding?
> > > >
> > > > Additionally, in the logs for the scheduler pod, right after pod B-1
> > > > gets scheduled, I see the following messages repeated thousands of
> > > > times very fast (over 2000 instances in about 0.25s, according to the
> > > > timestamps in the log). Is this normal?
> > > >
> > > > 2021-02-05T20:25:34.479Z DEBUG scheduler/scheduling_application.go:641
> > > > skipping node for allocation: basic condition not satisfied {"node":
> > > > "local-node", "allocationKey": "258d9947-e92b-4967-9758-08eee62f4d1b",
> > > > "error": "pre alloc check: requested resource map[memory:1074 vcore:2000]
> > > > is larger than currently available map[ephemeral-storage:50977832921
> > > > hugepages-2Mi:0 memory:13250 pods:110 vcore:1600] resource on local-node"}
> > > > 2021-02-05T20:25:34.479Z DEBUG scheduler/scheduling_node.go:271 requested
> > > > resource is larger than currently available node resources {"nodeID":
> > > > "local-node", "requested": "map[memory:1074 vcore:2000]", "available":
> > > > "map[ephemeral-storage:50977832921 hugepages-2Mi:0 memory:13250 pods:110
> > > > vcore:1600]"}
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
> > > >
> > > >
> >