Hi Patrick,

Sorry for the confusion. What I meant to say was the following case:

1. Submit job A
2. Submit job B
3. Both A and B run to completion (all pods terminated and deleted)
4. After some time:
5. Submit job B
6. Submit job A

At this point you would expect job B to get scheduled first. However, since the scheduler internally still thinks A and B are running (due to the limitation I mentioned in the previous comment), YuniKorn still schedules A first and B second. But if you restart the scheduler, I think you'll see the expected behavior, because the scheduler then sees B first and A second.

On Tue, Feb 9, 2021 at 9:02 AM Patrick, Alton (US) <[email protected]> wrote:

> I'm sorry, I don't understand this. The behavior you are describing ("The
> FIFO order honors the old app "A" order") is the behavior I was expecting,
> but not what I'm seeing. I would think that means that any pod with appID A
> should always be scheduled before a pod with appID B, since application A
> was created first. But I'm seeing that sometimes pods with appID A get
> scheduled after pods with appID B.
>
> > -----Original Message-----
> > From: Weiwei Yang <[email protected]>
> > Sent: Saturday, February 6, 2021 7:36 PM
> > To: [email protected]; [email protected]
> > Subject: RE: Unexpected scheduler behavior with K8s
> >
> > Hi Patrick
> >
> > I assume you are using the YuniKorn 0.9 build; we have a limitation in
> > this version. When a job finishes, the scheduler does not mark it as
> > completed accordingly. This is because in K8s there is nowhere to track
> > app state: YuniKorn tracks it purely by grouping pods on the
> > applicationID label. That means if you submit pods with appID "A", the
> > pods finish, and after some time you submit pods with appID "A" again,
> > the scheduler still considers them part of the old app "A". The FIFO
> > order honors the old app "A" position. This is probably the reason why
> > you see a different result after a restart.
> >
> > In the upcoming release, we are trying to fix this in
> > https://issues.apache.org/jira/browse/YUNIKORN-483. We haven't tried K3s
> > before, but personally I don't think that matters.
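[Editor's note: the stale-app limitation described above can be illustrated with a minimal sketch. This is not YuniKorn's actual code; the `Registry` class and its methods are hypothetical, modeling only the behavior Weiwei describes: apps are keyed by their applicationID label, so a resubmitted ID reuses the old entry and keeps its original FIFO position.]

```python
# Sketch (not YuniKorn source) of FIFO app ordering when finished apps
# are never removed from the scheduler's internal registry.
import itertools

_counter = itertools.count()

class Registry:
    def __init__(self):
        # applicationID -> first-seen submission order
        self.apps = {}

    def submit(self, app_id):
        # A resubmitted app_id keeps its old position: this models the
        # limitation, since the scheduler only groups pods by the
        # applicationID label and never marks the old app completed.
        self.apps.setdefault(app_id, next(_counter))

    def fifo_order(self):
        return sorted(self.apps, key=self.apps.get)

reg = Registry()
reg.submit("A")          # first run
reg.submit("B")
# ... both finish; pods are deleted, but the app entries remain ...
reg.submit("B")          # resubmission: "B" keeps position 1
reg.submit("A")          # resubmission: "A" keeps position 0
print(reg.fifo_order())  # -> ['A', 'B'], although B was resubmitted first
```

A restarted scheduler starts with an empty registry, which is why the resubmission order (B, then A) is honored only after a restart.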
> >
> > Hope this helps
> > Thanks
> >
> > --
> > Weiwei
> >
> > On Feb 5, 2021, 6:56 PM -0800, Patrick, Alton (US)
> > <[email protected]>, wrote:
> > > Thanks for your answer. You are right -- this is a single-node
> > > cluster, and I was just relying on the limits on the node. I was not
> > > setting a queue quota.
> > >
> > > You are also right that when I use annotations on the namespace to
> > > set queue quotas, it works as expected: A-3 gets scheduled before B-1.
> > >
> > > While trying this some more, I also noticed that even when I do *not*
> > > set a queue quota, the order is as expected (A-3, then B-1) after
> > > restarting Yunikorn. But on subsequent runs, the order is always B-1,
> > > then A-3.
> > >
> > > Is it required to have queue quotas to make this work in the expected
> > > way? Or should it also work if resources are only constrained at the
> > > node or cluster level?
> > >
> > > I should probably mention: I am doing this in a cluster created with
> > > Rancher 2.5.5 and running k3s v1.18.8+k3s1. I don't know if that
> > > matters or not.
> > >
> > > > -----Original Message-----
> > > > From: Weiwei Yang <[email protected]>
> > > > Sent: Friday, February 5, 2021 4:47 PM
> > > > To: [email protected]
> > > > Subject: Re: Unexpected scheduler behavior with K8s
> > > >
> > > > *** WARNING ***
> > > > EXTERNAL EMAIL -- This message originates from outside our organization.
> > > >
> > > > Hi Patrick
> > > >
> > > > Thanks for reaching out.
> > > > I have one question about how you limit there to be only 2 pods
> > > > running at a time. Are you running this on a single-node cluster,
> > > > and limiting that at the node resource level?
> > > > Based on the use case you described, my guess is that the scheduler
> > > > creates reservations for B-1 and A-3, and when those reservations
> > > > are unreserved, the FIFO ordering is not strictly honored.
> > > > Have you set any queue quota?
> > > > Here is the doc about how to set the queue mapping with a quota set:
> > > > http://yunikorn.apache.org/docs/next/user_guide/resource_quota_management#namespace-to-queue-mapping
> > > > If we limit the running pods by quota, then I think we will see the
> > > > expected behavior.
> > > > The document about the app sorting policy is here:
> > > > http://yunikorn.apache.org/docs/next/user_guide/sorting_policies#application-sorting
> > > >
> > > > Weiwei
> > > >
> > > > On Fri, Feb 5, 2021 at 1:04 PM Patrick, Alton (US)
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi. I have just started using Yunikorn for scheduling with K8s. I
> > > > > have been doing some simple experiments to make sure I understand
> > > > > how it works. Most of them are working as expected, but there is
> > > > > one I don't understand.
> > > > >
> > > > > I used Helm to deploy Yunikorn and I did not modify anything, so I
> > > > > have the default setup: one queue per namespace, and I assume the
> > > > > application sort policy is the default of "FifoSortPolicy".
> > > > >
> > > > > I create four pods. All of them have the same resource requests
> > > > > (2 cores, 1Gi mem), and the resource requests are such that only
> > > > > two can run at a time. The pods are created in this order, with a
> > > > > 1-second gap between each:
> > > > >
> > > > > 1. A-1, applicationId = A, sleeps for 10s
> > > > > 2. A-2, applicationId = A, sleeps for 5s
> > > > > 3. B-1, applicationId = B, sleeps for 5s
> > > > > 4.
> > > > >    A-3, applicationId = A, sleeps for 5s
> > > > >
> > > > > What I expect to see is:
> > > > > * A-1 is scheduled
> > > > > * A-2 is scheduled
> > > > > * A-2 finishes
> > > > > * A-3 is scheduled (because A is the first application created; as
> > > > >   long as there are pods in the queue for application A, I
> > > > >   understand that they should have priority over pods for
> > > > >   application B)
> > > > >
> > > > > What I see instead is that after A-2 finishes, B-1 gets scheduled
> > > > > to run.
> > > > >
> > > > > Is this the expected behavior, and if so, can someone explain what
> > > > > is wrong with my understanding?
> > > > >
> > > > > Additionally, in the logs for the scheduler pod, right after pod
> > > > > B-1 gets scheduled, I see the following messages repeated
> > > > > thousands of times very fast (over 2000 instances in about 0.25s
> > > > > according to timestamps in the log). Is this normal?
> > > > >
> > > > > 2021-02-05T20:25:34.479Z DEBUG scheduler/scheduling_application.go:641
> > > > > skipping node for allocation: basic condition not satisfied {"node":
> > > > > "local-node", "allocationKey": "258d9947-e92b-4967-9758-08eee62f4d1b",
> > > > > "error": "pre alloc check: requested resource map[memory:1074
> > > > > vcore:2000] is larger than currently available
> > > > > map[ephemeral-storage:50977832921 hugepages-2Mi:0 memory:13250
> > > > > pods:110 vcore:1600] resource on local-node"}
> > > > > 2021-02-05T20:25:34.479Z DEBUG scheduler/scheduling_node.go:271
> > > > > requested resource is larger than currently available node resources
> > > > > {"nodeID": "local-node", "requested": "map[memory:1074 vcore:2000]",
> > > > > "available": "map[ephemeral-storage:50977832921 hugepages-2Mi:0
> > > > > memory:13250 pods:110 vcore:1600]"}
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: [email protected]
> > > > > For additional commands, e-mail: [email protected]
