Sorry I missed this discussion. It is really interesting. IMO the
linux-generic scheduler is too simplistic to be used as is or to have its
behaviour copied. We have seen some undesirable behaviour in our internal
work where we use ODP. Much simplified, we essentially have two levels of
processing, both fed by the scheduler to the same shared set of CPUs
(sometimes only one CPU). The first level is fed using one queue and the
second level is fed using N queues. Both levels are fed the same number of
packets, so the first-level queue transfers N packets while each
second-level queue transfers one packet (on average). The linux-generic
scheduler does round robin over queues, trying to give *equal* attention to
each queue. But in our design, all queues are not equal and the first-level
queue should get more attention.
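To make the imbalance concrete, here is a toy simulation (all names and
numbers are illustrative, not from our actual code) of strict round robin
over one first-level queue and N second-level queues. One packet arrives
every other scheduler step, so total service demand matches capacity, yet
the first-level backlog still grows, because round robin gives that queue
only 1/(N+1) of the attention when it needs half:

```c
#define N_LVL2 4  /* number of second-level queues (illustrative) */

/* Toy model: every packet enters the single first-level queue, is
 * processed once, and is then forwarded to one of the N second-level
 * queues for a second round of processing.  One packet arrives every
 * other step, so total demand (two dequeues per packet) exactly matches
 * capacity (one dequeue per step) -- the system is NOT overloaded.  The
 * scheduler does strict round robin over the 1 + N queues, one dequeue
 * slot per step.  Returns the first-level backlog after 'steps' steps. */
static int simulate(int steps)
{
    int lvl1 = 0;
    int lvl2[N_LVL2] = { 0 };
    int next = 0;  /* round-robin position over the 1 + N queues */

    for (int step = 0; step < steps; step++) {
        if (step % 2 == 0)
            lvl1++;  /* new packet enters level 1 */

        if (next == 0) {  /* first-level queue's turn */
            if (lvl1 > 0) {
                lvl1--;
                lvl2[step % N_LVL2]++;  /* fan out to level 2 */
            }
        } else if (lvl2[next - 1] > 0) {
            lvl2[next - 1]--;  /* second-level work leaves the system */
        }
        next = (next + 1) % (N_LVL2 + 1);
    }
    return lvl1;
}
```

Despite matched capacity, the first-level backlog grows by roughly N-1
packets every 2(N+1) steps, because that queue gets only one dequeue slot
per round while the mostly-empty second-level queues get the rest.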

One idea was to specify weights on the queues and do some kind of weighted
queueing. But I don't like the idea of having to know the optimal weights
up front and configuring the scheduler accordingly; the optimal weights may
not even be constant. I think the scheduler needs to adapt dynamically to
the current workload. That should be much more robust, and it also requires
less configuration by the application (which is always difficult to get
right).

I think the scheduler should give each queue attention corresponding to how
many events it currently contains. One way of doing that is to keep
scheduling from a queue until either it is empty or you move on to the next
queue with probability 2^-n (n being the number of events on the queue).
The more events on a queue, the longer you serve it, but with some
increasing probability (as the queue drains) you move on to the next
eligible (non-empty) queue. Over time this should give some kind of "fair"
attention to all queues without stalling indefinitely on a queue that
refuses to drain.
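As a rough sketch (the queue type and the use of rand() are purely
illustrative, not a proposal for the real implementation), the stay-or-move
decision could look like this:

```c
#include <stdlib.h>

/* Illustrative queue state: 'len' is the current number of queued events. */
typedef struct {
    int len;
} sim_queue_t;

/* After dequeuing one event, decide whether to keep serving this queue
 * or move on to the next non-empty queue.  We move on with probability
 * 2^-n, where n is the number of events still on the queue: a long
 * queue keeps our attention, a draining queue gradually loses it. */
static int keep_serving(const sim_queue_t *q)
{
    /* rand() portably guarantees only 15 random bits, so clamp n. */
    int n = q->len < 15 ? q->len : 15;

    /* rand() & ((1 << n) - 1) is 0 with probability 2^-n, which is
     * exactly the "move on" case.  For an empty queue (n == 0) the
     * mask is 0, so we always move on. */
    return (rand() & ((1 << n) - 1)) != 0;
}
```

The scheduler's round-robin loop would simply re-check keep_serving() after
every dequeue instead of unconditionally advancing to the next queue.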

The changes needed to a scheduler that currently does unconditional round
robin shouldn't be too complicated.


On 28 August 2015 at 20:08, Bill Fischofer <[email protected]>
wrote:

> The scheduler in linux-generic was written very early on and clearly needs
> a rewrite to be more robust, modular, and better performing (there's a
> *lot* of locking going on in the scheduler with all the queue enq/deq
> operations).  We will need to revisit this to get a fully production-grade
> SW scheduler for implementations like odp-dpdk that will rely on it.  If
> you have ideas for how to better modularize the SW scheduler we'd be very
> interested in seeing a proposal.  The expectation is that most SoCs either
> have or will have HW schedulers as part of their roadmaps as event-driven
> programming becomes more prevalent (in large part due to ODP adoption,
> hopefully).  So the scheduler APIs, as currently defined, seem adequate,
> though a few adjustments are to be expected over time.
>
> In operating systems, scheduling is usually divided into two parts: A true
> Scheduler, which determines what tasks are eligible to be run, and handles
> things like deadline priority to get RT behavior, and a Dispatcher, that
> selects the highest priority ready task from the list managed by the
> scheduler. The dispatcher runs very frequently, while the scheduler
> operates on a less frequent basis to keep the dispatcher fed and to ensure
> overall fairness. In ODP, the API is called odp_schedule() because from an
> application standpoint it doesn't need a complicated API--it just wants the
> next piece of work.  Internally, however, we need to logically distinguish
> between these two parts, especially if we hope to be able to
> schedule/dispatch millions of events spanning as many queues in SW with any
> efficiency and fairness.
>
> On Fri, Aug 28, 2015 at 9:19 AM, Nicolas Morey-Chaisemartin <
> [email protected]> wrote:
>
>>
>>
>> On 08/28/2015 03:59 PM, Savolainen, Petri (Nokia - FI/Espoo) wrote:
>> >
>> >> -----Original Message-----
>> >> From: ext Nicolas Morey-Chaisemartin [mailto:[email protected]]
>> >> Sent: Friday, August 28, 2015 4:40 PM
>> >> To: Savolainen, Petri (Nokia - FI/Espoo); LNG ODP Mailman List
>> >> Subject: Re: [lng-odp] Scheduler, QUEUES_PER_PRIO and fairness
>> >>
>> >>
>> >>
>> >> On 08/28/2015 01:56 PM, Savolainen, Petri (Nokia - FI/Espoo) wrote:
>> >>>> -----Original Message-----
>> >>>> From: lng-odp [mailto:[email protected]] On Behalf Of
>> >>>> ext Nicolas Morey-Chaisemartin
>> >>>> Sent: Friday, August 28, 2015 11:57 AM
>> >>>> To: LNG ODP Mailman List
>> >>>> Subject: [lng-odp] Scheduler, QUEUES_PER_PRIO and fairness
>> >>>>
>> >>>> Hi all,
>> >>>>
>> >>>> I'm currently diving into the scheduler code from linux-generic to
>> >>>> understand how it works and try to write an optimized version for our
>> >>>> HW.
>> >>>>
>> >>>> Maybe I missed something, but I think there is a big fairness issue
>> >>>> there.
>> >>>>
>> >>>> Let me sum up what I understand. We will stick with only one priority
>> >>>> to make it simpler.
>> >>>> We have QUEUES_PER_PRIO queues in the scheduler that may contain
>> >>>> commands.
>> >>>> Commands are either:
>> >>>>  - Poll a pktio
>> >>>>  - Poll a queue
>> >>>> Both commands are pushed back to the queue if we didn't get all the
>> >>>> pending packets/buffers from them and they need to be polled again.
>> >>>>
>> >>>> Threads that call schedule do a round robin scan of all schedule
>> >>>> queues, starting at queue threadId % QUEUES_PER_PRIO, jump to the
>> >>>> next schedule queue if the first schedule command produced nothing,
>> >>>> and stop at the first schedule command that produces packets.
>> >>>>
>> >>>> Am I right up to that point?
>> >>>>
>> >>>>
>> >>>> Now let's assume I'm a user unaware of how this works. I have
>> >>>> QUEUES_PER_PRIO at 4 and am running ODP with 4 worker threads.
>> >>>> I have created a bunch of queues for my classification, so each of
>> >>>> the schedule queues has one command to poll a queue in it.
>> >>>>
>> >>>> For some reason I only want to call odp_schedule() from one thread.
>> >>>> Could be that the others are directly polling specific high priority
>> >>>> queues.
>> >>>>
>> >>>> When my thread enters odp_schedule, it starts with the
>> >> schedule_queue =
>> >>>> ( thId % 4). Let's say 0.
>> >>>> Schedule queue 0 also contains the schedule command for my pktio.
>> >>>>
>> >>>> Now let's see what happens
>> >>>> schedule()
>> >>>>    check sched_queue 0
>> >>>>        got sched_cmd to poll pktio
>> >>>>         pktio_poll
>> >>>>             dispatch 1 packet to each of my queues (this traffic is
>> >>>> really well balanced and regular ;) )
>> >>>>         re enqueue the sched_cmd
>> >>>>         continue
>> >>>>    check sched_queue 1
>> >>>>        got a sched_cmd to poll queue #1 ( which now has a packet)
>> >>>>        return packet, requeue sched_cmd
>> >>>>
>> >>>> schedule()
>> >>>>   check sched_queue 0
>> >>>>       got sched_cmd to poll a queue #0 (which also has a packet)
>> >>>>        return packet, requeue sched_cmd
>> >>>>
>> >>>> We are now in the exact same state as we were before the first call,
>> >>>> except queues #2 and #3 now each have 1 packet pending.
>> >>>> This should keep going on until we run out of packets in the pool...
>> >>>>
>> >>>>
>> >>>> So here are a few questions now:
>> >>>> 1) Did I completely miss something and it actually works?
>> >>>> 2) Why does linux-generic need multiple queues per prio? Is it to
>> >>>> reduce contention on the sched_queue?
>> >>>>
>> >>>> Nicolas
>> >>> Yes, it's optimized for multiple threads (4 or more). Each thread
>> >> starts from a different (e.g. % 4) scheduler queue and keeps processing
>> >> it as long as there are events. When a thread's "home" sched queue is
>> >> empty it tries its luck on the next sched queue, and so on. This lowers
>> >> sched queue lock contention (and increases the cache hit rate) when
>> >> threads try to serve the same set of queues, and load balances only
>> >> when needed (when the home sched queue is empty).
>> >>> -Petri
>> >>>
>> >>>
>> >> The issue I see here is that it introduces a very important design
>> >> constraint in ODP user code that is purely implementation specific.
>> >> It assumes that at least QUEUES_PER_PRIO threads will be calling
>> >> schedule regularly and that their IDs are different modulo
>> >> QUEUES_PER_PRIO.
>> >>
>> >> I could very easily see cases where a user would call schedule from one
>> >> thread and redispatch some of the packets to other neighboring threads
>> >> for additional computation (crypto, CRC, etc.), or, even with multiple
>> >> threads using schedule, not necessarily use consecutive thread IDs.
>> >> This would greatly impact fairness and, as my example shows, could even
>> >> stall the application at some point.
>> >>
>> >> I guess something could be added to the documentation, but it's not a
>> >> very good example to set for the "generic" implementation.
>> >>
>> >> Nicolas
>> > It should not stall if the system is not overloaded. Thread #0 should
>> move to sched queue #1 as soon as sched queue #0 is empty. If sched queue
>> #0 never empties (e.g. you circulate the same events back to the original
>> queue) and you have only one thread, the other sched queues will pile up
>> events until the pool runs out. But that's overload then.
>> It does. But if #0 fills back up before #1 is empty (a ping-pong effect
>> like in my example), sched queue #2 will never be checked!
>> The system is not necessarily overloaded, just not fair to the queues.
>>
>> >
>> > The application should not care about the implementation. Performance
>> may vary between implementations, but it should not stall.
>> Agreed.
>>
>> >
>> > If some thread jumps between schedule() and other work, you should do
>> pause -> schedule until no more events -> do other stuff -> resume ->
>> schedule ... Otherwise, the scheduler may spin on other threads waiting
>> for the thread that exited its schedule loop.
>> >
>> I don't see how pause/resume affects the overall scheme of things. It
>> allows the cache to be emptied, but it would not avoid the potential
>> stall.
>>
>> Nicolas
>> _______________________________________________
>> lng-odp mailing list
>> [email protected]
>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>
>
>
>
>
