The scheduler in linux-generic was written very early on and clearly needs
a rewrite to be more robust, more modular, and better performing (there's a
*lot* of locking going on in the scheduler with all the queue enq/deq
operations).  We will need to revisit this to get a fully production-grade
SW scheduler for implementations like odp-dpdk that will rely on it.  If
you have ideas for how to better modularize the SW scheduler we'd be very
interested in seeing a proposal.  The expectation is that most SoCs either
have or will have HW schedulers as part of their roadmaps as event-driven
programming becomes more prevalent (in large part due to ODP adoption,
hopefully).  So the scheduler APIs, as currently defined, seem adequate,
though a few adjustments are to be expected over time.

In operating systems, scheduling is usually divided into two parts: a true
Scheduler, which determines which tasks are eligible to run and handles
things like deadline priorities to get RT behavior, and a Dispatcher, which
selects the highest-priority ready task from the list managed by the
scheduler. The dispatcher runs very frequently, while the scheduler
operates on a less frequent basis to keep the dispatcher fed and to ensure
overall fairness. In ODP, the API is called odp_schedule() because from an
application standpoint it doesn't need a complicated API--it just wants the
next piece of work.  Internally, however, we need to logically distinguish
between these two parts, especially if we hope to be able to
schedule/dispatch millions of events spanning as many queues in SW with any
efficiency and fairness.
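To make the split concrete, here is a minimal toy sketch of the two-level idea in C. It is purely illustrative: the names (sched_make_ready, dispatch_next, ready_ring_t) are invented for this example and are not ODP or linux-generic code. The "scheduler" side marks events ready at a priority; the "dispatcher" side just pops the highest-priority ready event.

```c
#include <assert.h>
#include <stddef.h>

/* Toy two-level model: a scheduler feeds per-priority ready rings,
 * a dispatcher pops the next event. Illustrative only; names are
 * invented and this is not the ODP API or linux-generic code. */

#define NUM_PRIO  4
#define RING_SIZE 64

typedef struct {
    int events[RING_SIZE];
    int head, tail;
} ready_ring_t;

static ready_ring_t ready[NUM_PRIO];

/* Scheduler side: mark an event ready at the given priority. */
static int sched_make_ready(int prio, int ev)
{
    ready_ring_t *r = &ready[prio];
    int next = (r->tail + 1) % RING_SIZE;

    if (next == r->head)
        return -1;              /* ring full */
    r->events[r->tail] = ev;
    r->tail = next;
    return 0;
}

/* Dispatcher side: return the highest-priority ready event, or -1.
 * This is the cheap, frequently-run half of the split. */
static int dispatch_next(void)
{
    for (int p = 0; p < NUM_PRIO; p++) {
        ready_ring_t *r = &ready[p];

        if (r->head != r->tail) {
            int ev = r->events[r->head];
            r->head = (r->head + 1) % RING_SIZE;
            return ev;
        }
    }
    return -1;                  /* nothing ready */
}
```

In this shape, fairness and eligibility policy live entirely on the scheduler side, so the dispatcher can stay a tight, lock-light loop.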

On Fri, Aug 28, 2015 at 9:19 AM, Nicolas Morey-Chaisemartin <
[email protected]> wrote:

>
>
> On 08/28/2015 03:59 PM, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> >
> >> -----Original Message-----
> >> From: ext Nicolas Morey-Chaisemartin [mailto:[email protected]]
> >> Sent: Friday, August 28, 2015 4:40 PM
> >> To: Savolainen, Petri (Nokia - FI/Espoo); LNG ODP Mailman List
> >> Subject: Re: [lng-odp] Scheduler, QUEUES_PER_PRIO and fairness
> >>
> >>
> >>
> >> On 08/28/2015 01:56 PM, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> >>>> -----Original Message-----
> >>>> From: lng-odp [mailto:[email protected]] On Behalf Of
> >>>> ext Nicolas Morey-Chaisemartin
> >>>> Sent: Friday, August 28, 2015 11:57 AM
> >>>> To: LNG ODP Mailman List
> >>>> Subject: [lng-odp] Scheduler, QUEUES_PER_PRIO and fairness
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I'm currently diving into the scheduler code from linux-generic to
> >>>> understand how it works and try to write an optimized version for our
> >>>> HW.
> >>>>
> >>>> Maybe I missed something, but I think there is a big fairness issue
> >>>> there.
> >>>>
> >>>> Let me sum up what I get. We will stick with only one priority to
> >> make
> >>>> it simpler.
> >>>> We have QUEUES_PER_PRIO queues in the scheduler that may contain
> >>>> commands.
> >>>> Commands are either:
> >>>>  - Poll a pktio
> >>>>  - Poll a queue
> >>>> Both commands are pushed back to the queue if we didn't get all the
> >>>> pending packets/buffers from them and they need to be polled again.
> >>>>
> >>>> Threads that call schedule do a round-robin scan of all schedule
> >>>> queues starting at queue threadId % QUEUES_PER_PRIO, jump to the next
> >>>> schedule queue if the current schedule command produced nothing, and
> >>>> stop on the first schedule command that produces a packet.
> >>>>
> >>>> Am I right up to that point?
> >>>>
> >>>>
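[Editor's note: the scan described above can be sketched as follows. This is a simplified model for illustration, not the actual linux-generic implementation; `pending` stands in for the per-queue command rings and `schedule_scan` for the walk a thread performs inside odp_schedule().]

```c
#include <assert.h>

/* Toy model of the round-robin scan described above: each sched
 * queue holds a count of pending commands; a thread starts at its
 * "home" queue (thr % QUEUES_PER_PRIO) and walks forward until one
 * queue yields work. Illustrative only, not linux-generic code. */

#define QUEUES_PER_PRIO 4

static int pending[QUEUES_PER_PRIO];

/* Returns the index of the first sched queue (starting from the
 * thread's home queue) that produced work, or -1 if all were empty. */
static int schedule_scan(int thr)
{
    int home = thr % QUEUES_PER_PRIO;

    for (int i = 0; i < QUEUES_PER_PRIO; i++) {
        int q = (home + i) % QUEUES_PER_PRIO;

        if (pending[q] > 0) {
            pending[q]--;       /* consume one command */
            return q;
        }
    }
    return -1;
}
```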
> >>>> Now let's assume I'm a user unaware of how this works. I have
> >>>> QUEUES_PER_PRIO at 4 and am running ODP with 4 worker threads.
> >>>> I have created a bunch of queues for my classification, so each of
> >>>> the schedule queues has one command to poll a queue in it.
> >>>>
> >>>> For some reason I only want to call odp_schedule() from one thread.
> >>>> Could be that the others are directly polling specific high priority
> >>>> queues.
> >>>>
> >>>> When my thread enters odp_schedule, it starts with schedule_queue =
> >>>> (thId % 4). Let's say 0.
> >>>> Schedule queue 0 also contains the schedule command for my pktio.
> >>>>
> >>>> Now let's see what happens
> >>>> schedule()
> >>>>    check sched_queue 0
> >>>>        got sched_cmd to poll pktio
> >>>>         pktio_poll
> >>>>             dispatch 1 packet to each of my queues (this traffic is
> >>>> really well balanced and regular ;) )
> >>>>         re-enqueue the sched_cmd
> >>>>         continue
> >>>>    check sched_queue 1
> >>>>        got a sched_cmd to poll queue #1 ( which now has a packet)
> >>>>        return packet, requeue sched_cmd
> >>>>
> >>>> schedule()
> >>>>   check sched_queue 0
> >>>>       got sched_cmd to poll a queue #0 (which also has a packet)
> >>>>        return packet, requeue sched_cmd
> >>>>
> >>>> We are now in the exact same state as we were before the first call,
> >>>> except queues #2 and #3 now each have 1 packet pending.
> >>>> This keeps going until we run out of packets in the pool...
> >>>>
> >>>>
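[Editor's note: the trace above can be reproduced with a toy simulation. This is an invented model of the described behavior, not linux-generic code: the pktio command in sched queue 0 classifies one packet into every queue's backlog each time it runs, and the single thread's scan then always finds a packet at queue 1 and returns, so queues 2 and 3 pile up unserved.]

```c
#include <assert.h>

/* Toy model of the ping-pong effect described above. One thread,
 * 4 sched queues; queue 0's pktio command refills every backlog,
 * then the scan returns as soon as queue 1 yields a packet.
 * Illustrative only, not linux-generic code. */

#define NQ 4

static int backlog[NQ];         /* packets waiting per queue */

/* One schedule() call by thread 0: scan starts at queue 0.
 * Returns the queue that delivered a packet, or -1. */
static int schedule_once(void)
{
    for (int q = 0; q < NQ; q++) {
        if (q == 0) {
            /* pktio poll: classify one packet into each queue,
             * re-enqueue the command, keep scanning */
            for (int i = 0; i < NQ; i++)
                backlog[i]++;
            continue;
        }
        if (backlog[q] > 0) {
            backlog[q]--;       /* serve this queue, stop scanning */
            return q;
        }
    }
    return -1;
}
```

Running this loop shows queue 1 being served on every call while the backlogs of queues 2 and 3 grow without bound, which is the starvation the example describes.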
> >>>> So here are a few questions now:
> >>>> 1) Did I completely miss something and it actually works?
> >>>> 2) Why does linux-generic need multiple queues per prio? Is it to
> >>>> reduce contention on the sched_queue?
> >>>>
> >>>> Nicolas
> >>> Yes, it's optimized for multiple threads (4 or more). Each thread
> >>> starts from a different (e.g. % 4) scheduler queue and keeps
> >>> processing it as long as there are events. When a thread's "home"
> >>> sched queue is empty, it tries its luck on the next sched queue, and
> >>> so on. This lowers sched queue lock contention (and increases the
> >>> cache hit rate) when threads try to serve the same set of queues, and
> >>> load balances only when needed (when the home sched queue is empty).
> >>> -Petri
> >>>
> >>>
> >> The issue I see here is that it introduces a very important design
> >> constraint in ODP user code that is purely implementation specific.
> >> It assumes that at least QUEUES_PER_PRIO threads will be using schedule
> >> regularly and that their IDs are different modulo QUEUES_PER_PRIO.
> >>
> >> I could very easily see cases where a user would use schedule from one
> >> thread and redispatch some of the packets to other neighboring threads
> >> for additional computations (crypto, CRC, etc.), or, even with multiple
> >> threads using schedule, not necessarily use consecutive ones.
> >> This would impact fairness a lot and, as my example shows, could even
> >> stall the application at some point.
> >>
> >> I guess something could be added to the documentation, but it's not a very
> >> good example to give for the "generic" implementation.
> >>
> >> Nicolas
> > It should not stall if the system is not overloaded. Thread #0 should
> > move to sched queue #1 as soon as sched queue #0 is empty. If sched
> > queue #0 never empties (e.g. you circulate the same events back to the
> > original queue) and you have only one thread, the other sched queues
> > will pile up events until the pool runs out. But that's overload then.
> It does. But if #0 fills back up before #1 is empty (a ping-pong effect
> like in my example), sched queue #2 will never be checked!
> The system is not necessarily overloaded, just not fair to the queues.
>
> >
> > The application should not care about the implementation. Performance
> > may vary between implementations, but it should not stall.
> Agreed.
>
> >
> > If some thread jumps between schedule() and other work, you should do
> > pause -> schedule until no more events -> do other stuff -> resume ->
> > schedule ... Otherwise, the scheduler may spin on other threads waiting
> > for the thread that exited its schedule loop.
> >
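[Editor's note: the pause/resume discipline described above looks like the following in code. odp_schedule_pause(), odp_schedule_resume(), odp_schedule() and ODP_SCHED_NO_WAIT are real ODP API names; here they are stubbed with toy definitions so the shape of the loop is runnable without the ODP library, and the stub behavior is invented for illustration.]

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Stubs standing in for the ODP API so this sketch is self-contained.
 * In a real application these come from <odp_api.h>. */
#define ODP_SCHED_NO_WAIT 0
typedef int odp_event_t;
#define ODP_EVENT_INVALID (-1)

static int stub_events = 3;     /* pretend 3 events are prescheduled */
static int paused;

static void odp_schedule_pause(void)  { paused = 1; }
static void odp_schedule_resume(void) { paused = 0; }

static odp_event_t odp_schedule(void *from, uint64_t wait)
{
    (void)from;
    (void)wait;
    return stub_events > 0 ? stub_events-- : ODP_EVENT_INVALID;
}

static int drained;             /* events handled before leaving */

/* Leave the schedule loop cleanly: pause, drain any locally cached
 * events with NO_WAIT, do the other work, then resume before
 * calling schedule again. */
static void leave_schedule_loop(void)
{
    odp_event_t ev;

    odp_schedule_pause();
    while ((ev = odp_schedule(NULL, ODP_SCHED_NO_WAIT)) !=
           ODP_EVENT_INVALID)
        drained++;              /* process each drained event */
    /* ... do other (non-schedule) work here ... */
    odp_schedule_resume();
}
```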
> I don't see how pause/resume has an effect on the overall scheme of
> things. It allows emptying the local cache, but it would not avoid the
> potential stall.
>
> Nicolas
> _______________________________________________
> lng-odp mailing list
> [email protected]
> https://lists.linaro.org/mailman/listinfo/lng-odp
>
