The scheduler in linux-generic was written very early on and clearly needs a rewrite to be more robust, modular, and better performing (there's a *lot* of locking going on in the scheduler with all the queue enq/deq operations). We will need to revisit this to get a fully production-grade SW scheduler for implementations like odp-dpdk that will rely on it. If you have ideas for how to better modularize the SW scheduler we'd be very interested in seeing a proposal. The expectation is that most SoCs either have or will have HW schedulers on their roadmaps as event-driven programming becomes more prevalent (in large part due to ODP adoption, hopefully). So the scheduler APIs, as currently defined, seem adequate, though a few adjustments are to be expected over time.
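To seed that discussion, here's a minimal sketch of what a pluggable SW scheduler interface might look like. All names here are hypothetical; nothing like this exists in the tree today:

    /* Hypothetical pluggable scheduler interface (sketch only): the
     * implementation would select one function table at init time
     * instead of hard-wiring the current lock-heavy design. */
    #include <odp.h>

    typedef struct {
            int  (*init_global)(void);           /* shared scheduler state, once */
            int  (*init_local)(void);            /* per-thread scheduler state */
            int  (*queue_add)(odp_queue_t q, int prio); /* register a sched queue */
            odp_event_t (*schedule)(odp_queue_t *from, uint64_t wait);
            void (*pause)(void);
            void (*resume)(void);
            int  (*term_global)(void);
    } sched_api_t;

    /* Each scheduler module would export one table, e.g.: */
    extern const sched_api_t sched_default_api;  /* today's design */
    extern const sched_api_t sched_lockless_api; /* a possible rewrite */

With something like this in place, alternative scheduler designs could be compared against the same conformance tests without touching the rest of the implementation.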
In operating systems, scheduling is usually divided into two parts: a true Scheduler, which determines which tasks are eligible to run and handles things like deadline priority to get RT behavior, and a Dispatcher, which selects the highest-priority ready task from the list managed by the Scheduler. The dispatcher runs very frequently, while the scheduler operates on a less frequent basis to keep the dispatcher fed and to ensure overall fairness. In ODP, the API is called odp_schedule() because from an application standpoint it doesn't need a complicated API--it just wants the next piece of work. Internally, however, we need to logically distinguish between these two parts, especially if we hope to schedule/dispatch millions of events spanning as many queues in SW with any efficiency and fairness.
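As a rough illustration of that split (a sketch only; the ring layout and function names here are invented, not linux-generic internals):

    /* Sketch of a scheduler/dispatcher split: a scheduler that runs
     * occasionally and applies priority/fairness policy, and a cheap
     * dispatcher that runs on every odp_schedule() call. */
    #include <odp.h>
    #include <stdint.h>

    #define NUM_PRIO  8
    #define RING_SIZE 256

    typedef struct {
            odp_event_t ev[RING_SIZE];
            uint32_t head, tail;      /* head == tail means empty */
    } ready_ring_t;

    static ready_ring_t ready[NUM_PRIO];

    /* Scheduler: scan scheduled queues, apply fairness policy (e.g.
     * weighted round robin) and refill the ready rings. Body omitted. */
    static void scheduler_refill(void)
    {
    }

    /* Dispatcher: called on every odp_schedule(); just pops the
     * highest-priority ready event. */
    static odp_event_t dispatch(void)
    {
            for (int prio = 0; prio < NUM_PRIO; prio++) {
                    ready_ring_t *r = &ready[prio];

                    if (r->head != r->tail)
                            return r->ev[r->head++ % RING_SIZE];
            }

            scheduler_refill();  /* nothing ready: give the scheduler a turn */
            return ODP_EVENT_INVALID;
    }

The point is that all the fairness policy lives in scheduler_refill(), which can afford to be thorough because it runs rarely, while the per-call fast path stays trivial.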
On Fri, Aug 28, 2015 at 9:19 AM, Nicolas Morey-Chaisemartin <[email protected]> wrote:

>
>
> On 08/28/2015 03:59 PM, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> >
> >> -----Original Message-----
> >> From: ext Nicolas Morey-Chaisemartin [mailto:[email protected]]
> >> Sent: Friday, August 28, 2015 4:40 PM
> >> To: Savolainen, Petri (Nokia - FI/Espoo); LNG ODP Mailman List
> >> Subject: Re: [lng-odp] Scheduler, QUEUES_PER_PRIO and fairness
> >>
> >>
> >> On 08/28/2015 01:56 PM, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> >>>> -----Original Message-----
> >>>> From: lng-odp [mailto:[email protected]] On Behalf Of
> >>>> ext Nicolas Morey-Chaisemartin
> >>>> Sent: Friday, August 28, 2015 11:57 AM
> >>>> To: LNG ODP Mailman List
> >>>> Subject: [lng-odp] Scheduler, QUEUES_PER_PRIO and fairness
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I'm currently diving into the scheduler code from linux-generic to
> >>>> understand how it works and to try to write an optimized version
> >>>> for our HW.
> >>>>
> >>>> Maybe I missed something, but I think there is a big fairness issue
> >>>> there.
> >>>>
> >>>> Let me sum up what I get. We will stick with only one priority to
> >>>> make it simpler.
> >>>> We have QUEUES_PER_PRIO queues in the scheduler that may contain
> >>>> commands. Commands are either:
> >>>> - Poll a pktio
> >>>> - Poll a queue
> >>>> Both commands are pushed back to the queue if we didn't get all the
> >>>> pending packets/buffers from them and they need to be polled again.
> >>>>
> >>>> Threads that call schedule do a round-robin scan of all schedule
> >>>> queues, starting on queue threadId % QUEUES_PER_PRIO, jump to the
> >>>> next schedule queue if the first schedule command produced nothing,
> >>>> and stop on the first schedule command that produces packets.
> >>>>
> >>>> Am I right up to that point?
> >>>>
> >>>>
> >>>> Now let's assume I'm a user unaware of how this works. I have
> >>>> QUEUES_PER_PRIO at 4, and am running ODP with 4 worker threads.
> >>>> I have created a bunch of queues for my classification, so each of
> >>>> the schedule queues has one command in it to poll a queue.
> >>>>
> >>>> For some reason I only want to call odp_schedule() from one thread.
> >>>> Could be that the others are directly polling specific
> >>>> high-priority queues.
> >>>>
> >>>> When my thread enters odp_schedule, it starts with schedule_queue =
> >>>> (thId % 4). Let's say 0.
> >>>> Schedule queue 0 also contains the schedule command for my pktio.
> >>>>
> >>>> Now let's see what happens:
> >>>> schedule()
> >>>>   check sched_queue 0
> >>>>     got sched_cmd to poll pktio
> >>>>     pktio_poll
> >>>>       dispatch 1 packet to each of my queues (this traffic is
> >>>>       really well balanced and regular ;) )
> >>>>     re-enqueue the sched_cmd
> >>>>     continue
> >>>>   check sched_queue 1
> >>>>     got a sched_cmd to poll queue #1 (which now has a packet)
> >>>>     return packet, requeue sched_cmd
> >>>>
> >>>> schedule()
> >>>>   check sched_queue 0
> >>>>     got sched_cmd to poll queue #0 (which also has a packet)
> >>>>     return packet, requeue sched_cmd
> >>>>
> >>>> We are now in the exact same state as we were before the first
> >>>> call, except queues #2 and #3 now have 1 packet pending.
> >>>> This should keep going on until we run out of packets in the
> >>>> pool...
> >>>>
> >>>>
> >>>> So here are a few questions now:
> >>>> 1) Did I completely miss something and it actually works?
> >>>> 2) Why does linux-generic need multiple queues per prio? Is it to
> >>>> reduce contention on the sched_queue?
> >>>>
> >>>> Nicolas
> >>>
> >>> Yes, it's optimized for multiple threads (4 or more). Each thread
> >>> starts from a different (e.g. % 4) scheduler queue and keeps
> >>> processing it as long as there are events. When a thread's "home"
> >>> sched queue is empty, it tries its luck on the next sched queue, and
> >>> so on. This lowers sched queue lock contention (and increases cache
> >>> hit rate) when threads try to serve the same set of queues, and load
> >>> balances only when needed (when the home sched queue is empty).
> >>>
> >>> -Petri
> >>>
> >>>
> >> The issue I see here is that it introduces a very important design
> >> constraint in ODP user code that is purely implementation specific.
> >> It assumes that at least QUEUES_PER_PRIO threads will be calling
> >> schedule regularly, and that their IDs are different modulo
> >> QUEUES_PER_PRIO.
> >>
> >> I could very easily see cases where a user would use schedule from
> >> one thread and redispatch some of the packets to other neighboring
> >> threads for additional computations (crypto, CRC, etc.), and even
> >> with multiple threads using schedule, not necessarily use consecutive
> >> ones.
> >> This would greatly impact fairness and, as my example shows, could
> >> even stall the application at some point in time.
> >>
> >> I guess something could be added to the documentation, but it's not a
> >> very good example to give for the "generic" implementation.
> >>
> >> Nicolas
> >
> > It should not stall if the system is not overloaded. Thread #0 should
> > move to sched queue #1 as soon as sched queue #0 is empty. If sched
> > queue #0 never empties (e.g. you circulate the same events back to the
> > original queue) and you have only one thread, the other sched queues
> > will pile up events until the pool runs out. But that's overload then.
>
> It does. But if #0 fills back up before #1 is empty (a ping-pong effect
> like in my example), sched queue #2 will never be checked!
> The system is not necessarily overloaded, just not fair to queues.
>
> > Application should not care about the implementation. Performance may
> > vary between implementations, but it should not stall.
>
> Agreed.
>
> > If some thread jumps between schedule() and other work, you should do
> > pause -> schedule until no more events -> do other stuff -> resume ->
> > schedule ... Otherwise, the scheduler may spin on other threads
> > waiting for the thread that exited its schedule loop.
>
> I don't see how pause/resume have an effect on the overall scheme of
> things.
> It allows emptying the cache, but it would not avoid the potential
> stall.
>
> Nicolas
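Closing the loop on the pause/resume point above: the pattern Petri recommends maps onto the existing odp_schedule_pause()/odp_schedule_resume() calls roughly like this (a sketch; the process_event/side_work hooks are assumed application code):

    /* The pause -> drain -> other work -> resume pattern. Draining
     * matters because events may already be pre-scheduled (cached) for
     * this thread; leaving the loop without draining can leave other
     * threads spinning on work this thread has already claimed. */
    #include <odp.h>

    /* Application-provided hooks, assumed for this sketch */
    void process_event(odp_event_t ev);
    int  side_work_pending(void);
    void do_side_work(void);

    void worker_loop(void)
    {
            odp_event_t ev;

            while (1) {
                    /* Normal event loop: wait up to 100 ms */
                    ev = odp_schedule(NULL,
                                      odp_schedule_wait_time(100000000ULL));
                    if (ev != ODP_EVENT_INVALID)
                            process_event(ev);

                    if (side_work_pending()) {
                            odp_schedule_pause();

                            /* Drain anything already staged for this thread */
                            while ((ev = odp_schedule(NULL, ODP_SCHED_NO_WAIT))
                                   != ODP_EVENT_INVALID)
                                    process_event(ev);

                            do_side_work(); /* e.g. poll a plain queue */

                            odp_schedule_resume();
                    }
            }
    }

As the thread discusses, this keeps the thread a well-behaved scheduler participant but does not by itself fix the round-robin fairness issue Nicolas describes.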
