On 6 April 2017 at 13:48, Jerin Jacob <[email protected]> wrote:
> -----Original Message-----
>> Date: Thu, 6 Apr 2017 12:54:10 +0200
>> From: Ola Liljedahl <[email protected]>
>> To: Brian Brooks <[email protected]>
>> Cc: Jerin Jacob <[email protected]>,
>>  "[email protected]" <[email protected]>
>> Subject: Re: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software
>>  scheduler
>>
>> On 5 April 2017 at 18:50, Brian Brooks <[email protected]> wrote:
>> > On 04/05 21:27:37, Jerin Jacob wrote:
>> >> -----Original Message-----
>> >> > Date: Tue, 4 Apr 2017 13:47:52 -0500
>> >> > From: Brian Brooks <[email protected]>
>> >> > To: [email protected]
>> >> > Subject: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software
>> >> >  scheduler
>> >> > X-Mailer: git-send-email 2.12.2
>> >> >
>> >> > This work derives from Ola Liljedahl's prototype [1] which introduced a
>> >> > scalable scheduler design based on primarily lock-free algorithms and
>> >> > data structures designed to decrease contention. A thread searches
>> >> > through a data structure containing only queues that are both non-empty
>> >> > and allowed to be scheduled to that thread. Strict priority scheduling is
>> >> > respected, and (W)RR scheduling may be used within queues of the same
>> >> > priority.
>> >> > Lastly, pre-scheduling or stashing is not employed since it is optional
>> >> > functionality that can be implemented in the application.
>> >> >
>> >> > In addition to scalable ring buffers, the algorithm also uses unbounded
>> >> > concurrent queues. LL/SC and CAS variants exist in cases where absence of
>> >> > ABA problem cannot be proved, and also in cases where the compiler's
>> >> > atomic built-ins may not be lowered to the desired instruction(s).
>> >> > Finally, a version of the algorithm that uses locks is also provided.
>> >> >
>> >> > See platform/linux-generic/include/odp_config_internal.h for further
>> >> > build time configuration.
>> >> >
>> >> > Use --enable-schedule-scalable to conditionally compile this scheduler
>> >> > into the library.
>> >>
>> >> This is interesting stuff.
>> >>
>> >> Do you have any performance/latency numbers in comparison to the existing
>> >> scheduler for completing, say, a two-stage (ORDERED->ATOMIC) or N-stage
>> >> pipeline on any platform?
>> It is still a SW implementation, there is overhead associated with queue
>> enqueue/dequeue and the scheduling itself.
>> So for an N-stage pipeline, overhead will accumulate.
>> If only a subset of threads are associated with each stage (this could
>> be beneficial for I-cache hit rate), there will be less need for
>> scalability.
>> What is the recommended strategy here for OCTEON/ThunderX?
>
> In the view of portable event driven applications (works on both
> embedded and server capable chips), the SW schedule is an important piece.
>
>> All threads/cores share all work?
>
> That is the recommended one in HW as it supports it natively. But HW provides
> means to partition the work load based on odp schedule groups
>
>
>> >> >
>> > To give an idea, the avg latency reported by odp_sched_latency is down to
>> > half that of other schedulers (pre-scheduling/stashing disabled) on 4c A53,
>> > 16c A57, and 12c Broadwell. We are still preparing numbers, and I think
>> > it's worth mentioning that they are subject to change as this patch series
>> > changes over time.
>> >
>> > I am not aware of an existing benchmark that involves switching between
>> > different queue types. Perhaps this is happening in an example app?
>> This could be useful in e.g. IPsec termination. Use an atomic stage
>> for the replay protection check and update. Now ODP has ordered locks
>> for that so the "atomic" (exclusive) section can be achieved from an
>> ordered processing stage. Perhaps Jerin knows some other application
>> that utilises two-stage ORDERED->ATOMIC processing.
>
> We see ORDERED->ATOMIC as main use case for basic packet forward. Stage
> 1 (ORDERED) to process on N cores and Stage 2 (ATOMIC) to maintain the
> ingress order.
Doesn't ORDERED scheduling maintain the ingress packet order all the
way to the egress interface? At least that's my understanding of ODP
ordered queues.
From an ODP perspective, I fail to see how the ATOMIC stage is needed.
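
To make the question concrete, here is a toy model of ordered scheduling as I
understand it. It is plain Python, not ODP code; `ordered_stage`, the reorder
dictionary and the `seed` parameter are made-up names for illustration, and a
real implementation would look quite different. Workers may finish in any
order, but results are only released in ingress sequence order.

```python
import random


def ordered_stage(packets, process, seed=0):
    """Toy model of an ORDERED stage: work completes in arbitrary order,
    but results leave in ingress (sequence number) order."""
    rng = random.Random(seed)
    # Tag each packet with its ingress sequence number.
    tagged = list(enumerate(packets))
    # Model parallel workers finishing out of order.
    rng.shuffle(tagged)

    reorder = {}   # completed results waiting for their turn
    next_seq = 0   # next sequence number allowed to leave
    egress = []
    for seq, pkt in tagged:
        reorder[seq] = process(pkt)
        # Release every result that is now in order.
        while next_seq in reorder:
            egress.append(reorder.pop(next_seq))
            next_seq += 1
    return egress
```

In this model the egress list always matches the ingress order, whatever the
completion order was, which is why a second stage added only to restore order
looks redundant to me.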
>
>
>> >> >
>> >> When we say scalable scheduler, what application/means is used to
>> >> quantify scalability?
>> It starts with the design, use non-blocking data structures and try to
>> distribute data to threads so that they do not access shared data very
>> often. Some of this is a little detrimental to single-threaded
>> performance, you need to use more atomic operations. It seems to work
>> well on ARM (A53, A57) though, the penalty is higher on x86 (x86 is
>> very good with spin locks, cmpxchg seems to have more overhead
>> compared to ldxr/stxr on ARM which can have less memory ordering
>> constraints). We actually use different synchronisation strategies on
>> ARM and on x86 (compile time configuration).
>
> Another school of thought is to avoid all the locks using only single
> producer and single consumer and create N such channels to avoid any sort
> of locking primitives for communication.
But N such independent channels will limit per-flow throughput to the
single-threaded performance of the slowest stage (e.g. an individual CPU
core). It works great when you have the fastest CPUs, not so great when
you have the more power/area efficient CPU cores (which have lower
single-threaded performance).

>
>>
>> You can read more here:
>> https://docs.google.com/presentation/d/1BqAdni4aP4aHOqO6fNO39-0MN9zOntI-2ZnVTUXBNSQ
>> I also did an internal presentation on the scheduler prototype back at
>> Las Vegas, that presentation might also be somewhere on the Linaro web
>> site.
>
> Thanks for the presentation.
>
>> >>
>> >>
>> >> Do you have any numbers in comparison to existing scheduler to show
>> >> magnitude of the scalability on any platform?
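
Coming back to the single-producer/single-consumer point above, the per-flow
limit can be put as simple arithmetic. This is a rough sketch under two
assumptions of mine: each flow is pinned to exactly one channel, and each
channel has one core per stage. The function names and the rates used in the
usage note are hypothetical, not measurements.

```python
def per_flow_limit(stage_rates_mpps):
    """Upper bound on one flow's throughput when the flow is pinned to a
    single SPSC channel: the slowest stage running on one core."""
    return min(stage_rates_mpps)


def aggregate_limit(stage_rates_mpps, n_channels, n_flows):
    """Upper bound across all flows: only channels that carry at least
    one flow can do useful work."""
    busy_channels = min(n_channels, n_flows)
    return busy_channels * per_flow_limit(stage_rates_mpps)
```

With hypothetical per-core stage rates of 3.0, 1.5 and 2.0 Mpps and 8
channels, a single flow is capped at 1.5 Mpps however many channels exist.
Only with 8 or more flows does the bound reach 8 * 1.5 Mpps.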
