On 5 April 2017 at 18:50, Brian Brooks <[email protected]> wrote:
> On 04/05 21:27:37, Jerin Jacob wrote:
>> -----Original Message-----
>> > Date: Tue, 4 Apr 2017 13:47:52 -0500
>> > From: Brian Brooks <[email protected]>
>> > To: [email protected]
>> > Subject: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software scheduler
>> > X-Mailer: git-send-email 2.12.2
>> >
>> > This work derives from Ola Liljedahl's prototype [1] which introduced a
>> > scalable scheduler design based on primarily lock-free algorithms and
>> > data structures designed to decrease contention. A thread searches
>> > through a data structure containing only queues that are both non-empty
>> > and allowed to be scheduled to that thread. Strict priority scheduling is
>> > respected, and (W)RR scheduling may be used within queues of the same
>> > priority.
>> > Lastly, pre-scheduling or stashing is not employed since it is optional
>> > functionality that can be implemented in the application.
>> >
>> > In addition to scalable ring buffers, the algorithm also uses unbounded
>> > concurrent queues. LL/SC and CAS variants exist in cases where absence of
>> > ABA problem cannot be proved, and also in cases where the compiler's atomic
>> > built-ins may not be lowered to the desired instruction(s). Finally, a
>> > version of the algorithm that uses locks is also provided.
>> >
>> > See platform/linux-generic/include/odp_config_internal.h for further build
>> > time configuration.
>> >
>> > Use --enable-schedule-scalable to conditionally compile this scheduler
>> > into the library.
>>
>> This is an interesting stuff.
>>
>> Do you have any performance/latency numbers in comparison to existing
>> scheduler for completing say two stage(ORDERED->ATOMIC) or N stage
>> pipeline on any platform?

It is still a SW implementation, so there is overhead associated with queue enqueue/dequeue and with the scheduling itself. For an N-stage pipeline, this overhead will accumulate.
If only a subset of threads is associated with each stage (this could be beneficial for the I-cache hit rate), there will be less need for scalability. What is the recommended strategy here for OCTEON/ThunderX? Do all threads/cores share all work?
>
> To give an idea, the avg latency reported by odp_sched_latency is down to half
> that of other schedulers (pre-scheduling/stashing disabled) on 4c A53, 16c
> A57, and 12c broadwell. We are still preparing numbers, and I think it's worth
> mentioning that they are subject to change as this patch series changes over
> time.
>
> I am not aware of an existing benchmark that involves switching between
> different queue types. Perhaps this is happening in an example app?

This could be useful in e.g. IPsec termination: use an atomic stage for the replay protection check and update. ODP now has ordered locks for that, so the "atomic" (exclusive) section can be achieved from within an ordered processing stage. Perhaps Jerin knows some other application that utilises two-stage ORDERED->ATOMIC processing.

>
>> When we say scalable scheduler, What application/means used to quantify
>> scalability??

It starts with the design: use non-blocking data structures and try to distribute data to threads so that they do not access shared data very often. Some of this is a little detrimental to single-threaded performance, since you need to use more atomic operations. It seems to work well on ARM (A53, A57), though. The penalty is higher on x86: x86 is very good with spin locks, and cmpxchg seems to have more overhead than ldxr/stxr on ARM, which can have fewer memory ordering constraints. We actually use different synchronisation strategies on ARM and on x86 (compile-time configuration).

You can read more here:
https://docs.google.com/presentation/d/1BqAdni4aP4aHOqO6fNO39-0MN9zOntI-2ZnVTUXBNSQ

I also did an internal presentation on the scheduler prototype in Las Vegas; that presentation might also be somewhere on the Linaro web site.

>>
>> Do you have any numbers in comparison to existing scheduler to show
>> magnitude of the scalability on any platform?
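
To make the IPsec replay-protection example above more concrete, here is a minimal sketch in C of the check-and-update step that needs exclusive access. It is only an illustration under stated assumptions: replay_window_t, replay_window_init() and replay_check_and_update() are hypothetical names, not part of ODP or of this patch series, and the 64-packet sliding window is just one common choice. The code itself contains no synchronisation; the assumption is that the caller runs it either from an atomic queue or, in an ordered stage, inside an ordered-lock section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REPLAY_WINDOW_BITS 64u /* assumed window size, one bit per packet */

/* Hypothetical per-SA replay state (not an ODP type). */
typedef struct {
	uint64_t top;    /* highest sequence number accepted so far */
	uint64_t bitmap; /* bit i set => sequence number (top - i) was seen */
} replay_window_t;

static void replay_window_init(replay_window_t *w)
{
	assert(w != NULL);
	w->top = 0;
	w->bitmap = 0;
}

/*
 * Check a received sequence number against the window and, if it is
 * acceptable, record it. Returns true when the packet is accepted.
 *
 * This function is NOT thread-safe on its own: it assumes the caller
 * already has exclusive access to *w (atomic queue context, or an
 * ordered-lock section in an ordered stage).
 */
static bool replay_check_and_update(replay_window_t *w, uint64_t seq)
{
	/* Treated as invalid in this sketch: senders start counting at 1. */
	if (seq == 0)
		return false;

	if (seq > w->top) {
		/* New highest number: slide the window forward. */
		uint64_t shift = seq - w->top;

		/* Shifting a 64-bit value by >= 64 is undefined in C. */
		w->bitmap = (shift >= REPLAY_WINDOW_BITS) ?
			0 : (w->bitmap << shift);
		w->bitmap |= UINT64_C(1); /* bit 0 represents seq itself */
		w->top = seq;
		return true;
	}

	uint64_t offset = w->top - seq;

	if (offset >= REPLAY_WINDOW_BITS)
		return false; /* older than the window: reject */

	uint64_t mask = UINT64_C(1) << offset;

	if (w->bitmap & mask)
		return false; /* already seen: replay */

	w->bitmap |= mask; /* late but new: accept and remember */
	return true;
}
```

The point of the sketch is that the read of the window and the write back to it form one short critical section per packet, which is why this stage wants atomic (exclusive) semantics while the rest of the IPsec processing can run in parallel in the ordered stage.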
