On 6 April 2017 at 13:48, Jerin Jacob <[email protected]> wrote:
> -----Original Message-----
>> Date: Thu, 6 Apr 2017 12:54:10 +0200
>> From: Ola Liljedahl <[email protected]>
>> To: Brian Brooks <[email protected]>
>> Cc: Jerin Jacob <[email protected]>,
>>  "[email protected]" <[email protected]>
>> Subject: Re: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software
>>  scheduler
>>
>> On 5 April 2017 at 18:50, Brian Brooks <[email protected]> wrote:
>> > On 04/05 21:27:37, Jerin Jacob wrote:
>> >> -----Original Message-----
>> >> > Date: Tue, 4 Apr 2017 13:47:52 -0500
>> >> > From: Brian Brooks <[email protected]>
>> >> > To: [email protected]
>> >> > Subject: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software 
>> >> > scheduler
>> >> > X-Mailer: git-send-email 2.12.2
>> >> >
>> >> > This work derives from Ola Liljedahl's prototype [1] which introduced a
>> >> > scalable scheduler design based on primarily lock-free algorithms and
>> >> > data structures designed to decrease contention. A thread searches
>> >> > through a data structure containing only queues that are both non-empty
>> >> > and allowed to be scheduled to that thread. Strict priority scheduling 
>> >> > is
>> >> > respected, and (W)RR scheduling may be used within queues of the same 
>> >> > priority.
>> >> > Lastly, pre-scheduling or stashing is not employed since it is optional
>> >> > functionality that can be implemented in the application.
>> >> >
>> >> > In addition to scalable ring buffers, the algorithm also uses unbounded
>> >> > concurrent queues. LL/SC and CAS variants exist in cases where absense 
>> >> > of
>> >> > ABA problem cannot be proved, and also in cases where the compiler's 
>> >> > atomic
>> >> > built-ins may not be lowered to the desired instruction(s). Finally, a 
>> >> > version
>> >> > of the algorithm that uses locks is also provided.
>> >> >
>> >> > See platform/linux-generic/include/odp_config_internal.h for further 
>> >> > build
>> >> > time configuration.
>> >> >
>> >> > Use --enable-schedule-scalable to conditionally compile this scheduler
>> >> > into the library.
>> >>
>> >> This is an interesting stuff.
>> >>
>> >> Do you have any performance/latency numbers in comparison to exiting 
>> >> scheduler
>> >> for completing say two stage(ORDERED->ATOMIC) or N stage pipeline on any 
>> >> platform?
>> It is still a SW implementation, there is overhead accessed with queue
>> enqueue/dequeue and the scheduling itself.
>> So for an N-stage pipeline, overhead will accumulate.
>> If only a subset of threads are associated with each stage (this could
>> be beneficial for I-cache hit rate), there will be less need for
>> scalability.
>> What is the recommended strategy here for OCTEON/ThunderX?
>
> In the view of portable event driven applications(Works on both
> embedded and server capable chips), the SW schedule is an important piece.
>
>> All threads/cores share all work?
>
> That is the recommend one in HW as it supports nativity. But HW provides
> means to partition the work load based on odp schedule groups
>
>
>>
>> >
>> > To give an idea, the avg latency reported by odp_sched_latency is down to 
>> > half
>> > that of other schedulers (pre-scheduling/stashing disabled) on 4c A53, 16c 
>> > A57,
>> > and 12c broadwell. We are still preparing numbers, and I think it's worth 
>> > mentioning
>> > that they are subject to change as this patch series changes over time.
>> >
>> > I am not aware of an existing benchmark that involves switching between 
>> > different
>> > queue types. Perhaps this is happening in an example app?
>> This could be useful in e.g. IPsec termination. Use an atomic stage
>> for the replay protection check and update. Now ODP has ordered locks
>> for that so the "atomic" (exclusive) section can be achieved from an
>> ordered processing stage. Perhaps Jerin knows some other application
>> that utilises two-stage ORDERED->ATOMIC processing.
>
> We see ORDERED->ATOMIC as main use case for basic packet forward.Stage
> 1(ORDERED) to process on N cores and Stage2(ATOMIC) to maintain the ingress
> order.
Doesn't ORDERED scheduling maintain the ingress packet order all the
way to the egress interface? A least that's my understanding of ODP
ordered queues.
>From an ODP perspective, I fail to see how the ATOMIC stage is needed.

>
>
>>
>> >
>> >> When we say scalable scheduler, What application/means used to quantify
>> >> scalablity??
>> It starts with the design, use non-blocking data structures and try to
>> distribute data to threads so that they do not access shared data very
>> often. Some of this is a little detrimental to single-threaded
>> performance, you need to use more atomic operations. It seems to work
>> well on ARM (A53, A57) though, the penalty is higher on x86 (x86 is
>> very good with spin locks, cmpxchg seems to have more overhead
>> compared to ldxr/stxr on ARM which can have less memory ordering
>> constraints). We actually use different synchronisation strategies on
>> ARM and on x86 (compile time configuration).
>
> Another school of thought is to avoid all the lock using only single producer 
> and
> single consumer and create N such channels to avoid any sort of locking
> primitives for communication.
But such N independent channel will limit per-flow throughput per the
single-threaded performance of slowest stage (e.g. an individual CPU
core). It works great when you have the fastest CPU's, not so great
when you have the more power/area efficient CPU cores (which have
lower single-threaded performance).


>
>>
>> You can read more here:
>> https://docs.google.com/presentation/d/1BqAdni4aP4aHOqO6fNO39-0MN9zOntI-2ZnVTUXBNSQ
>> I also did an internal presentation on the scheduler prototype back at
>> Las Vegas, that presentation might also be somewhere on the Linaro web
>> site.
>
> Thanks for the presentation.
>
>>
>>
>> >>
>> >> Do you have any numbers in comparison to existing scheduler to show
>> >> magnitude of the scalablity on any platform?

Reply via email to