On Thu, Apr 6, 2017 at 1:46 PM, Bill Fischofer
<bill.fischo...@linaro.org> wrote:
> On Thu, Apr 6, 2017 at 1:32 PM, Ola Liljedahl <ola.liljed...@linaro.org> 
> wrote:
>> On 6 April 2017 at 13:48, Jerin Jacob <jerin.ja...@caviumnetworks.com> wrote:
>>> -----Original Message-----
>>>> Date: Thu, 6 Apr 2017 12:54:10 +0200
>>>> From: Ola Liljedahl <ola.liljed...@linaro.org>
>>>> To: Brian Brooks <brian.bro...@arm.com>
>>>> Cc: Jerin Jacob <jerin.ja...@caviumnetworks.com>,
>>>>  "lng-odp@lists.linaro.org" <lng-odp@lists.linaro.org>
>>>> Subject: Re: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software
>>>>  scheduler
>>>>
>>>> On 5 April 2017 at 18:50, Brian Brooks <brian.bro...@arm.com> wrote:
>>>> > On 04/05 21:27:37, Jerin Jacob wrote:
>>>> >> -----Original Message-----
>>>> >> > Date: Tue, 4 Apr 2017 13:47:52 -0500
>>>> >> > From: Brian Brooks <brian.bro...@arm.com>
>>>> >> > To: lng-odp@lists.linaro.org
>>>> >> > Subject: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software 
>>>> >> > scheduler
>>>> >> > X-Mailer: git-send-email 2.12.2
>>>> >> >
>>>> >> > This work derives from Ola Liljedahl's prototype [1] which introduced
>>>> >> > a scalable scheduler design based primarily on lock-free algorithms
>>>> >> > and data structures designed to decrease contention. A thread
>>>> >> > searches through a data structure containing only queues that are
>>>> >> > both non-empty and allowed to be scheduled to that thread. Strict
>>>> >> > priority scheduling is respected, and (W)RR scheduling may be used
>>>> >> > within queues of the same priority. Lastly, pre-scheduling or
>>>> >> > stashing is not employed, since it is optional functionality that
>>>> >> > can be implemented in the application.
>>>> >> >
>>>> >> > In addition to scalable ring buffers, the algorithm also uses
>>>> >> > unbounded concurrent queues. LL/SC and CAS variants exist in cases
>>>> >> > where absence of the ABA problem cannot be proven, and also in cases
>>>> >> > where the compiler's atomic built-ins may not be lowered to the
>>>> >> > desired instruction(s). Finally, a version of the algorithm that
>>>> >> > uses locks is also provided.
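>>>> >> >
>>>> >> > As a rough illustration of the tagged-pointer CAS variant (a sketch,
>>>> >> > not the patch code itself), the pointer is paired with a generation
>>>> >> > tag so that a recycled pointer no longer compares equal; this assumes
>>>> >> > a 16-byte CAS is available (e.g. cmpxchg16b via -mcx16, or libatomic):
>>>> >> >
>>>> >> >     #include <stdatomic.h>
>>>> >> >     #include <stdint.h>
>>>> >> >
>>>> >> >     struct node { struct node *next; };
>>>> >> >
>>>> >> >     struct tagged_ptr {
>>>> >> >         struct node *ptr;
>>>> >> >         uintptr_t    tag;  /* bumped on every update to break ABA */
>>>> >> >     };
>>>> >> >
>>>> >> >     static struct node *pop(_Atomic struct tagged_ptr *top)
>>>> >> >     {
>>>> >> >         struct tagged_ptr old = atomic_load(top);
>>>> >> >         struct tagged_ptr new;
>>>> >> >
>>>> >> >         do {
>>>> >> >             if (old.ptr == NULL)
>>>> >> >                 return NULL;
>>>> >> >             /* Sketch only: assumes nodes are never freed while
>>>> >> >              * another thread may still dereference them. */
>>>> >> >             new.ptr = old.ptr->next;
>>>> >> >             new.tag = old.tag + 1;
>>>> >> >         } while (!atomic_compare_exchange_weak(top, &old, new));
>>>> >> >
>>>> >> >         return old.ptr;
>>>> >> >     }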
>>>> >> >
>>>> >> > See platform/linux-generic/include/odp_config_internal.h for further
>>>> >> > build-time configuration.
>>>> >> >
>>>> >> > Use --enable-schedule-scalable to conditionally compile this scheduler
>>>> >> > into the library.
>>>> >>
>>>> >> This is interesting stuff.
>>>> >>
>>>> >> Do you have any performance/latency numbers in comparison to the
>>>> >> existing scheduler for completing, say, a two-stage (ORDERED->ATOMIC)
>>>> >> or N-stage pipeline on any platform?
>>>> It is still a SW implementation; there is overhead associated with queue
>>>> enqueue/dequeue and the scheduling itself.
>>>> So for an N-stage pipeline, overhead will accumulate.
>>>> If only a subset of threads is associated with each stage (this could
>>>> be beneficial for I-cache hit rate), there will be less need for
>>>> scalability.
>>>> What is the recommended strategy here for OCTEON/ThunderX?
>>>
>>> In the view of portable event-driven applications (which work on both
>>> embedded and server-class chips), the SW scheduler is an important piece.
>>>
>>>> All threads/cores share all work?
>>>
>>> That is the recommended mode in HW, as it is supported natively. But HW
>>> provides means to partition the workload based on ODP schedule groups.
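>>>
>>> For example (a sketch; field names per odp/api/schedule.h and
>>> odp/api/queue.h, 'worker_thr' is a placeholder thread id), a stage can
>>> be confined to a subset of threads via a schedule group:
>>>
>>>     odp_thrmask_t mask;
>>>     odp_schedule_group_t grp;
>>>     odp_queue_param_t qparam;
>>>
>>>     odp_thrmask_zero(&mask);
>>>     odp_thrmask_set(&mask, worker_thr);   /* repeat per stage thread */
>>>
>>>     grp = odp_schedule_group_create("stage1", &mask);
>>>
>>>     odp_queue_param_init(&qparam);
>>>     qparam.type        = ODP_QUEUE_TYPE_SCHED;
>>>     qparam.sched.sync  = ODP_SCHED_SYNC_ORDERED;
>>>     qparam.sched.group = grp;   /* only threads in 'mask' get this queue */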
>>>
>>>
>>>>
>>>> >
>>>> > To give an idea, the avg latency reported by odp_sched_latency is down
>>>> > to half that of other schedulers (pre-scheduling/stashing disabled) on
>>>> > 4c A53, 16c A57, and 12c Broadwell. We are still preparing numbers, and
>>>> > I think it's worth mentioning that they are subject to change as this
>>>> > patch series changes over time.
>>>> >
>>>> > I am not aware of an existing benchmark that involves switching between
>>>> > different queue types. Perhaps this is happening in an example app?
>>>> This could be useful in e.g. IPsec termination: use an atomic stage
>>>> for the replay protection check and update. ODP now has ordered locks
>>>> for that, so the "atomic" (exclusive) section can be achieved from an
>>>> ordered processing stage. Perhaps Jerin knows some other application
>>>> that utilises two-stage ORDERED->ATOMIC processing.
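>>>>
>>>> A sketch of that pattern (assuming the ordered queue was created with
>>>> sched.lock_count >= 1; replay_check_update() is a placeholder):
>>>>
>>>>     ev = odp_schedule(&from, ODP_SCHED_WAIT);
>>>>
>>>>     /* ...bulk IPsec processing runs in parallel, in ordered context... */
>>>>
>>>>     odp_schedule_order_lock(0);           /* serialise in ingress order */
>>>>     ok = replay_check_update(sa, seq_no);
>>>>     odp_schedule_order_unlock(0);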
>>>
>>> We see ORDERED->ATOMIC as the main use case for basic packet forwarding:
>>> Stage 1 (ORDERED) to process on N cores and Stage 2 (ATOMIC) to maintain
>>> the ingress order.
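>>>
>>> Roughly like this (a sketch; 'stage2_q' would be an ATOMIC scheduled
>>> queue, process_packet() a placeholder):
>>>
>>>     /* Stage 1: events come from an ORDERED queue, run on N cores */
>>>     ev = odp_schedule(&from, ODP_SCHED_WAIT);
>>>     process_packet(odp_packet_from_event(ev));
>>>
>>>     /* The ordered context restores ingress order on enqueue; the
>>>      * ATOMIC stage 2 then sees one event at a time, in order. */
>>>     odp_queue_enq(stage2_q, ev);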
>> Doesn't ORDERED scheduling maintain the ingress packet order all the
>> way to the egress interface? At least that's my understanding of ODP
>> ordered queues.
>> From an ODP perspective, I fail to see why the ATOMIC stage is needed.
>
> As pointed out earlier, ordered locks are another option to avoid a
> separate processing stage simply to do in-sequence operations within
> an ordered flow. I'd be curious to understand the use-case in a bit
> more detail here. Ordered queues preserve the originating queue's
> order; however, achieving end-to-end ordering across multiple
> processing stages requires that flows traverse only ordered or atomic
> queues. If a parallel queue is used, ordering is indeterminate from
> that point on in the pipeline.
>
>>
>>>
>>>
>>>>
>>>> >
>>>> >> When we say scalable scheduler, what application/method is used to
>>>> >> quantify scalability?
>>>> It starts with the design: use non-blocking data structures and try to
>>>> distribute data to threads so that they do not access shared data very
>>>> often. Some of this is a little detrimental to single-threaded
>>>> performance, since you need to use more atomic operations. It seems to
>>>> work well on ARM (A53, A57) though; the penalty is higher on x86 (x86 is
>>>> very good with spin locks, and cmpxchg seems to have more overhead
>>>> compared to ldxr/stxr on ARM, which can use weaker memory ordering
>>>> constraints). We actually use different synchronisation strategies on
>>>> ARM and on x86 (compile-time configuration).
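>>>>
>>>> Simplified, the split could look like this for e.g. a list push
>>>> (illustrative only, not the actual prototype code; 'head', 'node' and
>>>> 'lock' are placeholders):
>>>>
>>>>     #ifdef __aarch64__
>>>>     /* lock-free push: the CAS loop maps to ldxr/stxr with only the
>>>>      * ordering we ask for, which suits ARM well */
>>>>     node->next = __atomic_load_n(&head, __ATOMIC_RELAXED);
>>>>     while (!__atomic_compare_exchange_n(&head, &node->next, node, 1,
>>>>                                         __ATOMIC_RELEASE,
>>>>                                         __ATOMIC_RELAXED))
>>>>         ;
>>>>     #else
>>>>     /* x86 handles spin locks very well, so protect the same update
>>>>      * with a short critical section instead */
>>>>     odp_spinlock_lock(&lock);
>>>>     node->next = head;
>>>>     head = node;
>>>>     odp_spinlock_unlock(&lock);
>>>>     #endif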
>>>
>>> Another school of thought is to avoid all locking by using only
>>> single-producer/single-consumer queues, and to create N such channels so
>>> that no locking primitives are needed for communication.

An SPSC queue alleviates the need for atomic RMW operations, but does it
help with memory ordering or interconnect behavior?
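
For reference, a minimal SPSC ring sketch (my reading of the idea): no
RMW at all and one acquire/release pair per side, though the head/tail
cache lines still bounce between producer and consumer:

    #include <stdint.h>

    struct ring {
        uint32_t head;   /* written by the producer only */
        uint32_t tail;   /* written by the consumer only */
        uint32_t size;   /* power of two */
        void   **slot;
    };

    int spsc_enq(struct ring *r, void *item)
    {
        uint32_t head = __atomic_load_n(&r->head, __ATOMIC_RELAXED);
        uint32_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);

        if (head - tail == r->size)
            return -1;                         /* full */

        r->slot[head & (r->size - 1)] = item;
        /* Release: the slot write becomes visible before the new head. */
        __atomic_store_n(&r->head, head + 1, __ATOMIC_RELEASE);
        return 0;
    }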

>> But N such independent channels will limit per-flow throughput to the
>> single-threaded performance of the slowest stage (e.g. an individual CPU
>> core). It works great when you have the fastest CPU cores, not so great
>> when you have more power/area-efficient CPU cores (which have lower
>> single-threaded performance).
>
> True; however, if you have a sufficient number of cores you may be able
> to process dozens (or hundreds) of flows in parallel, and as long as
> each flow doesn't require more than a single core's worth of
> processing power, you win. Even if you need more than one core, having
> clusters of 2 or 4 cores process a single flow, with a dozen or more
> such clusters running independently, may be hugely beneficial vs. a
> smaller number of lumbering cores that need to share flows.
>
>>
>>
>>>
>>>>
>>>> You can read more here:
>>>> https://docs.google.com/presentation/d/1BqAdni4aP4aHOqO6fNO39-0MN9zOntI-2ZnVTUXBNSQ
>>>> I also did an internal presentation on the scheduler prototype back in
>>>> Las Vegas; that presentation might also be somewhere on the Linaro web
>>>> site.
>>>
>>> Thanks for the presentation.
>>>
>>>>
>>>>
>>>> >>
>>>> >> Do you have any numbers in comparison to the existing scheduler to show
>>>> >> the magnitude of the scalability on any platform?
