Nicholas Mc Guire wrote:
>>>> Latencies are mainly due to cache refills on the P4. Have you already
>>>> put load onto your system? If not, worst case latencies will be even
>>>> longer.
>>>
>>>
>>> one possibility we found in RTLinux/GPL to reduce latency is to free up
>>> TLBs by flushing a few of the TLB hot spots, basically these flushpoints
>>> are something like:
>>>
>>> __asm__ __volatile__("invlpg %0": :"m"
>>> (*(char*)__builtin_return_address(0)));
>>>
>>> put at places where we know we don't need those lines any more (i.e.
>>> after switching tasks or the like). By inserting only a few such
>>> flushpoints in hot code on the kernel side we found a clear reduction
>>> of the worst-case jitter and interrupt response times.
>
>> Interesting. Are these flushpoints present in latest kernel patches of
>> RTLinux/GPL? Sounds like a nice thing to play with on a rainy day. :)
>
>
> yup - basically if you look at the latest patches (2.4.33-rtl3.2) you
> will find them in the kernel code. Or in the rtlinux core code
> (rtl_core.c and rtl_sched.c). The concept is of course not restricted
> to 2.4.X kernels; note though that some archs (notably MIPS)
> have a problem with __builtin_return_address.
OK, thanks.
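For anyone who wants to experiment with this: such a flushpoint boils
down to a one-liner. A minimal sketch of how one could wrap it (the
macro name below is made up for illustration, not the actual RTLinux/GPL
identifier; x86 only, and invlpg needs ring 0, i.e. kernel context):

/* Invalidate the TLB entry covering the caller's return address,
 * i.e. code we know we won't revisit soon - e.g. right after a task
 * switch. */
#define rt_tlb_flushpoint() \
        __asm__ __volatile__("invlpg %0" : : \
                "m" (*(char *)__builtin_return_address(0)))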
>
>
>>>
>>> Aside from caches, BTB exhaustion in high load situations is also a
>>> problem that has not been addressed much in the realtime variants - with
>>> the P6 families having a botched BTB prediction unit, one can use some
>>> "strange" constructions to reduce branch penalties - i.e.:
>>>
>>> if(!condition){slow_path();}
>>> else{fast_path();}
>>>
>>> is more predictable than
>>>
>>> if(condition){fast_path();}
>>> else{slow_path();}
>
>> I think this is also what likely()/unlikely() teaches the compiler on
>> x86 (where there is no branch prediction predicate for the
>> instructions), isn't it?
>
>
> no not really - likely/unlikely give hints during compilation to relocate
> the unlikely part to a distant location (some label at the end of the
> file...) but that does not change the problem at runtime with respect to
> the worst case. The BTB uses a hysteresis of one miss/hit to adjust the
> guess on P6 systems with the default (if the address is not present in
> the BTB) of not taken - thus if you reorder for the "not taken" case
> being the fast path you will always have the fast path preloaded in
> the pipeline.
>
> if(likely(condition)){
>         fast_path();
> } else {
>         slow_path();
> }
>
> will be fast on average but the worst case is that the address is not
> in the BTB so the slow_path() target is loaded by default.
Ah, got the idea. How much arch/processor-type-dependent is this
optimisation? It would surely make no sense to optimise for arch X in
generic code.
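For reference, likely()/unlikely() are just thin wrappers around gcc's
__builtin_expect and only steer the static block layout at compile time
- roughly like this (a simplified sketch, not the literal compiler.h
text):

#define likely(x)       __builtin_expect(!!(x), 1)
#define unlikely(x)     __builtin_expect(!!(x), 0)

So gcc can emit the likely() branch as the fall-through and move the
other one out of line, but that tells the BTB nothing at run time.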
>
> There is a paper on this (a bit messy) published at RTLWS7 (Lille) 2005
> if you are interested in the details.
>
>>>
>>> as in the first case the branch prediction is static, thus the worst
>>> case is that you are jumping over a few bytes of object code when the
>>> condition is not met. In the second case, the default if the BTB does
>>> not yet know this branch is to guess not-taken and thus to load the
>>> jump target of the slow path, with the overhead of TLB/cache penalties.
>>>
>>> Regarding the PPC numbers, the surprising thing for me is that the same
>>> archs are doing MUCH better with old RTAI/RTLinux versions, i.e. 2.4.4
>>> kernel on a 50MHz MPC860 shows a worst case of 57us - so I do question
>>> what is going wrong here in the 2.6.X branches of hard-realtime Linux -
>
>> You forget that old stuff was kernel-only, lacking a lot of Linux
>> integration features. Recent I-pipe-based real-time via Xenomai normally
>> includes support for user-space RT (you can switch it off, but hardly
>> anyone does). So it's not a useful comparison given that new real-time
>> projects almost always want full-featured user space these days. For a
>> fairer comparison, one should consider a simple I-pipe domain that
>> contains the real-time "application".
>
>
> note that the numbers posted here WERE kernel numbers!
But with user space support enabled. There are no separate code paths
for kernel and user space threads; the basic infrastructure is shared
here for good reasons.
> I know that people want to move to user-space - but what is the advantage
> over RT-preempt then if you use the dynamic tick patch (scheduled to go
> mainline in 2.6.21 BTW) ?
So far, determinism (both w.r.t. mainline and latest -rt).
BTW, kernel-space real time is no longer advisable, in particular for
commercial projects that have to worry about the (likely non-GPL)
license of their application code. And then there are the countless
technical advantages that speed up the development process of user-space
apps.
>
>>> my suspicion is that there is too much work being done on fast, hot
>>> CPUs while the low end is being neglected - which is bad, as the
>>> numbers you post here for ADEOS are by now reachable with the
>>> mainstream preemptible kernel as well (of course not on the low-end
>>> systems, though).
>
>> That's scenario-dependent. Simple setups like a plain timed task can
>> reach the dimension of I-pipe-based Xenomai, but more complex scenarios
>> suffer from the exploding complexity in mainstream Linux, even with -rt.
>> Just think of "simple" mutexes realised via futexes.
>
>
> do you have some code samples with numbers ? I would be very interested in
> a demo that shows this problem - I was not able to really find a smoking
> gun with RT-preempt and dynamic ticks (2.6.17.2).
I can't help with demo code, but I can name a few conceptual issues:
o Futexes may have to allocate memory when suspending on a contended
  lock (refill_pi_state_cache) - a minimal user-space trigger for this
  path is sketched below
o Futexes depend on mmap_sem
o Preemptible RCU read-sides can either lead to OOM or require
  intrusive read-side priority boosting (see Paul McKenney's LWN
  article)
o Excessive lock nesting depths in critical code paths make it hard to
  predict worst-case behaviour (or to verify that measurements have
  actually triggered the worst case)
o Any nanosleep&friends-using Linux process can schedule hrtimers at
  arbitrary dates, requiring a pretty close look at the (worst-case)
  timer usage pattern of the _whole_ system, not only the
  SCHED_FIFO/RR part
That's what I can tell off the top of my head, but one would have to
analyse the code more thoroughly, I guess.
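If someone wants to trace that first futex path, a trivial user-space
trigger looks like this (plain POSIX, nothing Xenomai-specific; whether
the contended lock actually ends up in refill_pi_state_cache depends on
the glibc/kernel combination):

/* Two threads contending on a priority-inheritance mutex: the blocking
 * side enters the kernel's PI futex code, which is where the pi_state
 * allocation mentioned above lives. Build with -lpthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock;

static void *contender(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);      /* blocks in the PI futex path */
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_mutexattr_t attr;
        pthread_t t;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&lock, &attr);

        pthread_mutex_lock(&lock);
        pthread_create(&t, NULL, contender, NULL);
        sleep(1);                       /* let the contender block */
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);
        return 0;
}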
Jan
