Re: Time spent in ticks...

2016-10-18 Thread Sebastian Huber


- On 18 Oct 2016 at 18:03, Jakob Viketoft <jakob.viket...@aacmicrotec.com> wrote:

> Hello Pavel, Joel, Sebastian,
> 
> From: Pavel Pisa [ppisa4li...@pikron.com]
> Sent: Thursday, October 13, 2016 19:09
> To: devel@rtems.org
> Cc: Jakob Viketoft; j...@rtems.org
> Subject: Re: Time spent in ticks...
> 
>> Hello Jakob,
> 
>> ...
> 
>> the time measurement and the timer queues use 64-bit types for time
>> representation. When a higher time measurement resolution than the tick
>> is requested, this is a reasonable (optimal) choice, but it can be a problem
>> for 16-bit CPUs and some 32-bit ones as well.
> 
>> How have you configured the or1k CPU? Do you have a hardware multiplier
>> and barrel shifter available, or only shift-by-one and a multiplier in SW?
>> Do the CFLAGS match the available instructions?
> 
>> I am not sure if there is not a 64-bit division in the time computation
>> either. That would be a killer for your CPU. High-resolution
>> time sources and even tickless timer support can be implemented
>> with full scaling and adjustment using only shifts, additions and
>> multiplications in the hot paths.
> 
>> I tried to understand the actual RTEMS time-keeping code
>> some time ago when nanosleep was introduced, and
>> I tried to analyze it, proposed some changes and compared
>> it to Linux. See the thread following these messages:
> 
>>  https://lists.rtems.org/pipermail/devel/2016-August/015720.html
> 
>>  https://lists.rtems.org/pipermail/devel/2016-August/015721.html
> 
>> Some of the discussed changes to nanosleep have been implemented
>> already.
> 
>> Generally, try to measure how many times multiplication
>> and division are called in the ISR.
>> I think that I am capable of designing an implementation
>> restricted to mul, add and shr that minimizes the number
>> of transformations, but if it is found that the RTEMS implementation
>> needs to be optimized/changed, then it is a task counted
>> in man-months.
> 
>> Generally, if the tick interrupt lasts more than 10 (maybe 20) usec then
>> there is a problem. One source can be SW implementation inefficiency;
>> another is that the OS-selected and possibly application-required features
>> are beyond the selected CPU's capabilities.
> 
> Sorry for my late response, I got caught on another hook for a couple of days
> but have now been able to wriggle free and delve deeper into the problem.
> First off, let me say that our or1k is configured to have both multiplier and
> division units, and I can see that the toolchain matches, as these get used in
> the code (I can search for the generated instructions in a dump). However, for
> 64-bit multiplication and division there is no matching hardware and these are
> implemented in software. The problematic code in our case is part of the tick
> code, in function tc_windup() in file cpukit/score/src/kern_tc.c.
> 
> Going from Joel's clues about the erc32 and its timing, I looked into this a
> bit more and compared at the assembler level to see what it made of the same
> Clock_isr. I found that in the erc32 case there is an overriding definition of
> Clock_driver_timecounter_tick() which ultimately leads it to use
> _Timecounter_Tick_simple where we were using the default _Timecounter_Tick.
> Now, this obviously won't hit the same speed bump, and I believe going this
> way makes more sense for our CPU.
> 
> I just wanted to make sure that we don't lose any functionality or limit
> ourselves too much by going this route. Any comments or thoughts on this?
> Regarding CPU features, the erc32 and or1k seem to be quite similar and should
> perhaps also have a more similar BSP implementation. Please let me know if I'm
> dead wrong... :)

The simple timecounter tick is there to support badly designed hardware and is
less efficient than the normal timecounter tick.


RE: Time spent in ticks...

2016-10-18 Thread Jakob Viketoft
Hello Pavel, Joel, Sebastian,

From: Pavel Pisa [ppisa4li...@pikron.com]
Sent: Thursday, October 13, 2016 19:09
To: devel@rtems.org
Cc: Jakob Viketoft; j...@rtems.org
Subject: Re: Time spent in ticks...

> Hello Jakob,

> ...

> the time measurement and the timer queues use 64-bit types for time
> representation. When a higher time measurement resolution than the tick
> is requested, this is a reasonable (optimal) choice, but it can be a problem
> for 16-bit CPUs and some 32-bit ones as well.

> How have you configured the or1k CPU? Do you have a hardware multiplier
> and barrel shifter available, or only shift-by-one and a multiplier in SW?
> Do the CFLAGS match the available instructions?

> I am not sure if there is not a 64-bit division in the time computation
> either. That would be a killer for your CPU. High-resolution
> time sources and even tickless timer support can be implemented
> with full scaling and adjustment using only shifts, additions and
> multiplications in the hot paths.

> I tried to understand the actual RTEMS time-keeping code
> some time ago when nanosleep was introduced, and
> I tried to analyze it, proposed some changes and compared
> it to Linux. See the thread following these messages:

>  https://lists.rtems.org/pipermail/devel/2016-August/015720.html

>  https://lists.rtems.org/pipermail/devel/2016-August/015721.html

> Some of the discussed changes to nanosleep have been implemented
> already.

> Generally, try to measure how many times multiplication
> and division are called in the ISR.
> I think that I am capable of designing an implementation
> restricted to mul, add and shr that minimizes the number
> of transformations, but if it is found that the RTEMS implementation
> needs to be optimized/changed, then it is a task counted
> in man-months.

> Generally, if the tick interrupt lasts more than 10 (maybe 20) usec then
> there is a problem. One source can be SW implementation inefficiency;
> another is that the OS-selected and possibly application-required features
> are beyond the selected CPU's capabilities.

Sorry for my late response, I got caught on another hook for a couple of days
but have now been able to wriggle free and delve deeper into the problem. First
off, let me say that our or1k is configured to have both multiplier and
division units, and I can see that the toolchain matches, as these get used in
the code (I can search for the generated instructions in a dump). However, for
64-bit multiplication and division there is no matching hardware and these are
implemented in software. The problematic code in our case is part of the tick
code, in function tc_windup() in file cpukit/score/src/kern_tc.c.

Going from Joel's clues about the erc32 and its timing, I looked into this a
bit more and compared at the assembler level to see what it made of the same
Clock_isr. I found that in the erc32 case there is an overriding definition of
Clock_driver_timecounter_tick() which ultimately leads it to use
_Timecounter_Tick_simple where we were using the default _Timecounter_Tick.
Now, this obviously won't hit the same speed bump, and I believe going this way
makes more sense for our CPU.
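
For reference, the override follows the shared clock driver shell convention: a
BSP may define the hook before including the shell, otherwise the default full
timecounter tick is used. A minimal sketch, assuming that convention (the BSP
function body and the include path below are illustrative, not the actual erc32
code):

/* In the BSP clock driver, before including the shared shell: */
static void my_bsp_timecounter_tick( void )
{
  /* Illustrative only: acknowledge the tick in hardware here, then
   * run the simple tick variant instead of the default
   * _Timecounter_Tick() path that ends in tc_windup(). */
}

#define Clock_driver_timecounter_tick() my_bsp_timecounter_tick()

#include "../../shared/clockdrv_shell.h"  /* path varies per BSP */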

I just wanted to make sure that we don't lose any functionality or limit 
ourselves too much by going this route. Any comments or thoughts on this? 
Regarding CPU features, the erc32 and or1k seem to be quite similar and should
perhaps also have a more similar BSP implementation. Please let me know if I'm 
dead wrong... :)

/Jakob


Re: Time spent in ticks...

2016-10-14 Thread Sebastian Huber


- On 13 Oct 2016 at 18:21, Jakob Viketoft <jakob.viket...@aacmicrotec.com> wrote:

[...]
> Even though _Watchdog_Tick() "only" takes ~100 us now, it still sounds much
> higher than your total tick with a slower system (we're running at 50 MHz).
> 
> Is there anything we can do to improve these numbers? Is Clock_isr intended to
> be run uninterrupted as it is now? Can't see that much of the BSP patch code
> has anything to do with the speed of what I'm looking at right now...

It seems that the or1k has no support for add with carry? This makes basic 
64-bit operations quite expensive. How fast are the shift instructions on your 
processor? Does it support 32-bit integer multiplication instructions?

On the soft-core Nios 2 processor you get for example (nios2-rtems4.12-gcc -O2 
-mhw-mulx -mhw-mul -mhw-div):

uint64_t add(uint64_t a, uint64_t b)
{
  return a + b;
}

uint64_t mul(uint64_t a, uint64_t b)
{
  return a * b;
}

.align  2
.global add
.type   add, @function
add:
add r2, r4, r6
cmpltu  r4, r2, r4
add r3, r5, r7
add r3, r4, r3
ret
.size   add, .-add
.align  2
.global mul
.type   mul, @function
mul:
mul r5, r5, r6
mul r7, r7, r4
mulxuu  r3, r4, r6
mul r2, r4, r6
add r5, r5, r7
add r3, r5, r3
ret
.size   mul, .-mul
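
Written out in C, the listing above corresponds to roughly the following (a
sketch; add64() shows the unsigned compare standing in for a missing
add-with-carry, and mul64() builds the low 64 bits of the product from 32-bit
halves the way the four Nios 2 multiplies do):

#include <stdint.h>

/* 64-bit add on a 32-bit machine: the unsigned compare replaces a
 * missing add-with-carry instruction (cmpltu in the listing). */
static uint64_t add64(uint64_t a, uint64_t b)
{
  uint32_t lo = (uint32_t)a + (uint32_t)b;
  uint32_t carry = lo < (uint32_t)a;
  uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32) + carry;
  return ((uint64_t)hi << 32) | lo;
}

/* 64-bit multiply from 32-bit halves: one widening 32x32->64 multiply
 * for the low part plus two truncating multiplies for the high word. */
static uint64_t mul64(uint64_t a, uint64_t b)
{
  uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
  uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);
  uint64_t lo = (uint64_t)al * bl;
  uint32_t hi = ah * bl + al * bh;   /* only bits 32..63 survive */
  return lo + ((uint64_t)hi << 32);
}

On a core without a widening multiply, the multiplies end up in libgcc soft
routines, which is where most of the or1k cost would come from.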


Re: Time spent in ticks...

2016-10-13 Thread Pavel Pisa
Hello Joel,

On Friday 14 of October 2016 00:56:21 Joel Sherrill wrote:
> On Thu, Oct 13, 2016 at 1:37 PM, Joel Sherrill <j...@rtems.org> wrote:
> > On Thu, Oct 13, 2016 at 11:21 AM, Jakob Viketoft <
> >
> > jakob.viket...@aacmicrotec.com> wrote:
> >> From: Joel Sherrill [j...@rtems.org]
> >> Sent: Thursday, October 13, 2016 17:38
> >> To: Jakob Viketoft
> >> Cc: devel@rtems.org
> >> Subject: Re: Time spent in ticks...
> >>
> >> >I don't have an or1k handy, so I ran on a sparc/erc32 simulator.
> >> >It is a SPARC v7 at 15 MHz.
> >> >
> >> >These times are in microseconds and based on the tmtests.
> >> >Specifically tm08 and tm27.
> >> >
> >> >(1) rtems_clock_tick: only case - 52
> >> >(2) rtems interrupt: entry overhead returns to interrupted task - 12
> >> >(3) rtems interrupt: exit overhead returns to interrupted task - 4
> >> >(4) rtems interrupt: entry overhead returns to nested interrupt - 11
> >> >(5) rtems interrupt: exit overhead returns to nested interrupt - 3
> >
> > The above was from the master with SMP enabled. I repeated it with
> > SMP disabled and it had no impact.
> >
> > Since the timing change is post 4.11, I decided to try 4.11 with SMP
> > disabled:
> >
> > rtems_clock_tick: only case - 42
> > rtems interrupt: entry overhead returns to interrupted task - 11
> > rtems interrupt: exit overhead returns to interrupted task - 4
> > rtems interrupt: entry overhead returns to nested interrupt - 11
> > rtems interrupt: exit overhead returns to nested interrupt - 3
> >
> > So 42 + 12 + 4 = 58 microseconds, 58 * 15 = 870 cycles
> >
> > So the overhead has gone up some, but as Pavel says it is quite likely that
> > some mathematical operation on 64-bit types is slow on your CPU.
> >
> > HINT: If you can write a benchmark for 64-bit operations,
> > it would be a good comparison between CPUs and might
> > highlight where the software implementation needs improvement.
>
> I decided that another good point of reference was the powerpc/psim BSP. It
> reports the benchmarks in instructions:
>
> (1) rtems_clock_tick: only case - 229
> (2) rtems interrupt: entry overhead returns to interrupted task - 102
> (3) rtems interrupt: exit overhead returns to interrupted task - 95
> (4) rtems interrupt: entry overhead returns to nested interrupt - 105
> (5) rtems interrupt: exit overhead returns to nested interrupt - 85
>
> 229 + 102 + 95 = 426 instructions.
>
> That seems roughly in line with the erc32, which is 1 cycle for all
> instructions except loads, which are 3, and stores, which are 2. And the SPARC
> has register windows, so entering and exiting an ISR can potentially save
> and restore a lot of registers.
>
> So I am still leaning toward Pavel's explanation that some primitive operation
> is really inefficient.

These numbers look good.

I would expect that in the case of or1k there can be a real penalty
if it is synthesized without a multiplier or barrel shifter,
or if the CPU has these but the compiler is set not to use them.
If that cannot be corrected (for example, a hardware multiplier
or shifter would cause the design not to fit in the FPGA) then there
is a real problem and a mismatch between RTEMS and the CPU target
area. This could be solved by a configurable time measurement
data type, for example using only ticks in a 32-bit number
and changing even the timer queues to this type. It cannot be unconditional,
because today's users of RTEMS expect better time resolution
and a time that does not overflow over a longer range, ideally with
dates up to 2100 or beyond supported.

As for the actual code, if I remember correctly, I did not like the conversions
of monotonic time to ticks in nanosleep, and there was some division there.
The division is not in the tick code (at least I think so), so this should
be OK. The packed seconds-and-fractions timestamp format used for one
of the queues has some interesting properties, but on the other hand
its repacking has some overhead even in the tick processing.
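
The repacking in question is roughly the following shape (a sketch of the
FreeBSD-style conversion that kern_tc.c derives from; the my_ names mark it as
illustrative, not the RTEMS source):

#include <stdint.h>
#include <time.h>

/* 64.64 fixed-point time: whole seconds plus a binary fraction. */
struct my_bintime {
  int64_t  sec;
  uint64_t frac;
};

/* Unpacking the binary fraction into nanoseconds costs a 64-bit
 * multiply; the reverse direction needs a division, which is what
 * hurts on CPUs doing 64-bit arithmetic in software. */
static void my_bintime2timespec(const struct my_bintime *bt,
                                struct timespec *ts)
{
  ts->tv_sec  = (time_t)bt->sec;
  ts->tv_nsec =
    (long)(((uint64_t)1000000000 * (uint32_t)(bt->frac >> 32)) >> 32);
}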

If we take the CPU time spent in the tick to be, for example, 50 usec,
then it is not a problem if there are no deadlines in a similar range.
For example, with tolerated latencies of 500 or 1000 usec and a critical task
execution time of 300 usec, it is OK. But if the selected tick rate is
1 kHz, then 5% of the CPU time consumed by time-keeping looks like quite
a lot. If the timing of the application can tolerate a tick period of 0.1 sec
(10 Hz), then the load contribution of tick processing is negligible.
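
Spelled out, that load estimate is just the tick cost times the tick rate:

  load = t_tick * f_tick = 50e-6 s * 1000 Hz = 5%     (1 kHz tick)
  load = t_tick * f_tick = 50e-6 s * 10 Hz   = 0.05%  (10 Hz tick)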

So all these numbers are relative to the needs of the planned target application.

Best wishes,

Pavel




Re: Time spent in ticks...

2016-10-13 Thread Joel Sherrill
On Thu, Oct 13, 2016 at 1:37 PM, Joel Sherrill <j...@rtems.org> wrote:

>
>
> On Thu, Oct 13, 2016 at 11:21 AM, Jakob Viketoft <
> jakob.viket...@aacmicrotec.com> wrote:
>
>>
>> From: Joel Sherrill [j...@rtems.org]
>> Sent: Thursday, October 13, 2016 17:38
>> To: Jakob Viketoft
>> Cc: devel@rtems.org
>> Subject: Re: Time spent in ticks...
>>
>> >I don't have an or1k handy, so I ran on a sparc/erc32 simulator.
>> >It is a SPARC v7 at 15 MHz.
>>
>> >These times are in microseconds and based on the tmtests.
>> >Specifically tm08 and tm27.
>>
>> >(1) rtems_clock_tick: only case - 52
>> >(2) rtems interrupt: entry overhead returns to interrupted task - 12
>> >(3) rtems interrupt: exit overhead returns to interrupted task - 4
>> >(4) rtems interrupt: entry overhead returns to nested interrupt - 11
>> >(5) rtems interrupt: exit overhead returns to nested interrupt - 3
>>
>>
> The above was from the master with SMP enabled. I repeated it with
> SMP disabled and it had no impact.
>
> Since the timing change is post 4.11, I decided to try 4.11 with SMP
> disabled:
>
> rtems_clock_tick: only case - 42
> rtems interrupt: entry overhead returns to interrupted task - 11
> rtems interrupt: exit overhead returns to interrupted task - 4
> rtems interrupt: entry overhead returns to nested interrupt - 11
> rtems interrupt: exit overhead returns to nested interrupt - 3
>
> So 42 + 12 + 4 = 58 microseconds, 58 * 15 = 870 cycles
>
> So the overhead has gone up some, but as Pavel says it is quite likely that
> some mathematical operation on 64-bit types is slow on your CPU.
>
> HINT: If you can write a benchmark for 64-bit operations,
> it would be a good comparison between CPUs and might
> highlight where the software implementation needs improvement.
>

I decided that another good point of reference was the powerpc/psim BSP. It
reports the benchmarks in instructions:

(1) rtems_clock_tick: only case - 229
(2) rtems interrupt: entry overhead returns to interrupted task - 102
(3) rtems interrupt: exit overhead returns to interrupted task - 95
(4) rtems interrupt: entry overhead returns to nested interrupt - 105
(5) rtems interrupt: exit overhead returns to nested interrupt - 85

229 + 102 + 95 = 426 instructions.

That seems roughly in line with the erc32, which is 1 cycle for all
instructions except loads, which are 3, and stores, which are 2. And the SPARC
has register windows, so entering and exiting an ISR can potentially save
and restore a lot of registers.

So I am still leaning toward Pavel's explanation that some primitive operation
is really inefficient.


>
>
>> >The clock tick test has 100 tasks but it looks like they are blocked on a
>> >semaphore without timeout.
>>
>> >Your times look WAY too high. Maybe the interrupt is stuck on and
>> >not being cleared.
>>
>> >On the erc32, a nominal "nothing to do clock tick" would be 1+2+3 from
>> >above or 52+12+4 = 68 microseconds. 68 * 15 = 1020 machine cycles.
>> >So at a higher clock rate, it should be even less time.
>>
>> >My gut feeling is that I think something is wrong with the ISR handler
>> >and it is stuck. But the overhead is definitely way too high.
>>
>> >--joel
>>
>> (Sorry if the format got somewhat garbled, anything but top-posting has
>> to be done manually...)
>>
>> I re-tested my case using -O3 optimization (we have been using -O0
>> during development for debugging purposes) and I got a good performance
>> boost, but I'm still nowhere near your numbers. I can vouch that the
>> interrupt (exception really) isn't stuck, but that the code unfortunately
>> takes a long time to compute. I have a subsecond counter (1/16 of a second)
>> which I'm sampling at various places in the code, storing its numbers to a
>> buffer in memory so as to interfere with the program as little as possible.
>>
>> With -O3, tick handling still takes ~320 us to perform, but the weight
>> has now shifted. tc_windup takes ~214 us and the rest is obviously
>> _Watchdog_Tick(). When fragmenting the tc_windup function to find the worst
>> speed bumps, the biggest contribution (~122 us) seems to be coming from scale
>> factor recalculation. Since it's 64 bits, it's turned into a software
>> function which can be quite time-consuming apparently.
>>
>> Even though _Watchdog_Tick() "only" takes ~100 us now, it still sounds
>> much higher than your total tick with a slower system (we're running at 50
>> MHz).
>>
>> Is there anything we can do to improve these numbers? Is Clock_isr
>> intended to be run uninterrupted as it is now? Can't see that much of the
>> BSP patch code has anything to do with the speed of what I'm looking at
>> right now...

Re: Time spent in ticks...

2016-10-13 Thread Joel Sherrill
On Thu, Oct 13, 2016 at 11:21 AM, Jakob Viketoft <
jakob.viket...@aacmicrotec.com> wrote:

>
> From: Joel Sherrill [j...@rtems.org]
> Sent: Thursday, October 13, 2016 17:38
> To: Jakob Viketoft
> Cc: devel@rtems.org
> Subject: Re: Time spent in ticks...
>
> >I don't have an or1k handy, so I ran on a sparc/erc32 simulator.
> >It is a SPARC v7 at 15 MHz.
>
> >These times are in microseconds and based on the tmtests.
> >Specifically tm08 and tm27.
>
> >(1) rtems_clock_tick: only case - 52
> >(2) rtems interrupt: entry overhead returns to interrupted task - 12
> >(3) rtems interrupt: exit overhead returns to interrupted task - 4
> >(4) rtems interrupt: entry overhead returns to nested interrupt - 11
> >(5) rtems interrupt: exit overhead returns to nested interrupt - 3
>
>
The above was from the master with SMP enabled. I repeated it with
SMP disabled and it had no impact.

Since the timing change is post 4.11, I decided to try 4.11 with SMP
disabled:

rtems_clock_tick: only case - 42
rtems interrupt: entry overhead returns to interrupted task - 11
rtems interrupt: exit overhead returns to interrupted task - 4
rtems interrupt: entry overhead returns to nested interrupt - 11
rtems interrupt: exit overhead returns to nested interrupt - 3

So 42 + 12 + 4 = 58 microseconds, 58 * 15 = 870 cycles

So the overhead has gone up some, but as Pavel says it is quite likely that
some mathematical operation on 64-bit types is slow on your CPU.

HINT: If you can write a benchmark for 64-bit operations,
it would be a good comparison between CPUs and might
highlight where the software implementation needs improvement.
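
A minimal sketch of such a benchmark (illustrative only; on target hardware the
clock() calls would be replaced by reads of a free-running hardware counter):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* volatile operands and sink keep the compiler from hoisting or
 * folding the loops, so the soft 64-bit routines actually run */
static volatile uint64_t a = 0x0123456789abcdefULL;
static volatile uint64_t b = 0x100000003ULL;
static volatile uint64_t sink;

int main(void)
{
  enum { N = 100000 };
  clock_t t0, t1;
  int i;

  t0 = clock();
  for (i = 0; i < N; ++i)
    sink = a * b;                   /* 64-bit multiply */
  t1 = clock();
  printf("mul64: %ld clocks for %d ops\n", (long)(t1 - t0), N);

  t0 = clock();
  for (i = 0; i < N; ++i)
    sink = a / b;                   /* 64-bit divide */
  t1 = clock();
  printf("div64: %ld clocks for %d ops\n", (long)(t1 - t0), N);

  return 0;
}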


> >The clock tick test has 100 tasks but it looks like they are blocked on a
> >semaphore without timeout.
>
> >Your times look WAY too high. Maybe the interrupt is stuck on and
> >not being cleared.
>
> >On the erc32, a nominal "nothing to do clock tick" would be 1+2+3 from
> >above or 52+12+4 = 68 microseconds. 68 * 15 = 1020 machine cycles.
> >So at a higher clock rate, it should be even less time.
>
> >My gut feeling is that I think something is wrong with the ISR handler
> >and it is stuck. But the overhead is definitely way too high.
>
> >--joel
>
> (Sorry if the format got somewhat garbled, anything but top-posting has
> to be done manually...)
>
> I re-tested my case using -O3 optimization (we have been using -O0
> during development for debugging purposes) and I got a good performance
> boost, but I'm still nowhere near your numbers. I can vouch that the
> interrupt (exception really) isn't stuck, but that the code unfortunately
> takes a long time to compute. I have a subsecond counter (1/16 of a second)
> which I'm sampling at various places in the code, storing its numbers to a
> buffer in memory so as to interfere with the program as little as possible.
>
> With -O3, tick handling still takes ~320 us to perform, but the weight
> has now shifted. tc_windup takes ~214 us and the rest is obviously
> _Watchdog_Tick(). When fragmenting the tc_windup function to find the worst
> speed bumps, the biggest contribution (~122 us) seems to be coming from scale
> factor recalculation. Since it's 64 bits, it's turned into a software
> function which can be quite time-consuming apparently.
>
> Even though _Watchdog_Tick() "only" takes ~100 us now, it still sounds much
> higher than your total tick with a slower system (we're running at 50 MHz).
>
> Is there anything we can do to improve these numbers? Is Clock_isr
> intended to be run uninterrupted as it is now? Can't see that much of the
> BSP patch code has anything to do with the speed of what I'm looking at
> right now...
>
>  /Jakob
>
>
>
> Jakob Viketoft
> Senior Engineer in RTL and embedded software
>
> ÅAC Microtec AB
> Dag Hammarskjölds väg 48
> SE-751 83 Uppsala, Sweden
>
> T: +46 702 80 95 97
> http://www.aacmicrotec.com
>

Re: Time spent in ticks...

2016-10-13 Thread Pavel Pisa
Hello Jakob,

On Thursday 13 of October 2016 18:21:05 Jakob Viketoft wrote:
> I re-tested my case using -O3 optimization (we have been using -O0
> during development for debugging purposes) and I got a good performance
> boost, but I'm still nowhere near your numbers. I can vouch that the
> interrupt (exception really) isn't stuck, but that the code unfortunately
> takes a long time to compute. I have a subsecond counter (1/16 of a second)
> which I'm sampling at various places in the code, storing its numbers to a
> buffer in memory so as to interfere with the program as little as possible.
>
> With -O3, tick handling still takes ~320 us to perform, but the weight
> has now shifted. tc_windup takes ~214 us and the rest is obviously
> _Watchdog_Tick(). When fragmenting the tc_windup function to find the worst
> speed bumps, the biggest contribution (~122 us) seems to be coming from scale
> factor recalculation. Since it's 64 bits, it's turned into a software
> function which can be quite time-consuming apparently.
>
> Even though _Watchdog_Tick() "only" takes ~100 us now, it still sounds much
> higher than your total tick with a slower system (we're running at 50 MHz).
>
> Is there anything we can do to improve these numbers? Is Clock_isr intended
> to be run uninterrupted as it is now? Can't see that much of the BSP patch
> code has anything to do with the speed of what I'm looking at right now...

the time measurement and the timer queues use 64-bit types for time
representation. When a higher time measurement resolution than the tick
is requested, this is a reasonable (optimal) choice, but it can be a problem
for 16-bit CPUs and some 32-bit ones as well.

How have you configured the or1k CPU? Do you have a hardware multiplier
and barrel shifter available, or only shift-by-one and a multiplier in SW?
Do the CFLAGS match the available instructions?

I am not sure if there is not a 64-bit division in the time computation
either. That would be a killer for your CPU. High-resolution
time sources and even tickless timer support can be implemented
with full scaling and adjustment using only shifts, additions and
multiplications in the hot paths.
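
A sketch of what such a hot path can look like (illustrative names, not the
RTEMS code): one division at initialization to build a fixed-point scale, then
only a multiply and a shift per conversion.

#include <stdint.h>

/* Precomputed once at init: 2^32 * 1e9 / counter frequency.
 * The single 64-bit divide lives here, never in the tick path. */
static uint64_t ns_per_count_q32;

static void time_scale_init(uint32_t counter_frequency)
{
  ns_per_count_q32 = ((uint64_t)1000000000 << 32) / counter_frequency;
}

/* Hot path: counter delta to nanoseconds with one multiply and one
 * shift. Valid while delta * scale fits in 64 bits, which holds for
 * deltas below one second at typical counter frequencies. */
static uint32_t counts_to_ns(uint32_t delta)
{
  return (uint32_t)(((uint64_t)delta * ns_per_count_q32) >> 32);
}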

I tried to understand the actual RTEMS time-keeping code
some time ago when nanosleep was introduced, and
I tried to analyze it, proposed some changes and compared
it to Linux. See the thread following these messages:

  https://lists.rtems.org/pipermail/devel/2016-August/015720.html

  https://lists.rtems.org/pipermail/devel/2016-August/015721.html

Some of the discussed changes to nanosleep have been implemented
already.

Generally, try to measure how many times multiplication
and division are called in the ISR.
I think that I am capable of designing an implementation
restricted to mul, add and shr that minimizes the number
of transformations, but if it is found that the RTEMS implementation
needs to be optimized/changed, then it is a task counted
in man-months.

Generally, if the tick interrupt lasts more than 10 (maybe 20) usec then
there is a problem. One source can be SW implementation inefficiency;
another is that the OS-selected and possibly application-required features
are beyond the selected CPU's capabilities.

Best wishes,


Pavel


RE: Time spent in ticks...

2016-10-13 Thread Jakob Viketoft

From: Joel Sherrill [j...@rtems.org]
Sent: Thursday, October 13, 2016 17:38
To: Jakob Viketoft
Cc: devel@rtems.org
Subject: Re: Time spent in ticks...

>I don't have an or1k handy, so I ran on a sparc/erc32 simulator.
>It is a SPARC v7 at 15 MHz.

>These times are in microseconds and based on the tmtests.
>Specifically tm08 and tm27.

>(1) rtems_clock_tick: only case - 52
>(2) rtems interrupt: entry overhead returns to interrupted task - 12
>(3) rtems interrupt: exit overhead returns to interrupted task - 4
>(4) rtems interrupt: entry overhead returns to nested interrupt - 11
>(5) rtems interrupt: exit overhead returns to nested interrupt - 3

>The clock tick test has 100 tasks but it looks like they are blocked on a
>semaphore without timeout.

>Your times look WAY too high. Maybe the interrupt is stuck on and
>not being cleared.

>On the erc32, a nominal "nothing to do clock tick" would be 1+2+3 from
>above or 52+12+4 = 68 microseconds. 68 * 15 = 1020 machine cycles.
>So at a higher clock rate, it should be even less time.

>My gut feeling is that I think something is wrong with the ISR handler
>and it is stuck. But the overhead is definitely way too high.

>--joel

(Sorry if the format got somewhat garbled, anything but top-posting has to
be done manually...)

I re-tested my case using -O3 optimization (we have been using -O0 during
development for debugging purposes) and I got a good performance boost, but I'm
still nowhere near your numbers. I can vouch that the interrupt (exception
really) isn't stuck, but that the code unfortunately takes a long time to 
compute. I have a subsecond counter (1/16 of a second) which I'm sampling at 
various places in the code, storing its numbers to a buffer in memory so as to 
interfere with the program as little as possible.

With -O3, tick handling still takes ~320 us to perform, but the weight has
now shifted. tc_windup takes ~214 us and the rest is obviously 
_Watchdog_Tick(). When fragmenting the tc_windup function to find the worst 
speed bumps, the biggest contribution (~122 us) seems to be coming from scale
factor recalculation. Since it's 64 bits, it's turned into a software function 
which can be quite time-consuming apparently.
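
For reference, the recalculation in question boils down to one 64-bit division
per windup. A paraphrase of the FreeBSD-derived fragment in kern_tc.c (field
names simplified and constants quoted from memory, so details may differ):

#include <stdint.h>

/* The timehands scale is a 64-bit fixed-point factor used to turn
 * counter deltas into time; rebuilding it divides by the counter
 * frequency, and on or1k that division runs in software. */
static uint64_t recompute_scale(int64_t adjustment, uint32_t tc_frequency)
{
  uint64_t scale = (uint64_t)1 << 63;
  scale += (adjustment / 1024) * 2199;  /* NTP-style frequency tweak */
  scale /= tc_frequency;                /* the expensive soft divide */
  return scale * 2;
}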

Even though _Watchdog_Tick() "only" takes ~100 us now, it still sounds much
higher than your total tick with a slower system (we're running at 50 MHz).

Is there anything we can do to improve these numbers? Is Clock_isr intended to 
be run uninterrupted as it is now? Can't see that much of the BSP patch code 
has anything to do with the speed of what I'm looking at right now...

 /Jakob


Jakob Viketoft
Senior Engineer in RTL and embedded software

ÅAC Microtec AB
Dag Hammarskjölds väg 48
SE-751 83 Uppsala, Sweden

T: +46 702 80 95 97
http://www.aacmicrotec.com

Re: Time spent in ticks...

2016-10-13 Thread Joel Sherrill
On Thu, Oct 13, 2016 at 3:51 AM, Jakob Viketoft <
jakob.viket...@aacmicrotec.com> wrote:

> Hello everyone,
>
> We're running on an or1k-based BSP off of 4.11 (with the patches I've
> forwarded in February last year) and have seen some strange sluggishness in
> the system. When measuring using a standalone peripheral clock, I can see
> that we spend between 0.8 and 1.4 ms just handling the tick. This sounds a
> bit absurd to me and I just wanted to send out a couple of questions to see
> if anyone has an inkling of what is going on. I haven't been able to test
> with the or1k-simulator (and the generic_or1k BSP) as it won't easily
> compile with a newer gcc, but I'm running on real hardware. The patches I
> made don't sound like big hold-ups to me either, but a second pair of eyes
> is of course always welcome.
>
> To the questions:
> 1. On the or1k CPU RTEMS BSP, timer ticks use the CPU-internal
> timer, which when timing out results in a timer exception. Clock_isr is
> installed as the exception handler for this and thus has complete control
> of the CPU for its duration. Is this how Clock_isr is intended to run,
> i.e. no other tasks or interrupts are allowed during tick handling? I just
> want to make sure there is no mismatch between the or1k setup in RTEMS and
> how Clock_isr is intended to run.
>
> 2. Running a very simple test application with three tasks, I delved into
> the _Timecounter_Tick part of the Clock_isr, and I have seen that tc_windup()
> uses ~340 us quite consistently and _Watchdog_Tick() uses ~630 us when
> all tasks are started. What numbers are seen on other systems, i.e. what
> should I expect as normal here? Any ideas on what could be wrong? I'll keep
> digging and try to discern any individual culprits as well.
>
>
I don't have an or1k handy, so I ran on a sparc/erc32 simulator.
It is a SPARC v7 at 15 MHz.

These times are in microseconds and based on the tmtests.
Specifically tm08 and tm27.

(1) rtems_clock_tick: only case - 52
(2) rtems interrupt: entry overhead returns to interrupted task - 12
(3) rtems interrupt: exit overhead returns to interrupted task - 4
(4) rtems interrupt: entry overhead returns to nested interrupt - 11
(5) rtems interrupt: exit overhead returns to nested interrupt - 3

The clock tick test has 100 tasks, but it looks like they are blocked on a
semaphore without timeout.

Your times look WAY too high. Maybe the interrupt is stuck on and
not being cleared.

On the erc32, a nominal "nothing to do clock tick" would be 1+2+3 from
above or 52+12+4 = 68 microseconds. 68 * 15 = 1020 machine cycles.
So at a higher clock rate, it should be even less time.

My gut feeling is that I think something is wrong with the ISR handler
and it is stuck. But the overhead is definitely way too high.

--joel


> Oh, and we use 1 as base for the tick quantum.
>
> (If anyone is interested in looking at our code, bsps and toolchains can
> be downloaded at repo.aacmicrotec.com.)
>
> Best regards,
>
>   /Jakob
>
>
> Jakob Viketoft
> Senior Engineer in RTL and embedded software
>
> ÅAC Microtec AB
> Dag Hammarskjölds väg 48
> SE-751 83 Uppsala, Sweden
>
> T: +46 702 80 95 97
> http://www.aacmicrotec.com

Time spent in ticks...

2016-10-13 Thread Jakob Viketoft
Hello everyone,

We're running on an or1k-based BSP off of 4.11 (with the patches I've forwarded 
in February last year) and have seen some strange sluggishness in the system. 
When measuring using a standalone peripheral clock, I can see that we spend 
between 0.8 and 1.4 ms just handling the tick. This sounds a bit absurd to me and
I just wanted to send out a couple of questions to see if anyone has an inkling 
of what is going on. I haven't been able to test with the or1k-simulator (and 
the generic_or1k BSP) as it won't easily compile with a newer gcc, but I'm 
running on real hardware. The patches I made don't sound like big hold-ups to 
me either, but a second pair of eyes is of course always welcome.

To the questions:
1. On the or1k CPU RTEMS BSP, timer ticks use the CPU-internal timer,
which when timing out results in a timer exception. Clock_isr is installed as
the exception handler for this and thus has complete control of the CPU for
its duration. Is this how Clock_isr is intended to run, i.e. no other tasks
or interrupts are allowed during tick handling? I just want to make sure there
is no mismatch between the or1k setup in RTEMS and how Clock_isr is intended
to run.

2. Running a very simple test application with three tasks, I delved into the
_Timecounter_Tick part of the Clock_isr, and I have seen that tc_windup()
uses ~340 us quite consistently and _Watchdog_Tick() uses ~630 us when all
tasks are started. What numbers are seen on other systems, i.e. what should
I expect as normal here? Any ideas on what could be wrong? I'll keep digging
and try to discern any individual culprits as well.

Oh, and we use 1 as base for the tick quantum.

(If anyone is interested in looking at our code, bsps and toolchains can be 
downloaded at repo.aacmicrotec.com.)

Best regards,

  /Jakob


Jakob Viketoft
Senior Engineer in RTL and embedded software

ÅAC Microtec AB
Dag Hammarskjölds väg 48
SE-751 83 Uppsala, Sweden

T: +46 702 80 95 97
http://www.aacmicrotec.com