Re: Random lwIP Crashes in _POSIX_Mutex_Lock_support()

Isaac Gutekunst Fri, 23 Oct 2015 06:06:24 -0700

On 10/23/2015 08:50 AM, Joel Sherrill wrote:



On October 23, 2015 7:02:34 AM CDT, Joel Sherrill <joel.sherr...@oarcorp.com> 
wrote:



On October 22, 2015 1:37:18 PM CDT, Isaac Gutekunst
<isaac.guteku...@vecna.com> wrote:

I think I may have some information that's actually useful.

I've managed to actually execute some tests.... and lots of them are
failing.

sp01 and sp02 fail quite quickly, as an assertion fails.

assertion "first != _Chain_Tail( &ready_queues[ index ] )" failed:

file


"../../cpukit/../../../stm32f7x/lib/
include/rtems/score/schedulerpriorityimpl.h", line
166, function: _Scheduler_priority_Ready_queue_first

This failure is common to many of the failed tests so far. What does
this mean?


Does hello run?

Yes

Yes. Here are the results of running all the samples:


RTEMS Testing - Tester, v0.2.0
[1/9] p:0 f:0 t:0 i:0 | arm/stm32f7x: base_sp.exe
[2/9] p:1 f:0 t:0 i:0 | arm/stm32f7x: capture.exe
[3/9] p:1 f:1 t:0 i:0 | arm/stm32f7x: fileio.exe
[4/9] p:2 f:1 t:0 i:0 | arm/stm32f7x: hello.exe
[5/9] p:3 f:1 t:0 i:0 | arm/stm32f7x: minimum.exe
[6/9] p:4 f:1 t:0 i:0 | arm/stm32f7x: nsecs.exe
[7/9] p:4 f:2 t:0 i:0 | arm/stm32f7x: paranoia.exe
[8/9] p:5 f:2 t:0 i:0 | arm/stm32f7x: ticker.exe
[9/9] p:5 f:3 t:0 i:0 | arm/stm32f7x: unlimited.exe
Passed: 5
Failed: 4
Timeouts: 0
Invalid: 0
Total: 9
Average test time: 0:00:13.072150
Testing time : 0:01:57.649357

Sorry to follow up so quickly.  Set your console driver to polling mode. Run 
hello. It should work before you move on. There should be no interrupts at all.

Then turn console interrupts on and get it running again. There should only be 
console interrupts.

Then move on to ticker. Clock tick and console if not in polled.


Ticker does not seem to work. This is a concrete area to investigate. I have 
some suspicions.


Did all interrupt handlers get installed using RTEMS ISR installation method? 
Are there any which do not use this?

I think they all use the RTEMS method. There might be some lurking in the HAL code from ST. Ifso, that's a bit scary and could maybe explain a lot.


For all RTEMS standard tests, there should be no drivers except those 
explicitly configured. All other devices may need to be tickled to be quiet but 
usually are not touched or referenced.

If the minimum test has driver symbols in it, it shouldn't. :)

We'll clean that up.

Isaac

On 10/22/2015 09:16 AM, Jay Doyle wrote:



On 10/22/2015 01:40 AM, Sebastian Huber wrote:



On 21/10/15 15:48, Jay Doyle wrote:



On 10/21/2015 09:35 AM, Sebastian Huber wrote:



On 21/10/15 15:08, Isaac Gutekunst wrote:



On 10/21/2015 09:00 AM, Sebastian Huber wrote:



On 21/10/15 14:56, Isaac Gutekunst wrote:

On 10/21/2015 08:24 AM, Sebastian Huber wrote:



On 21/10/15 14:13, Isaac Gutekunst wrote:

Thanks for the reply.

On 10/21/2015 01:50 AM, Sebastian Huber wrote:



On 20/10/15 16:02, Isaac Gutekunst wrote:

[...]


As far as I can tell this would only occur if the caller

of

pthread_mutex_lock was in a
"bad"
state. I don't believe it is in an interrupt context, and
don't know what other bad states
could exist.


We have

#define _CORE_mutex_Check_dispatch_for_seize(_wait) \
    (!_Thread_Dispatch_is_enabled() \
      && (_wait) \
      && (_System_state_Get() >= SYSTEM_STATE_UP))

What is the thread dispatch disable level and the system

state

at this point?

In case the thread dispatch disable level is not zero, then
something is probably broken
in the
operating system code which is difficult to find. Could be

general memory corruption
problem
too. Which RTEMS version do you use?


The thread dispatch disable level is usually -1 or -2.
(0xFFFFFFFE or 0xFFFFFFD).


A negative value is very bad, but easy to detect via manual
instrumentation (only an hand full
of spots touch this variable) or hardware
breakpoints/watchpoints. Looks the rest of
_Per_CPU_Information all right?

It looks like it's only the thread_dispatch_disable_level

that's

broken.

We'll go and grep for all places for all the places it's

touched,

and look for something.

The problem with watchpoints is they fire exceptionally often,

and

putting in a conditional
watchpoint slows the code to a crawl, but that may be worth

it.


Here are some printouts of the relevant structs right after a

crash:


$4 = {
   cpu_per_cpu = {<No data fields>},
   isr_nest_level = 0,
   thread_dispatch_disable_level = 4294967295,
   executing = 0xc01585c8,
   heir = 0xc0154038,
   dispatch_necessary = true,
   time_of_last_context_switch = {
     sec = 2992,
     frac = 10737447432380511034
   },
   Stats = {<No data fields>}
}


No, this doesn't look good. According to the stack trace you

are

in

thread context. However, we
have executing != heir and dispatch_necessary == true. This is

broken state itself. I guess,
something is wrong with the interrupt level so that a context
switch is blocked. On ARMv7-M
this is done via the system call exception.

This is a bit beyond my RTEMS knowledge. What would you advise
looking into?


I would try to instrument the code to figure out where the thread
dispatch disable level goes negative.


We just did.  I added a check in _ARMV7M_Interrupt_service_leave

to

see if the _Thread_Dispatch_disable_level is positive before the
decrementing it and this eventually fails.

I'm not sure if this tells us much because I think the call itself
correct.  In this particular case it is processing an I2C

interrupt.

I will try to see if we can capture information about the sequence

of

changes to the _Thread_Dispatch_disable_level just before the

point

in

which we know something is clearly wrong (i.e., decreasing it

below

zero.)


Since the isr_nest_level is 0, I don't think its a problem with the

spots that use

_ARMV7M_Interrupt_service_leave(). Did you check the interrupt

priorities? See also


https://lists.rtems.org/pipermail/users/2015-June/029155.html

Thanks for the pointer to this posting.  It seems like a very

similar

situation to what we are

experiencing -- especially considering that we invoke an RTEMS call

in our ethernet isr.

Unfortunately, all our interrupts use the default interrupt priority

level set in the bsp

header file as:

#define BSP_ARMV7M_IRQ_PRIORITY_DEFAULT (13 << 4)

which should be mean that they are all non-NMIs unless we explicitly

set their interrupt level

lower.




_______________________________________________
devel mailing list
devel@rtems.org
http://lists.rtems.org/mailman/listinfo/devel

_______________________________________________
devel mailing list
devel@rtems.org
http://lists.rtems.org/mailman/listinfo/devel


--joel
_______________________________________________
devel mailing list
devel@rtems.org
http://lists.rtems.org/mailman/listinfo/devel


--joel

_______________________________________________
devel mailing list
devel@rtems.org
http://lists.rtems.org/mailman/listinfo/devel

Re: Random lwIP Crashes in _POSIX_Mutex_Lock_support()

Reply via email to