Re: Port of project from NuttX 7.30 to 10.1 RC1: Unexpected IRQ

Sebastien Lorquet Fri, 28 May 2021 06:12:53 -0700

I'm not talking about renaming code symbols like up_anything to arm_anything, which makes sense and can be noticed quite easily at the link stage.

I'm talking about renaming a shell/make variable from EXTRADEFINES to EXTRAFLAGS. This is hidden, and the build system has no way to complain about anything missing.

Of course, this has no impact on any built-in board, because the change was made locally, so CI cannot detect this.


But this breaks ALL the custom boards people actually use for real projects.

And the relevant documentation is hidden at the bottom of some obsolete (NuttX 9) release notes, in a "concerns" section.

This doc is critical for anyone porting a custom board from pre-9 NuttX to the current one.
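
Since the build system cannot flag the stale name by itself, a host-side check is one way to catch it before building. The following is only a minimal sketch: the rename EXTRADEFINES to EXTRAFLAGS is the one discussed in this thread, while the function name, the rename table and the sample Make.defs lines are hypothetical and only there for illustration.

```python
import re

# Renames named in this thread (old name -> new name). Extend as needed.
RENAMED_MAKE_VARS = {"EXTRADEFINES": "EXTRAFLAGS"}


def find_stale_make_vars(text, renamed=RENAMED_MAKE_VARS):
    """Return (line_number, old_name, new_name) for each stale variable use."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore Makefile comments
        for old, new in renamed.items():
            # Whole-word match, so a longer identifier containing the old
            # name is not flagged by mistake.
            pattern = r"(?<![A-Za-z0-9_])" + re.escape(old) + r"(?![A-Za-z0-9_])"
            if re.search(pattern, code):
                hits.append((lineno, old, new))
    return hits


if __name__ == "__main__":
    # Hypothetical excerpt of an out-of-tree board Make.defs.
    sample = (
        "ARCHDEFINES =\n"
        "CFLAGS = $(ARCHCFLAGS) $(ARCHDEFINES) $(EXTRADEFINES) -pipe\n"
        "# EXTRADEFINES was renamed upstream\n"
    )
    for lineno, old, new in find_stale_make_vars(sample):
        print(f"line {lineno}: {old} should probably be {new}")
```

Run against the Make.defs of an out-of-tree board, this would report each line that still refers to the old variable. It does not prove the rest of the file is up to date.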


I can't believe I'm the only one in this situation...

Sebastien

On 27/05/2021 at 16:51, Alan Carvalho de Assis wrote:

I think a benefit of renaming many of those "up_something" to
"stm32_something", "esp32_something", etc. is that it is now easy for
software to find the right function.

I think many IDEs cannot handle function search correctly for NuttX
because they have no heuristics to know that if I'm searching for a
function inside a board or inside an arch, they shouldn't return a
function with the same name from another board or another arch.

So, at the end of the day, these modifications you are complaining
about will make life better for all users.

BR,

Alan

On 5/27/21, Sebastien Lorquet <sebast...@lorquet.fr> wrote:

I still wonder what the purpose of this variable rename is. Sorry to say,
but it just looks cosmetic while critically breaking everything that was
made before, and this kind of thing is a nightmare for migration when
you can't follow the project day to day. Boards can be external to the
project, and are a supported feature, so they should continue to work
reliably even if you change the internal sauce!

At one point there was too much traffic on the mailing list and I just
stopped reading it. I marked several hundred messages as read
without having the time to go through them. It seems that this change
was made during this time.

Sebastien

On 27/05/2021 at 09:38, Sebastien Lorquet wrote:

Boom, that was the extrastuff. The board now boots. We're going to run
a lot of functional tests to make sure everything is okay, but I no
longer get this strange hardfault at boot.

Thank you.

I did not find this page despite searching through a lot of
documentation, mainly the "official" ReadTheDocs-like documentation.

I suggest you link to this doc in the getting started manuals.

Sebastien


On 26/05/2021 at 18:42, Abdelatif Guettouche wrote:

Maybe this one could help:
https://cwiki.apache.org/confluence/display/NUTTX/NuttX+9.1#NuttX9.1-CompatibilityConcerns

I am using the flat (monolithic) build and I see no place that defines
this flag, at all.
I don't even see a place in the codebase that defines this flag.

__KERNEL__ is defined in tools/Config.mk (line:100)

The fact that mm_initialize only shows one region is weird... where is
the heap for the main RAM at 0x20000000?

CONFIG_MM_REGIONS needs to be set up correctly if you have multiple
heap regions.

On Wed, May 26, 2021 at 5:22 PM Sebastien Lorquet
<sebast...@lorquet.fr> wrote:

Hello,

Thanks for the remarks.

I am using the flat (monolithic) build and I see no place that defines
this flag, at all.

I don't even see a place in the codebase that defines this flag.

I see nothing related to mm, nor anything outdated in my Make.defs,
which is from my old setup, yes, but still similar to a recent one.

Sebastien

On 26/05/2021 at 18:08, raiden00pl wrote:

If you use CONFIG_BUILD_FLAT=y, make sure that __KERNEL__ flag is
set here:
https://github.com/apache/incubator-nuttx/blob/master/include/nuttx/mm/mm.h#L85


I remember that at some point I had a similar hardfault in mm that
didn't make sense, and it was due to an outdated board Make.defs.

Wed, 26 May 2021 at 17:21, Sebastien Lorquet <sebast...@lorquet.fr>
wrote:

Update: stack dump and register analysis are in fact pointing to a
crash in mm_alloc

I have enabled memory management debug:

mm_initialize: Heap: start=0x10000000 size=65536
mm_addregion: Region 1: base=0x10000154 size=65184
stm32_netinitialize: Enabling PHY power
stm32_netinitialize: PHY reset...
stm32_netinitialize: PHY reset done.
stm32_netinitialize: Configuring PHY int
F
mm_free: Freeing 0x70fb460b
irq_unexpected_isr: ERROR irq: 3
up_assert: Assertion failed at file:irq/irq_unexpectedisr.c line: 50
up_registerdump: R0: 00000001 2000737c c00000f2 08000101 00000000 00000000 00000000 200073c8
up_registerdump: R8: 00000000 00000000 00000000 00000000 00000000 200073c8 080126ad 080126f8
up_registerdump: xPSR: 21000000 PRIMASK: 00000000 CONTROL: 00000000
up_registerdump: EXC_RETURN: fffffff9
up_dumpstate: sp:         200072c8
up_dumpstate: stack base: 20007078
up_dumpstate: stack size: 00000400
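
Two values in this dump can be decoded mechanically: the IRQ number and EXC_RETURN. The sketch below assumes that the "irq" number printed by irq_unexpected_isr is the ARMv7-M exception number (which matches the thread calling this a hardfault) and uses the architecture-defined EXC_RETURN bit layout. The function names are hypothetical.

```python
# ARMv7-M system exception numbers (architecture-defined).
EXCEPTIONS = {
    1: "Reset",
    2: "NMI",
    3: "HardFault",
    4: "MemManage",
    5: "BusFault",
    6: "UsageFault",
    11: "SVCall",
    12: "DebugMonitor",
    14: "PendSV",
    15: "SysTick",
}


def exception_name(number):
    """Name an exception number; 16 and above are external interrupts."""
    if number in EXCEPTIONS:
        return EXCEPTIONS[number]
    if number >= 16:
        return f"external interrupt {number - 16}"
    return "reserved"


def decode_exc_return(value):
    """Decode the low bits of an ARMv7-M EXC_RETURN value."""
    if value & 0xFFFFFFE0 != 0xFFFFFFE0:
        raise ValueError(f"not an EXC_RETURN value: {value:#010x}")
    return {
        "mode": "Thread" if value & 0x8 else "Handler",  # bit 3
        "stack": "PSP" if value & 0x4 else "MSP",        # bit 2
        "fp_frame": not (value & 0x10),                  # bit 4 clear: FP state stacked
    }


if __name__ == "__main__":
    print(exception_name(3))              # value taken from the log above
    print(decode_exc_return(0xFFFFFFF9))  # value taken from the log above
```

For the values in the log, exception 3 decodes as HardFault, and EXC_RETURN fffffff9 as a fault taken from Thread mode on the main stack with no floating-point frame.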

The fact that mm_initialize only shows one region is weird... where is
the heap for the main RAM at 0x20000000?

the mm_free(0x70fb460b) is not what causes the hardfault (it comes
later), but what the hell is this invalid address!
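
A quick way to confirm that such a pointer cannot be valid is to test it against the RAM regions of the chip. This is a rough sketch only: the first region comes from the mm_initialize log line above, while the size of the main RAM at 0x20000000 is an assumption to be checked against the datasheet or linker script. The region names and the helper are hypothetical.

```python
# (name, base, size) of each RAM region.
REGIONS = [
    ("CCM", 0x10000000, 65536),        # from the mm_initialize log line
    ("SRAM", 0x20000000, 192 * 1024),  # ASSUMED size, verify for your chip
]


def region_of(addr, regions=REGIONS):
    """Return the name of the region containing addr, or None."""
    for name, base, size in regions:
        if base <= addr < base + size:
            return name
    return None


if __name__ == "__main__":
    # Addresses taken from the log and gdb session in this message.
    for addr in (0x10000154, 0x200060B4, 0x70FB460B):
        print(f"{addr:#010x}: {region_of(addr)}")
```

Under these assumptions the pointer passed to mm_free falls in no RAM region at all, which fits the reading that the delay list was read from uninitialized or mislaid heap data rather than from a real allocation.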

This is the first call to mm_free, here is the backtrace:

Breakpoint 1, mm_free (heap=0x200060b4 <g_mmheap>, mem=0x70fb460b) at mm_heap/mm_free.c:85
85        if (!mem)
(gdb) bt
#0  mm_free (heap=0x200060b4 <g_mmheap>, mem=0x70fb460b) at mm_heap/mm_free.c:85
#1  0x0801264a in mm_free_delaylist (heap=0x200060b4 <g_mmheap>) at mm_heap/mm_malloc.c:82
#2  0x08012672 in mm_malloc (heap=0x200060b4 <g_mmheap>, size=24) at mm_heap/mm_malloc.c:115
#3  0x08012a32 in mm_zalloc (heap=0x200060b4 <g_mmheap>, size=24) at mm_heap/mm_zalloc.c:45
#4  0x080123ac in zalloc (size=24) at umm_heap/umm_zalloc.c:68
#5  0x080399fa in inode_alloc (name=0x8059a78 "") at inode/fs_inodereserve.c:78
#6  0x08039a5c in inode_root_reserve () at inode/fs_inodereserve.c:129
#7  0x080398cc in inode_initialize () at inode/fs_inode.c:92
#8  0x08039284 in fs_initialize () at fs_initialize.c:47
#9  0x08007eb4 in nx_start () at init/nx_start.c:600
#10 0x0800421e in __start () at chip/stm32_start.c:338

As previously analyzed, this happens in fs_initialize through
inode_root_reserve, so I was on the right track.

Caller shows mm_free called with that weird address:

(gdb) f 1
#1  0x0801264a in mm_free_delaylist (heap=0x200060b4 <g_mmheap>) at mm_heap/mm_malloc.c:82
82            mm_free(heap, address);
(gdb) list
77
78            /* The address should always be non-NULL since that was checked in the
79             * 'while' condition above.
80             */
81
82            mm_free(heap, address); <-- address == 0x70fb460b
83          }
84      #endif
85      }
86

(gdb) print &g_mmheap
$3 = (struct mm_heap_s *) 0x200060b4 <g_mmheap>
(gdb) print g_mmheap
$4 = {mm_impl = 0x0}

this is not good!

This is not a timing or IRQ related issue but a heap issue.

R15 = 080126f8 translates to here:


https://github.com/apache/incubator-nuttx/blob/master/mm/mm_heap/mm_malloc.c#L199



=> this free() has corrupted a badly initialized heap, and the next
malloc fails, giving a hardfault because that address is invalid.

Horrific mess!

==>

I think that my old board code does not initialize the board
properly. I probably have to check for differences between my code
and the stm32f429i-disco built-in board (on which I based my board).

Sebastien

On 25/05/2021 at 21:26, Nathan Hartman wrote:

On Tue, May 25, 2021 at 12:02 PM Sebastien Lorquet
<sebast...@lorquet.fr> wrote:

Back to business

After this we managed to recompile our project using the latest NuttX
sources, but it fails when trying to init the PHY irq on our STM32F427
board: We get "unexpected IRQ".

Yes I know that's pretty vague :-)

Is there anything obvious I should have been careful with in this
domain, before I dig out the JTAG probe to fix it (tomorrow)?

I would first start by looking through the Release Notes between v7.30
and v10.1. Many big improvements and bug fixes happened and some of
them are mentioned in Compatibility Concerns along with some changes
you might need to make to configuration etc.

Also another thing you can try: Has this board and PHY worked
correctly with v7.30? If so, you can bisect and with very few tests
(I'm guessing fewer than 20) find the exact commit that broke it.
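
The "fewer than 20" estimate follows from bisection halving the candidate range at every test, so the number of tests grows roughly with the base-2 logarithm of the number of commits. A small sketch of that arithmetic, with a hypothetical function name and made-up commit counts (the real number of commits between v7.30 and v10.1 is not stated in this thread):

```python
def bisect_steps(num_commits):
    """Worst-case number of halvings to isolate one commit among num_commits.

    Equals ceil(log2(num_commits)), computed with integers only.
    """
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (num_commits - 1).bit_length()


if __name__ == "__main__":
    # Illustrative commit counts only, not measured on the NuttX repository.
    for n in (1_000, 10_000, 100_000):
        print(f"{n} commits -> about {bisect_steps(n)} tests")
```

A real bisect on a history with merges can need a test or two more or less than this, but the order of magnitude holds.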

Release notes are hard to read but I did not find anything special
about phy interrupts.

Note that it may not be the phy interrupt. Here is my log:

stm32_netinitialize: Enabling PHY power
stm32_netinitialize: PHY reset...
stm32_netinitialize: PHY reset done.
stm32_netinitialize: Configuring PHY int
F
irq_unexpected_isr: ERROR irq: 3
up_assert: Assertion failed at file:irq/irq_unexpectedisr.c line: 50
up_registerdump: R0: 00000001 2000737c c00000f2 08000101 00000000 00000000 00000000 200073c8
up_registerdump: R8: 00000000 00000000 00000000 00000000 00000000 200073c8 080126ad 080126f8
up_registerdump: xPSR: 21000000 PRIMASK: 00000000 CONTROL: 00000000
up_registerdump: EXC_RETURN: fffffff9

A lot of OS initialization things happen at that point, marked by the
letter F.

It seems that an unexpected IRQ happens in this interval, around the
time the filesystem is initialized. The backtrace goes down to memory
allocation routines through the initialization of the root inode.

My guess is that AN external IRQ is triggered (possibly not the
PHY IRQ)
but the ISR handler for that one is not ready yet. I will add debug
messages.


I would expect that situation to be a simple NOP, but it seems that
undefined handlers are set to this function "irq_unexpected_isr"

Is that a new behaviour? A default config that I did not set properly
when porting our old defconfig?

Sebastien

Nathan

Did you try disabling the PHY (or networking) in Kconfig to see if
removing it from the build will eliminate the hardfault?

Have you seen this about hardfault debugging:

https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=139629445#content/view/139629445

Nathan
