Please watch video #54 on the NuttX Channel, where I explained how to use it.

I think we are missing documentation about it here:
Documentation/applications/system/stackmonitor/index.rst

Best Regards,

Alan
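[Editor's note: for anyone following along, enabling the stack monitor usually comes down to a few Kconfig options. This is a hedged sketch; the option names below come from mainline NuttX/apps and should be verified against your version.]

```text
# Pre-fill each stack with a known pattern so peak usage can be measured
CONFIG_STACK_COLORATION=y

# Build the apps/system/stackmonitor daemon ("stkmon" in NSH)
CONFIG_SYSTEM_STACKMONITOR=y

# Sampling interval in seconds (name assumed; check your Kconfig)
CONFIG_SYSTEM_STACKMONITOR_INTERVAL=2
```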

On Tue, Mar 12, 2024 at 9:15 AM yfliu2008 <yfliu2...@qq.com.invalid> wrote:

> Alan, thank you!
>
>
> Did you mean this SCHED_STACK_RECORD thing? I set 32 for that and can see
> output like below on the target:
>
>
> > remote> cat /proc/3/stack
> > StackAlloc: 0x7092000
> > StackBase:  0x7092050
> > StackSize:  4016
> > StackMax:   118042624
> > Size        Backtrace
> > StackUsed:  1552
>
>
>
>
> The "StackMax" above is 0x7093000 (118042624). But how can this work
> for short-lived threads like the "AppBringUp" thread?
>
>
>
> Regards,
> yf
>
>
>
>
> Original
>
>
>
> From: "Alan C. Assis" <acas...@gmail.com>
>
> Date: 2024/3/12 18:56
>
> To: "dev" <dev@nuttx.apache.org>
>
> Subject: Re: Re: Re: mm/mm_heap assertion error
>
>
> You can use the stack monitor to see the stack consumption.
>
> Best Regards,
>
> Alan
>
> On Tue, Mar 12, 2024 at 7:38 AM yfliu2008 wrote:
>
> > Dear experts,
> >
> >
> > After enlarging the stack size of the "AppBringUp" thread, the remote
> > node can boot NSH on RPMSGFS now. I am sorry for not trying this
> > earlier. I was browsing "rpmsgfs.c" blindly and noticed a few automatic
> > variables defined on the stack... then I thought it might be worth a
> > try, so I did it.
> >
> >
> > Now I am still unclear about why a small stack leads to heap
> > corruption. Also, how can we read this stack issue from the stackdump
> > logs? Let me know if you have any hints.
> >
> >
> > Regards,
> > yf
> >
> >
> > Original
> >
> > From: "yfliu2008" <yfliu2...@qq.com>
> >
> > Date: 2024/3/12 15:10
> >
> > To: "dev" <dev@nuttx.apache.org>
> >
> > Subject: Re: Re: mm/mm_heap assertion error
> >
> >
> > Nathan,
> >
> > Here I disabled RPMsg UART device initialization, but the crash still
> > happens; I don't see other options to disable for now. On the other
> > hand, if we choose not to mount NSH from the RPMSGFS, it can boot
> > smoothly, and after boot we can manually mount the RPMSGFS for
> > experimenting.
> >
> > I uploaded the logs, call stacks, and ELFs at
> > https://github.com/yf13/hello/tree/debug-logs/nsh-rpmsgfs . There are
> > two sets from two ELFs created from the same code base but with
> > different DEBUG_xx configs. The crash happens earlier in the build
> > with more debug options.
> >
> > Please let me know if you have any more suggestions.
> >
> > Regards,
> > yf
> >
> >
> > Original
> >
> > From: "Nathan Hartman" <hartman.nat...@gmail.com>
> >
> > Date: 2024/3/12 1:27
> >
> > To: "dev" <dev@nuttx.apache.org>
> >
> > Subject: Re: mm/mm_heap assertion error
> >
> >
> > What's needed is some way to binary search where the culprit is.
> >
> > If I understand correctly, it looks like the crash is happening in the
> > later stages of board bring-up? What is running before that? Can parts
> > be disabled or skipped to see if the problem goes away?
> >
> > Another idea is to try running a static analysis tool on the sources
> > and see if it finds anything suspicious to be looked into more
> > carefully.
> >
> >
> > On Mon, Mar 11, 2024 at 10:00 AM Gregory Nutt wrote:
> > >
> > > The reason that the error is confusing is that the error probably did
> > > not occur at the time of the assertion; it probably occurred much
> > > earlier.
> > >
> > > In most crashes due to heap corruption there are two players: the
> > > culprit and the victim threads.  The culprit thread actually causes
> > > the corruption.  But at the time of the corruption, no error occurs.
> > > The error will not occur until later.
> > >
> > > So sometime later, the victim thread runs, encounters the clobbered
> > > heap, and crashes.  In this case, "AppBringup" and "rptun" are
> > > potential victim threads.  The fact that they crash tells you very
> > > little about the culprit.
> > >
> > > On 3/10/2024 6:51 PM, yfliu2008 wrote:
> > > > Gregory, thank you for the analysis.
> > > >
> > > > The crashes happened during system boot, mostly in the "AppBringup"
> > > > or "rptun" threads, as per the assertion logs. The other existing
> > > > threads are the "idle" and "lpwork" threads, as per the sched logs.
> > > > There should be no other threads, as NSH creation is still ongoing.
> > > > As for interrupts, the UART and IPI are running in kernel space and
> > > > the MTIMER is in NuttSBI space. The NSH is loaded from an RPMSGFS
> > > > volume, so there is a lot of RPMSG communication.
> > > >
> > > > Is KASAN appropriate for use in kernel mode?
> > > >
> > > > With MM_KASAN_ALL it reports a read access error:
> > > >
> > > > kasan_report: kasan detected a read access error, address at
> > > > 0x708fe90, size is 8, return address: 0x701aeac
> > > >
> > > > _assert: Assertion failed panic: at file: kasan/kasan.c:117
> > > > task: Idle_Task process: Kernel 0x70023c0
> > > >
> > > > The call stack looks like:
> > > >
> > > > #0  _assert (filename=0x7060f78 "kasan/kasan.c", linenum=117, msg=0x7060ff0 "panic", regs=0x7082720
> > > > #2  0x00000000070141d6 in kasan_report (addr=0x708fe90, size=8, is_write=false, return_address=0x701aeac
> > > > #3  0x0000000007014412 in kasan_check_report (addr=0x708fe90, size=8, is_write=false, return_address=0x701aeac
> > > > #4  0x000000000701468c in __asan_load8_noabort (addr=0x708fe90) at kasan/kasan.c:315
> > > > #5  0x000000000701aeac in riscv_swint (irq=0, context=0x708fe40, arg=0x0) at common/riscv_swint.c:133
> > > > #6  0x000000000701b8fe in riscv_perform_syscall (regs=0x708fe40) at common/supervisor/riscv_perform_syscall.c:45
> > > > #7  0x0000000007000570 in sys_call6 ()
> > > >
> > > >
> > > > With MM_KASAN_DISABLE_READ_CHECKS=y, it reports:
> > > >
> > > > _assert: Assertion failed: at file: mm_heap/mm_malloc.c:245
> > > > task: rptun process: Kernel 0x704a030
> > > >
> > > > The call stack is:
> > > >
> > > > #0  _assert (filename=0x7056060 "mm_heap/mm_malloc.c", linenum=245, msg=0x0, regs=0x7082720
> > > > #2  0x0000000007013082 in mm_malloc (heap=0x7089c00, size=128) at mm_heap/mm_malloc.c:245
> > > > #3  0x0000000007011694 in kmm_malloc (size=128) at kmm_heap/kmm_malloc.c:51
> > > > #4  0x000000000704efd4 in metal_allocate_memory (size=128) at .../nuttx/include/metal/system/nuttx/alloc.h:27
> > > > #5  0x000000000704fd8a in rproc_virtio_create_vdev (role=1, notifyid=0, rsc=0x80200050, rsc_io=0x7080408,
> > > >     notify=0x704e6d2
> > > >     at open-amp/lib/remoteproc/remoteproc_virtio.c:356
> > > > #6  0x000000000704e956 in remoteproc_create_virtio (rproc=0x708ecd8, vdev_id=0, role=1, rst_cb=0x0) at open-amp/lib/remoteproc/remoteproc.c:957
> > > > #7  0x000000000704b1ee in rptun_dev_start (rproc=0x708ecd8) at rptun/rptun.c:757
> > > > #8  0x0000000007049ff8 in rptun_start_worker (arg=0x708eac0) at rptun/rptun.c:233
> > > > #9  0x000000000704a0ac in rptun_thread (argc=3, argv=0x7092010) at rptun/rptun.c:253
> > > > #10 0x000000000700437e in nxtask_start () at task/task_start.c:107
> > > >
> > > >
> > > > This looks like it is already corrupted.
> > > >
> > > > I also noticed there is an "mm_checkcorruption()" function; I am
> > > > not sure how to use it yet.
> > > >
> > > > Regards,
> > > > yf
> > > >
> > > >
> > > > Original
> > > >
> > > > From: "Gregory Nutt" <spudan...@gmail.com>
> > > >
> > > > Date: 2024/3/11 1:43
> > > >
> > > > To: "dev" <dev@nuttx.apache.org>
> > > >
> > > > Subject: Re: mm/mm_heap assertion error
> > > >
> > > > On 3/10/2024 4:38 AM, yfliu2008 wrote:
> > > > > Dear experts,
> > > > >
> > > > > When doing a regression check on K230 with a previously working
> > > > > kernel-mode configuration, I got an assertion error like below:
> > > > >
> > > > > #0  _assert (filename=0x704c598 "mm_heap/mm_malloc.c", linenum=245, msg=0x0, regs=0x7082730
> > > > > #2  0x00000000070110f0 in mm_malloc (heap=0x7089c00, size=112) at mm_heap/mm_malloc.c:245
> > > > > #3  0x000000000700fd74 in kmm_malloc (size=112) at kmm_heap/kmm_malloc.c:51
> > > > > #4  0x0000000007028d4e in elf_loadphdrs (loadinfo=0x7090550) at libelf/libelf_sections.c:207
> > > > > #5  0x0000000007028b0c in elf_load (loadinfo=0x7090550) at libelf/libelf_load.c:337
> > > > > #6  0x00000000070278aa in elf_loadbinary (binp=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at elf.c:257
> > > > > #7  0x00000000070293ea in load_absmodule (bin=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at binfmt_loadmodule.c:115
> > > > > #8  0x0000000007029504 in load_module (bin=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at binfmt_loadmodule.c:219
> > > > > #9  0x0000000007027674 in exec_internal (filename=0x704bca8 "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0, actions=0x0, attr=0x7090788, spawn=true) at binfmt_exec.c:98
> > > > > #10 0x000000000702779c in exec_spawn (filename=0x704bca8 "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0, actions=0x0, attr=0x7090788) at binfmt_exec.c:220
> > > > > #11 0x000000000700299e in nx_start_application () at init/nx_bringup.c:375
> > > > > #12 0x00000000070029f0 in nx_start_task (argc=1, argv=0x7090010) at init/nx_bringup.c:403
> > > > > #13 0x0000000007003f84 in nxtask_start () at task/task_start.c:107
> > > > >
> > > > > It looks like the mm/mm_heap data structure consistency was
> > > > > broken. As I am unfamiliar with these internals, I am looking
> > > > > forward to any hints about how to find the root cause.
> > > > >
> > > > > Regards,
> > > > >
> > > > > yf
> > >
> > > This does indicate heap corruption:
> > >
> > >      240       /* Node next must be alloced, otherwise it should be merged.
> > >      241        * Its prenode (the founded node) must be free and preceding should
> > >      242        * match with nodesize.
> > >      243        */
> > >      244
> > >      245       DEBUGASSERT(MM_NODE_IS_ALLOC(next) && MM_PREVNODE_IS_FREE(next) &&
> > >      246                   next->preceding == nodesize);
> > >
> > > Heap corruption normally occurs when there is a wild write outside
> > > of the allocated memory region.  These kinds of wild writes may
> > > clobber some other thread's data and directly or indirectly clobber
> > > the heap metadata.  Trying to traverse the damaged heap metadata is
> > > probably what triggers the assertion.
> > >
> > > Only a kernel thread or interrupt handler could damage the heap.
> > >
> > > The cause of this corruption can be really difficult to find because
> > > the reported error does not occur when the heap is damaged but may
> > > not manifest itself until sometime later.
> > >
> > > It is unlikely that anyone will be able to solve this by just
> > > talking about it.  It might be worth increasing some kernel thread
> > > stack sizes just to eliminate that common cause.
> >
> >
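[Editor's note: the advice above about enlarging stacks maps onto a handful of NuttX Kconfig knobs. This is a sketch from memory; the option names come from mainline NuttX and should be checked against the version in use, and the values are only illustrative starting points.]

```text
# Bring-up / init task stack (the "AppBringUp" path in this thread)
CONFIG_INIT_STACKSIZE=8192

# Default size used for kernel threads and tasks
CONFIG_DEFAULT_TASK_STACKSIZE=4096

# The "rptun" worker thread seen in the backtraces
CONFIG_RPTUN_STACKSIZE=4096

# Idle thread stack
CONFIG_IDLETHREAD_STACKSIZE=2048
```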
