Re: Re:Re: mm/mm_heap assertion error

Alan C. Assis Tue, 12 Mar 2024 03:57:43 -0700

You can use the stack monitor to see the stack consumption.

Best Regards,


Alan

On Tue, Mar 12, 2024 at 7:38 AM yfliu2008 <[email protected]> wrote:

> Dear experts,
>
>
>
> After enlarging the stack size of "AppBringUp"&nbsp; thread, the remote
> node can boot NSH on RPMSGFS now. I am sorry for not trying this earlier. I
> was browsing the "rpmsgfs.c" blindly and noticed a few auto variables
> defined in the stack... then I thought it might worth a try so I did it.
>
>
> Now I am still unclear about why small stack leads to heap corruption?
> Also how we can read this stack issue&nbsp; from stackdump logs? Let me
> know if you have any hints.
>
>
> Regards,
> yf
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Original
>
>
>
> From:"yfliu2008"< [email protected] &gt;;
>
> Date:2024/3/12 15:10
>
> To:"dev"< [email protected] &gt;;
>
> Subject:Re:Re: mm/mm_heap assertion error
>
>
> Nathan,
>
>
> Here I disabled RPMsg UART device initialization but the crash still
> happens, I don't see other options to disable for now. On the other hand,
> if we choose not mounting NSH from the RPMSGFS, it can boot smoothly and
> after boot we can manually mount the RPMSGFS for playing.
>
>
>
>
> I uploaded the logs, callstacks and ELFs at
> https://github.com/yf13/hello/tree/debug-logs/nsh-rpmsgfs . There are two
> sets from two ELFs created from same code base but with different DEBUG _xx
> configs.&nbsp; The crash happens earlier in the build with more debug
> options.
>
>
> Please let me know if you have any more suggestions.
>
>
> Regards,
> yf
>
>
>
>
> Original
>
>
>
> From:"Nathan Hartman"< [email protected] &gt;;
>
> Date:2024/3/12 1:27
>
> To:"dev"< [email protected] &gt;;
>
> Subject:Re: mm/mm_heap assertion error
>
>
> What's needed is some way to binary search where the culprit is.
>
> If I understand correctly, it looks like the crash is happening in the
> later stages of board bring-up? What is running before that? Can parts
> be disabled or skipped to see if the problem goes away?
>
> Another idea is to try running a static analysis tool on the sources
> and see if it finds anything suspicious to be looked into more
> carefully.
>
>
> On Mon, Mar 11, 2024 at 10:00 AM Gregory Nutt  wrote:
> &gt;
> &gt; The reason that the error is confusing is because the error probably
> did
> &gt; not occur at the time of the assertion; it probably occurred much
> earlier.
> &gt;
> &gt; In most crashes due to heap corruption there are two players:  the
> &gt; culprit and the victim threads.  The culprit thread actually cause the
> &gt; corruption.  But at the time of the corruption, no error occurs.  The
> &gt; error will not occur until later.
> &gt;
> &gt; So sometime later, the victim thread runs, encounters the clobbered
> heap
> &gt; and crashes.  In this case, "AppBringup" and "rptun" are potential
> &gt; victim threads.  The fact that they crash tell you very little about
> the
> &gt; culprit.
> &gt;
> &gt; On 3/10/2024 6:51 PM, yfliu2008 wrote:
> &gt; &gt; Gregory, thank you for the analysis.
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; The crashes happened during system booting up, mostly at
> "AppBringup" or "rptun" threads, as per the assertion logs. The other
> threads existing are the "idle" and the "lpwork" threads as per the sched
> logs. There should be no other threads as NSH creation is still
> ongoing.&nbsp; As for interruptions, the UART and IPI are running in kernel
> space and MTIMER are in NuttSBI space.&nbsp; The NSH is loaded from a
> RPMSGFS volume, thus there are a lot RPMSG communications.
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; Is the KASAN proper for use in Kernel mode?&nbsp;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; With MM_KASAN_ALL it reports a read access error:
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; BCkasan_report: kasan detected a read access error, address at
> 0x708fe90,size is 8, return address: 0x701aeac
> &gt; &gt;
> &gt; &gt; _assert: Assertion failed panic: at file: kasan/kasan.c:117
> task: Idle_Task process: Kernel 0x70023c0
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; The call stack looks like:
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; #0 &nbsp;_assert (filename=0x7060f78 "kasan/kasan.c",
> linenum=117, msg=0x7060ff0 "panic", regs=0x7082720 &gt; &gt; #2
> &nbsp;0x00000000070141d6 in kasan_report (addr=0x708fe90, size=8,
> is_write=false, return_address=0x701aeac &gt; &gt; #3
> &nbsp;0x0000000007014412 in kasan_check_report (addr=0x708fe90, size=8,
> is_write=false, return_address=0x701aeac &gt; &gt; #4
> &nbsp;0x000000000701468c in __asan_load8_noabort (addr=0x708fe90) at
> kasan/kasan.c:315
> &gt; &gt; #5 &nbsp;0x000000000701aeac in riscv_swint (irq=0,
> context=0x708fe40, arg=0x0) at common/riscv_swint.c:133
> &gt; &gt; #6 &nbsp;0x000000000701b8fe in riscv_perform_syscall
> (regs=0x708fe40) at common/supervisor/riscv_perform_syscall.c:45
> &gt; &gt; #7 &nbsp;0x0000000007000570 in sys_call6 ()
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; With MM_KASAN_DISABLE_READ_CHECKS=y, it reports:
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; _assert: Assertion failed : at file: mm_heap/mm_malloc.c:245
> task: rptun process: Kernel 0x704a030
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; The call stack is:
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; #0 &nbsp;_assert (filename=0x7056060 "mm_heap/mm_malloc.c",
> linenum=245, msg=0x0, regs=0x7082720 &gt; &gt; #2 &nbsp;0x0000000007013082
> in mm_malloc (heap=0x7089c00, size=128) at mm_heap/mm_malloc.c:245
> &gt; &gt; #3 &nbsp;0x0000000007011694 in kmm_malloc (size=128) at
> kmm_heap/kmm_malloc.c:51
> &gt; &gt; #4 &nbsp;0x000000000704efd4 in metal_allocate_memory (size=128)
> at .../nuttx/include/metal/system/nuttx/alloc.h:27
> &gt; &gt; #5 &nbsp;0x000000000704fd8a in rproc_virtio_create_vdev (role=1,
> notifyid=0,
> &gt; &gt; &nbsp; &nbsp; rsc=0x80200050, rsc_io=0x7080408 &gt; &gt; &nbsp;
> &nbsp; notify=0x704e6d2 &gt; &gt; &nbsp; &nbsp; at
> open-amp/lib/remoteproc/remoteproc_virtio.c:356
> &gt; &gt; #6 &nbsp;0x000000000704e956 in remoteproc_create_virtio
> (rproc=0x708ecd8,
> &gt; &gt; &nbsp; &nbsp; vdev_id=0, role=1, rst_cb=0x0) at
> open-amp/lib/remoteproc/remoteproc.c:957
> &gt; &gt; #7 &nbsp;0x000000000704b1ee in rptun_dev_start (rproc=0x708ecd8)
> &gt; &gt; &nbsp; &nbsp; at rptun/rptun.c:757
> &gt; &gt; #8 &nbsp;0x0000000007049ff8 in rptun_start_worker (arg=0x708eac0)
> &gt; &gt; &nbsp; &nbsp; at rptun/rptun.c:233
> &gt; &gt; #9 &nbsp;0x000000000704a0ac in rptun_thread (argc=3,
> argv=0x7092010)
> &gt; &gt; &nbsp; &nbsp; at rptun/rptun.c:253
> &gt; &gt; #10 0x000000000700437e in nxtask_start () at
> task/task_start.c:107
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; This looks like already corrupted.
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; I also noticed there is a "mm_checkcorruption()" function, not
> sure how to use it yet.
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; Regards,
> &gt; &gt; yf
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; Original
> &gt; &gt;
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; From:"Gregory Nutt"< [email protected] &gt;;
> &gt; &gt;
> &gt; &gt; Date:2024/3/11 1:43
> &gt; &gt;
> &gt; &gt; To:"dev"< [email protected] &gt;;
> &gt; &gt;
> &gt; &gt; Subject:Re: mm/mm_heap assertion error
> &gt; &gt;
> &gt; &gt;
> &gt; &gt; On 3/10/2024 4:38 AM, yfliu2008 wrote:
> &gt; &gt; &gt; Dear experts,
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt; When doing regression check on K230 with a previously
> working Kernel mode configuration, I got assertion error like below:
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt; #0 &nbsp;_assert (filename=0x704c598 "mm_heap/mm_malloc.c",
> linenum=245, msg=0x0,regs=0x7082730 &gt; #2 &nbsp;0x00000000070110f0 in
> mm_malloc (heap=0x7089c00, size=112) at mm_heap/mm_malloc.c:245
> &gt; &gt; &gt; #3 &nbsp;0x000000000700fd74 in kmm_malloc (size=112) at
> kmm_heap/kmm_malloc.c:51
> &gt; &gt; &gt; #4 &nbsp;0x0000000007028d4e in elf_loadphdrs
> (loadinfo=0x7090550) at libelf/libelf_sections.c:207
> &gt; &gt; &gt; #5 &nbsp;0x0000000007028b0c in elf_load
> (loadinfo=0x7090550)&nbsp; at libelf/libelf_load.c:337
> &gt; &gt; &gt; #6 &nbsp;0x00000000070278aa in elf_loadbinary
> (binp=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0,
> nexports=0) at elf.c:257
> &gt; &gt; &gt; #7 &nbsp;0x00000000070293ea in load_absmodule
> (bin=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0,
> nexports=0) at binfmt_loadmodule.c:115
> &gt; &gt; &gt; #8 &nbsp;0x0000000007029504 in load_module (bin=0x708f5d0,
> filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0)&nbsp; at
> binfmt_loadmodule.c:219
> &gt; &gt; &gt; #9 &nbsp;0x0000000007027674 in exec_internal
> (filename=0x704bca8 "/system/bin/init", argv=0x70907a0, envp=0x0,
> exports=0x0, nexports=0, actions=0x0, attr=0x7090788, spawn=true) at
> binfmt_exec.c:98
> &gt; &gt; &gt; #10 0x000000000702779c in exec_spawn (filename=0x704bca8
> "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0,
> actions=0x0, attr=0x7090788) at binfmt_exec.c:220
> &gt; &gt; &gt; #11 0x000000000700299e in nx_start_application () at
> init/nx_bringup.c:375
> &gt; &gt; &gt; #12 0x00000000070029f0 in nx_start_task (argc=1,
> argv=0x7090010) at init/nx_bringup.c:403
> &gt; &gt; &gt; #13 0x0000000007003f84 in nxtask_start () at
> task/task_start.c:107
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt; It looks like mm/mm_heap data structure consistency was
> broken. As I am unfamilar with these internals, I am looking forward
> to&nbsp; any hints about how to find the root cause.
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt;
> &gt; &gt; &gt; Regards,
> &gt; &gt; &gt;
> &gt; &gt; &gt; yf
> &gt; &gt;
> &gt; &gt; This does indicate heap corruption:
> &gt; &gt;
> &gt; &gt;      240&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* Node next must
> be alloced, otherwise it should be merged.
> &gt; &gt;      241&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * Its
> prenode(the founded node) must be free and
> &gt; &gt;      preceding should
> &gt; &gt;      242&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * match with
> nodesize.
> &gt; &gt;      243&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; */
> &gt; &gt;      244
> &gt; &gt;      245&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> DEBUGASSERT(MM_NODE_IS_ALLOC(next) &amp;&amp;
> &gt; &gt;      MM_PREVNODE_IS_FREE(next) &amp;&amp;
> &gt; &gt;
> 246&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> next-&gt;preceding == nodesize);
> &gt; &gt;
> &gt; &gt; Heap corruption normally occurs when that this a wild write
> outside of
> &gt; &gt; the allocated memory region.&nbsp; These kinds of wild writes
> may clobber
> &gt; &gt; some other threads data and directory or indirectly clobber the
> heap
> &gt; &gt; meta data.&nbsp; Trying to traverse the damages heap meta data
> is probably
> &gt; &gt; the root cause of the problem.
> &gt; &gt;
> &gt; &gt; Only a kernel thread or interrupt handler could damage the heap.
> &gt; &gt;
> &gt; &gt; The cause of this corruption can be really difficult to find
> because the
> &gt; &gt; reported error does not occur when the heap is damaged but may
> not
> &gt; &gt; manifest itself until sometime later.
> &gt; &gt;
> &gt; &gt; It is unlikely that anyone will be able to solve this by just
> talking
> &gt; &gt; about it.&nbsp; It might be worth increasing some kernel thread
> heap sizes
> &gt; &gt; just to eliminate that common cause.
> &gt;
> &gt;

Re: Re:Re: mm/mm_heap assertion error

Reply via email to