You can use the stack monitor to see the stack consumption. Best Regards,
Alan On Tue, Mar 12, 2024 at 7:38 AM yfliu2008 <yfliu2...@qq.com.invalid> wrote: > Dear experts, > > > > After enlarging the stack size of "AppBringUp" thread, the remote > node can boot NSH on RPMSGFS now. I am sorry for not trying this earlier. I > was browsing the "rpmsgfs.c" blindly and noticed a few auto variables > defined in the stack... then I thought it might worth a try so I did it. > > > Now I am still unclear about why small stack leads to heap corruption? > Also how we can read this stack issue from stackdump logs? Let me > know if you have any hints. > > > Regards, > yf > > > > > > > > > > > > > > > > > > Original > > > > From:"yfliu2008"< yfliu2...@qq.com >; > > Date:2024/3/12 15:10 > > To:"dev"< dev@nuttx.apache.org >; > > Subject:Re:Re: mm/mm_heap assertion error > > > Nathan, > > > Here I disabled RPMsg UART device initialization but the crash still > happens, I don't see other options to disable for now. On the other hand, > if we choose not mounting NSH from the RPMSGFS, it can boot smoothly and > after boot we can manually mount the RPMSGFS for playing. > > > > > I uploaded the logs, callstacks and ELFs at > https://github.com/yf13/hello/tree/debug-logs/nsh-rpmsgfs . There are two > sets from two ELFs created from same code base but with different DEBUG _xx > configs. The crash happens earlier in the build with more debug > options. > > > Please let me know if you have any more suggestions. > > > Regards, > yf > > > > > Original > > > > From:"Nathan Hartman"< hartman.nat...@gmail.com >; > > Date:2024/3/12 1:27 > > To:"dev"< dev@nuttx.apache.org >; > > Subject:Re: mm/mm_heap assertion error > > > What's needed is some way to binary search where the culprit is. > > If I understand correctly, it looks like the crash is happening in the > later stages of board bring-up? What is running before that? Can parts > be disabled or skipped to see if the problem goes away? > > Another idea is to try running a static analysis tool on the sources > and see if it finds anything suspicious to be looked into more > carefully. > > > On Mon, Mar 11, 2024 at 10:00 AM Gregory Nutt wrote: > > > > The reason that the error is confusing is because the error probably > did > > not occur at the time of the assertion; it probably occurred much > earlier. > > > > In most crashes due to heap corruption there are two players: the > > culprit and the victim threads. The culprit thread actually cause the > > corruption. But at the time of the corruption, no error occurs. The > > error will not occur until later. > > > > So sometime later, the victim thread runs, encounters the clobbered > heap > > and crashes. In this case, "AppBringup" and "rptun" are potential > > victim threads. The fact that they crash tell you very little about > the > > culprit. > > > > On 3/10/2024 6:51 PM, yfliu2008 wrote: > > > Gregory, thank you for the analysis. > > > > > > > > > > > > > > > The crashes happened during system booting up, mostly at > "AppBringup" or "rptun" threads, as per the assertion logs. The other > threads existing are the "idle" and the "lpwork" threads as per the sched > logs. There should be no other threads as NSH creation is still > ongoing. As for interruptions, the UART and IPI are running in kernel > space and MTIMER are in NuttSBI space. The NSH is loaded from a > RPMSGFS volume, thus there are a lot RPMSG communications. > > > > > > > > > > > > > > > Is the KASAN proper for use in Kernel mode? > > > > > > > > > With MM_KASAN_ALL it reports a read access error: > > > > > > > > > > > > BCkasan_report: kasan detected a read access error, address at > 0x708fe90,size is 8, return address: 0x701aeac > > > > > > _assert: Assertion failed panic: at file: kasan/kasan.c:117 > task: Idle_Task process: Kernel 0x70023c0 > > > > > > > > > The call stack looks like: > > > > > > > > > #0 _assert (filename=0x7060f78 "kasan/kasan.c", > linenum=117, msg=0x7060ff0 "panic", regs=0x7082720 > > #2 > 0x00000000070141d6 in kasan_report (addr=0x708fe90, size=8, > is_write=false, return_address=0x701aeac > > #3 > 0x0000000007014412 in kasan_check_report (addr=0x708fe90, size=8, > is_write=false, return_address=0x701aeac > > #4 > 0x000000000701468c in __asan_load8_noabort (addr=0x708fe90) at > kasan/kasan.c:315 > > > #5 0x000000000701aeac in riscv_swint (irq=0, > context=0x708fe40, arg=0x0) at common/riscv_swint.c:133 > > > #6 0x000000000701b8fe in riscv_perform_syscall > (regs=0x708fe40) at common/supervisor/riscv_perform_syscall.c:45 > > > #7 0x0000000007000570 in sys_call6 () > > > > > > > > > > > > With MM_KASAN_DISABLE_READ_CHECKS=y, it reports: > > > > > > > > > _assert: Assertion failed : at file: mm_heap/mm_malloc.c:245 > task: rptun process: Kernel 0x704a030 > > > > > > > > > The call stack is: > > > > > > > > > #0 _assert (filename=0x7056060 "mm_heap/mm_malloc.c", > linenum=245, msg=0x0, regs=0x7082720 > > #2 0x0000000007013082 > in mm_malloc (heap=0x7089c00, size=128) at mm_heap/mm_malloc.c:245 > > > #3 0x0000000007011694 in kmm_malloc (size=128) at > kmm_heap/kmm_malloc.c:51 > > > #4 0x000000000704efd4 in metal_allocate_memory (size=128) > at .../nuttx/include/metal/system/nuttx/alloc.h:27 > > > #5 0x000000000704fd8a in rproc_virtio_create_vdev (role=1, > notifyid=0, > > > rsc=0x80200050, rsc_io=0x7080408 > > > notify=0x704e6d2 > > at > open-amp/lib/remoteproc/remoteproc_virtio.c:356 > > > #6 0x000000000704e956 in remoteproc_create_virtio > (rproc=0x708ecd8, > > > vdev_id=0, role=1, rst_cb=0x0) at > open-amp/lib/remoteproc/remoteproc.c:957 > > > #7 0x000000000704b1ee in rptun_dev_start (rproc=0x708ecd8) > > > at rptun/rptun.c:757 > > > #8 0x0000000007049ff8 in rptun_start_worker (arg=0x708eac0) > > > at rptun/rptun.c:233 > > > #9 0x000000000704a0ac in rptun_thread (argc=3, > argv=0x7092010) > > > at rptun/rptun.c:253 > > > #10 0x000000000700437e in nxtask_start () at > task/task_start.c:107 > > > > > > > > > This looks like already corrupted. > > > > > > > > > > > > I also noticed there is a "mm_checkcorruption()" function, not > sure how to use it yet. > > > > > > > > > > > > Regards, > > > yf > > > > > > > > > > > > > > > > > > Original > > > > > > > > > > > > From:"Gregory Nutt"< spudan...@gmail.com >; > > > > > > Date:2024/3/11 1:43 > > > > > > To:"dev"< dev@nuttx.apache.org >; > > > > > > Subject:Re: mm/mm_heap assertion error > > > > > > > > > On 3/10/2024 4:38 AM, yfliu2008 wrote: > > > > Dear experts, > > > > > > > > > > > > > > > > > > > > When doing regression check on K230 with a previously > working Kernel mode configuration, I got assertion error like below: > > > > > > > > > > > > > > > > #0 _assert (filename=0x704c598 "mm_heap/mm_malloc.c", > linenum=245, msg=0x0,regs=0x7082730 > #2 0x00000000070110f0 in > mm_malloc (heap=0x7089c00, size=112) at mm_heap/mm_malloc.c:245 > > > > #3 0x000000000700fd74 in kmm_malloc (size=112) at > kmm_heap/kmm_malloc.c:51 > > > > #4 0x0000000007028d4e in elf_loadphdrs > (loadinfo=0x7090550) at libelf/libelf_sections.c:207 > > > > #5 0x0000000007028b0c in elf_load > (loadinfo=0x7090550) at libelf/libelf_load.c:337 > > > > #6 0x00000000070278aa in elf_loadbinary > (binp=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, > nexports=0) at elf.c:257 > > > > #7 0x00000000070293ea in load_absmodule > (bin=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, > nexports=0) at binfmt_loadmodule.c:115 > > > > #8 0x0000000007029504 in load_module (bin=0x708f5d0, > filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at > binfmt_loadmodule.c:219 > > > > #9 0x0000000007027674 in exec_internal > (filename=0x704bca8 "/system/bin/init", argv=0x70907a0, envp=0x0, > exports=0x0, nexports=0, actions=0x0, attr=0x7090788, spawn=true) at > binfmt_exec.c:98 > > > > #10 0x000000000702779c in exec_spawn (filename=0x704bca8 > "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0, > actions=0x0, attr=0x7090788) at binfmt_exec.c:220 > > > > #11 0x000000000700299e in nx_start_application () at > init/nx_bringup.c:375 > > > > #12 0x00000000070029f0 in nx_start_task (argc=1, > argv=0x7090010) at init/nx_bringup.c:403 > > > > #13 0x0000000007003f84 in nxtask_start () at > task/task_start.c:107 > > > > > > > > > > > > > > > > It looks like mm/mm_heap data structure consistency was > broken. As I am unfamilar with these internals, I am looking forward > to any hints about how to find the root cause. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > yf > > > > > > This does indicate heap corruption: > > > > > > 240 /* Node next must > be alloced, otherwise it should be merged. > > > 241 * Its > prenode(the founded node) must be free and > > > preceding should > > > 242 * match with > nodesize. > > > 243 */ > > > 244 > > > 245 > DEBUGASSERT(MM_NODE_IS_ALLOC(next) && > > > MM_PREVNODE_IS_FREE(next) && > > > > 246 > next->preceding == nodesize); > > > > > > Heap corruption normally occurs when that this a wild write > outside of > > > the allocated memory region. These kinds of wild writes > may clobber > > > some other threads data and directory or indirectly clobber the > heap > > > meta data. Trying to traverse the damages heap meta data > is probably > > > the root cause of the problem. > > > > > > Only a kernel thread or interrupt handler could damage the heap. > > > > > > The cause of this corruption can be really difficult to find > because the > > > reported error does not occur when the heap is damaged but may > not > > > manifest itself until sometime later. > > > > > > It is unlikely that anyone will be able to solve this by just > talking > > > about it. It might be worth increasing some kernel thread > heap sizes > > > just to eliminate that common cause. > > > >