Hello Breno,

On Wed, Nov 05, 2025 at 02:18:11AM -0800, Breno Leitao wrote:
> Hello Pratyush,
>
> On Tue, Oct 14, 2025 at 03:10:37PM +0200, Pratyush Yadav wrote:
> > On Tue, Oct 14 2025, Breno Leitao wrote:
> > > On Mon, Oct 13, 2025 at 06:40:09PM +0200, Pratyush Yadav wrote:
> > >> On Mon, Oct 13 2025, Pratyush Yadav wrote:
> > >> >
> > >> > I suppose this would be useful. I think enabling memblock debug
> > >> > prints would also be helpful (using the "memblock=debug" command
> > >> > line parameter) if it doesn't impact your production environment
> > >> > too much.
> > >>
> > >> Actually, I think "memblock=debug" is going to be the more useful
> > >> thing, since it would also show which function allocated the
> > >> overlapping range and the flags it was allocated with.
> > >>
> > >> On my qemu VM with KVM, this results in around 70 prints from
> > >> memblock. So it adds a bit of extra output, but nothing that should
> > >> be too disruptive, I think. Plus, it only happens at boot, so the
> > >> worst you get is slightly slower boot times.
> > >
> > > Unfortunately this issue is happening on production systems, and I
> > > don't have an easy way to reproduce it _yet_.
> > >
> > > At the same time, "memblock=debug" has two problems:
> > >
> > > 1) It slows down boot, as you suggested. Boot time in large
> > >    environments is SUPER critical and time sensitive. It is a bit
> > >    weird, but it is common for machines in production to kexec
> > >    _thousands_ of times, and kexecing is considered downtime.
> >
> > I don't know if it would make a real enough difference in boot times,
> > only that it should theoretically affect them, mainly if you are
> > using serial for dmesg logs. Anyway, that's your production
> > environment, so you know best.
> >
> > >    This would be useful if I found some hosts hitting this issue,
> > >    since I could then easily enable the extra information to
> > >    collect what I need, but it didn't pan out because the hosts I
> > >    put `memblock=debug` on didn't cooperate.
> > >
> > > 2) "memblock=debug" is verbose in all cases, which is also not
> > >    necessarily the desired behaviour. I am more interested in being
> > >    verbose only when there is a known problem.
>
> I am still interested in this problem, and I finally found a host that
> constantly reproduces the issue, and I was able to add `memblock=debug`
> to its cmdline. I am running 6.18-rc4 with some debug options enabled.
>
> DMA-API: exceeded 7 overlapping mappings of cacheline 0x0000000006d6e400
> WARNING: CPU: 58 PID: 828 at kernel/dma/debug.c:463 add_dma_entry+0x2e4/0x330
> pc : add_dma_entry+0x2e4/0x330
> lr : add_dma_entry+0x2e4/0x330
> sp : ffff8000b036f7f0
> x29: ffff8000b036f800 x28: 0000000000000001 x27: 0000000000000008
> x26: ffff8000835f7fb8 x25: ffff8000835f7000 x24: ffff8000835f7ee0
> x23: 0000000000000000 x22: 0000000006d6e400 x21: 0000000000000000
> x20: 0000000006d6e400 x19: ffff0003f70c1100 x18: 00000000ffffffff
> x17: ffff80008019a2d8 x16: ffff80008019a08c x15: 0000000000000000
> x14: 0000000000000000 x13: 0000000000000820 x12: ffff00011faeaf00
> x11: 0000000000000000 x10: ffff8000834633d8 x9 : ffff8000801979d4
> x8 : 00000000fffeffff x7 : ffff8000834633d8 x6 : 0000000000000000
> x5 : 00000000000bfff4 x4 : 0000000000000000 x3 : ffff0001075eb7c0
> x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0001075eb7c0
> Call trace:
>  add_dma_entry+0x2e4/0x330 (P)
>  debug_dma_map_phys+0xc4/0xf0
>  dma_map_phys (/home/leit/Devel/upstream/./include/linux/dma-direct.h:138 /home/leit/Devel/upstream/kernel/dma/direct.h:102 /home/leit/Devel/upstream/kernel/dma/mapping.c:169)
>  dma_map_page_attrs (/home/leit/Devel/upstream/kernel/dma/mapping.c:387)
>  blk_dma_map_direct.isra.0 (/home/leit/Devel/upstream/block/blk-mq-dma.c:102)
>  blk_dma_map_iter_start (/home/leit/Devel/upstream/block/blk-mq-dma.c:123 /home/leit/Devel/upstream/block/blk-mq-dma.c:196)
>  blk_rq_dma_map_iter_start (/home/leit/Devel/upstream/block/blk-mq-dma.c:228)
>  nvme_prep_rq+0xb8/0x9b8
>  nvme_queue_rq+0x44/0x1b0
>  blk_mq_dispatch_rq_list (/home/leit/Devel/upstream/block/blk-mq.c:2129)
>  __blk_mq_sched_dispatch_requests (/home/leit/Devel/upstream/block/blk-mq-sched.c:314)
>  blk_mq_sched_dispatch_requests (/home/leit/Devel/upstream/block/blk-mq-sched.c:329)
>  blk_mq_run_work_fn (/home/leit/Devel/upstream/block/blk-mq.c:219 /home/leit/Devel/upstream/block/blk-mq.c:231)
>  process_one_work (/home/leit/Devel/upstream/kernel/workqueue.c:991 /home/leit/Devel/upstream/kernel/workqueue.c:3213)
>  worker_thread (/home/leit/Devel/upstream/./include/linux/list.h:163 /home/leit/Devel/upstream/./include/linux/list.h:191 /home/leit/Devel/upstream/./include/linux/list.h:319 /home/leit/Devel/upstream/kernel/workqueue.c:1153 /home/leit/Devel/upstream/kernel/workqueue.c:1205 /home/leit/Devel/upstream/kernel/workqueue.c:3426)
>  kthread (/home/leit/Devel/upstream/kernel/kthread.c:386 /home/leit/Devel/upstream/kernel/kthread.c:457)
>  ret_from_fork (/home/leit/Devel/upstream/entry.S:861)
>
> Looking at memblock debug logs, I haven't seen anything related to
> 0x0000000006d6e400.
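
A note on the cost side first, since boot time was the concern: the
"memblock=debug" output is just memblock's own debug prints during early
boot, gated on an early_param, so there is nothing left to pay once
memblock hands memory over to the buddy allocator. Roughly speaking, it
boils down to the plumbing below (a paraphrased sketch of mm/memblock.c,
not a verbatim copy of any particular kernel version):

/* Paraphrased sketch of the memblock=debug plumbing, not verbatim. */
static int memblock_debug __initdata_memblock;

/*
 * All of memblock's debug output goes through this macro, so the whole
 * feature costs one branch per print site when the flag is off.
 */
#define memblock_dbg(fmt, ...)						\
	do {								\
		if (memblock_debug)					\
			pr_info(fmt, ##__VA_ARGS__);			\
	} while (0)

/* "memblock=debug" on the command line just flips the flag. */
static int __init early_memblock(char *p)
{
	if (p && strstr(p, "debug"))
		memblock_debug = 1;
	return 0;
}
early_param("memblock", early_memblock);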
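
As for the splat itself, it comes from CONFIG_DMA_API_DEBUG: dma-debug
tracks every active DMA mapping and keeps a small per-cacheline overlap
counter, and it WARNs once too many mappings of the same cacheline are
active at the same time (the 7 in the message is that limit), because
that often points at leaked or duplicated mappings. Very roughly, the
check looks like the sketch below; this is a simplified model, not the
real kernel/dma/debug.c code, and count_active_mappings() is a
placeholder for the radix-tree bookkeeping dma-debug actually does:

/*
 * Simplified model of the check behind the warning, NOT the real
 * kernel/dma/debug.c code.  count_active_mappings() stands in for the
 * per-cacheline bookkeeping that dma-debug keeps for active mappings.
 */
#define ACTIVE_CACHELINE_MAX_OVERLAP	7

static void check_cacheline_overlap(unsigned long cln)
{
	/* how many currently active mappings touch this cacheline */
	int overlap = count_active_mappings(cln);

	WARN_ONCE(overlap > ACTIVE_CACHELINE_MAX_OVERLAP,
		  "DMA-API: exceeded %d overlapping mappings of cacheline 0x%016lx\n",
		  ACTIVE_CACHELINE_MAX_OVERLAP, cln);
}

Note that all of this bookkeeping happens at dma_map time, long after
early boot.
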
It looks like the crash happens way after memblock passed all the memory
to buddy. Why do you think this is related to memblock?

> I got the output of `dmesg | grep memblock`, in case you are curious:
>
> https://github.com/leitao/debug/blob/main/pastebin/memblock/dmesg_grep_memblock.txt
>
> Thanks
> --breno

--
Sincerely yours,
Mike.
