Hello- I am running a 2.16.1 client (g:718a3fd9c78507a589a657ec3745b4c553f66633) on Ubuntu 24.04.1 LTS (kernel 6.8.0-51). The client is using infiniband with in-kernel ofed drivers. I am seeing a mostly reproducible (although subject to race conditions) issue that occurs when a client server runs out of memory. I see messages like the following:
2025-02-04T18:02:01.300296-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11174853 2025-02-04T18:02:01.311854-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11174453 2025-02-04T18:02:01.417172-05:00 hostname kernel: message repeated 14 times: [ obd_memory max: 1282184920, obd_memory current: 11174453] 2025-02-04T18:02:01.417173-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11174021 2025-02-04T18:02:01.417184-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11173805 which are output to the syslog forever with ptlrpcd spinning 100% CPU, while lustre client processes which were waiting on I/O are hung. In this instance, Linux oom-killer was invoked and killed a process, it output a bit later after the first errors from obdclass_oom_handler: 2025-02-04T18:02:01.420654-05:00 hostname kernel: awk invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0 2025-02-04T18:02:01.420663-05:00 hostname kernel: CPU: 1 PID: 49999 Comm: awk Tainted: G OE 6.8.0-51-generic #52-Ubuntu 2025-02-04T18:02:01.420664-05:00 hostname kernel: Hardware name: Supermicro X8DTU-6+/X8DTU-6+, BIOS 2.1a 07/08/11 2025-02-04T18:02:01.420665-05:00 hostname kernel: Call Trace: 2025-02-04T18:02:01.420665-05:00 hostname kernel: <TASK> 2025-02-04T18:02:01.420666-05:00 hostname kernel: dump_stack_lvl+0x76/0xa0 2025-02-04T18:02:01.420666-05:00 hostname kernel: dump_stack+0x10/0x20 2025-02-04T18:02:01.420667-05:00 hostname kernel: dump_header+0x47/0x1f0 2025-02-04T18:02:01.420668-05:00 hostname kernel: oom_kill_process+0x118/0x280 2025-02-04T18:02:01.420668-05:00 hostname kernel: ? oom_evaluate_task+0x143/0x1e0 2025-02-04T18:02:01.420669-05:00 hostname kernel: out_of_memory+0x103/0x350 2025-02-04T18:02:01.420669-05:00 hostname kernel: __alloc_pages_may_oom+0x10c/0x1d0 2025-02-04T18:02:01.420670-05:00 hostname kernel: __alloc_pages_slowpath.constprop.0+0x420/0x9f0 2025-02-04T18:02:01.420670-05:00 hostname kernel: __alloc_pages+0x31f/0x350 2025-02-04T18:02:01.420671-05:00 hostname kernel: alloc_pages_mpol+0x91/0x210 2025-02-04T18:02:01.420671-05:00 hostname kernel: folio_alloc+0x64/0x120 2025-02-04T18:02:01.420672-05:00 hostname kernel: ? filemap_get_entry+0xe5/0x160 2025-02-04T18:02:01.420672-05:00 hostname kernel: filemap_alloc_folio+0xf4/0x100 2025-02-04T18:02:01.420673-05:00 hostname kernel: __filemap_get_folio+0x14b/0x2f0 2025-02-04T18:02:01.420673-05:00 hostname kernel: filemap_fault+0x15c/0x8e0 2025-02-04T18:02:01.420674-05:00 hostname kernel: __do_fault+0x38/0x140 2025-02-04T18:02:01.420674-05:00 hostname kernel: do_read_fault+0x133/0x1d0 2025-02-04T18:02:01.420675-05:00 hostname kernel: do_fault+0x109/0x350 2025-02-04T18:02:01.420675-05:00 hostname kernel: handle_pte_fault+0x114/0x1d0 2025-02-04T18:02:01.420676-05:00 hostname kernel: __handle_mm_fault+0x653/0x790 2025-02-04T18:02:01.420677-05:00 hostname kernel: handle_mm_fault+0x18a/0x380 2025-02-04T18:02:01.420677-05:00 hostname kernel: do_user_addr_fault+0x169/0x670 2025-02-04T18:02:01.420678-05:00 hostname kernel: exc_page_fault+0x83/0x1b0 2025-02-04T18:02:01.420678-05:00 hostname kernel: asm_exc_page_fault+0x27/0x30 2025-02-04T18:02:01.420705-05:00 hostname kernel: RIP: 0033:0x44b932 2025-02-04T18:02:01.420707-05:00 hostname kernel: Code: Unable to access opcode bytes at 0x44b908. 2025-02-04T18:02:01.420707-05:00 hostname kernel: RSP: 002b:00007ffca65a1ac0 EFLAGS: 00010246 2025-02-04T18:02:01.420709-05:00 hostname kernel: RAX: 0000000039a2b350 RBX: 00000000399e1650 RCX: 373130325f646573 2025-02-04T18:02:01.420709-05:00 hostname kernel: RDX: 00000000399d51d0 RSI: 325f6465736f6c63 RDI: 00000000399d51d0 2025-02-04T18:02:01.420710-05:00 hostname kernel: RBP: 00007ffca65a1b70 R08: 00000000399cb010 R09: 0000000000000007 2025-02-04T18:02:01.420710-05:00 hostname kernel: R10: 00000000399d51d0 R11: fe1f64ced69f9b4f R12: 0000000000000000 2025-02-04T18:02:01.420711-05:00 hostname kernel: R13: 00000000004a3fa0 R14: 0000000039a0b080 R15: 0000000039a2b2e0 2025-02-04T18:02:01.420711-05:00 hostname kernel: </TASK> 2025-02-04T18:02:01.420712-05:00 hostname kernel: Mem-Info: 2025-02-04T18:02:01.420713-05:00 hostname kernel: active_anon:10285863 inactive_anon:1280438 isolated_anon:0 2025-02-04T18:02:01.420713-05:00 hostname kernel: active_file:312 inactive_file:750 isolated_file:0 2025-02-04T18:02:01.420714-05:00 hostname kernel: unevictable:41500 dirty:34 writeback:0 2025-02-04T18:02:01.420715-05:00 hostname kernel: slab_reclaimable:21229 slab_unreclaimable:33598 2025-02-04T18:02:01.420715-05:00 hostname kernel: mapped:12223 shmem:1218 pagetables:24740 2025-02-04T18:02:01.420715-05:00 hostname kernel: sec_pagetables:0 bounce:0 2025-02-04T18:02:01.420716-05:00 hostname kernel: kernel_misc_reclaimable:0 2025-02-04T18:02:01.420716-05:00 hostname kernel: free:64845 free_pcp:0 free_cma:0 2025-02-04T18:02:01.420717-05:00 hostname kernel: Node 0 active_anon:41143452kB inactive_anon:5121752kB active_file:1572kB inactive_file:2676kB unevictable:166000kB isolated(anon):0kB isolated(file):0kB mapped :48892kB dirty:136kB writeback:0kB shmem:4872kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:0kB writeback_tmp:0kB kernel_stack:7360kB pagetables:98960kB sec_pagetables:0kB all_unreclaimable? no 2025-02-04T18:02:01.420721-05:00 hostname kernel: Node 0 DMA free:11264kB boost:0kB min:20kB low:32kB high:44kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevi ctable:0kB writepending:0kB present:15964kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB 2025-02-04T18:02:01.420722-05:00 hostname kernel: lowmem_reserve[]: 0 2926 48083 48083 48083 2025-02-04T18:02:01.420723-05:00 hostname kernel: Node 0 DMA32 free:184676kB boost:0kB min:4112kB low:7108kB high:10104kB reserved_highatomic:0KB active_anon:2755536kB inactive_anon:107204kB active_file:0kB in active_file:732kB unevictable:0kB writepending:4kB present:3120640kB managed:3055104kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB 2025-02-04T18:02:01.420723-05:00 hostname kernel: lowmem_reserve[]: 0 0 45156 45156 45156 2025-02-04T18:02:01.420724-05:00 hostname kernel: Node 0 Normal free:63440kB boost:0kB min:63448kB low:109688kB high:155928kB reserved_highatomic:0KB active_anon:38387492kB inactive_anon:5014972kB active_file: 1288kB inactive_file:2392kB unevictable:166000kB writepending:0kB present:47185920kB managed:46247332kB mlocked:162928kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB 2025-02-04T18:02:01.420725-05:00 hostname kernel: lowmem_reserve[]: 0 0 0 0 0 2025-02-04T18:02:01.420725-05:00 hostname kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB 2025-02-04T18:02:01.420726-05:00 hostname kernel: Node 0 DMA32: 35*4kB (UME) 60*8kB (UME) 127*16kB (UME) 119*32kB (UME) 117*64kB (UME) 91*128kB (UME) 83*256kB (UME) 65*512kB (UME) 40*1024kB (UME) 25*2048kB (UM E) 3*4096kB (UE) = 184572kB 2025-02-04T18:02:01.420727-05:00 hostname kernel: Node 0 Normal: 1394*4kB (UME) 1801*8kB (UME) 1236*16kB (UME) 566*32kB (UME) 97*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 64208kB 2025-02-04T18:02:01.420728-05:00 hostname kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB 2025-02-04T18:02:01.420728-05:00 hostname kernel: Node 0 hugepages_total=1024 hugepages_free=1024 hugepages_surp=0 hugepages_size=2048kB 2025-02-04T18:02:01.420729-05:00 hostname kernel: 14516 total pagecache pages 2025-02-04T18:02:01.420729-05:00 hostname kernel: 0 pages in swap cache 2025-02-04T18:02:01.420730-05:00 hostname kernel: Free swap = 13631420kB 2025-02-04T18:02:01.420730-05:00 hostname kernel: Total swap = 13631420kB 2025-02-04T18:02:01.420730-05:00 hostname kernel: 12580631 pages RAM 2025-02-04T18:02:01.420731-05:00 hostname kernel: 0 pages HighMem/MovableOnly 2025-02-04T18:02:01.420731-05:00 hostname kernel: 251182 pages reserved 2025-02-04T18:02:01.420732-05:00 hostname kernel: 0 pages hwpoisoned ... 2025-02-04T18:02:01.420887-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11195973 2025-02-04T18:02:01.421187-05:00 hostname kernel: message repeated 269 times: [ obd_memory max: 1282184920, obd_memory current: 11195973] 2025-02-04T18:02:01.421187-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11196765 2025-02-04T18:02:01.421192-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11200645 2025-02-04T18:02:01.421193-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11217029 2025-02-04T18:02:01.421194-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11217029 2025-02-04T18:02:01.421195-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11217429 2025-02-04T18:02:01.421195-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11217429 2025-02-04T18:02:01.421196-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11216645 2025-02-04T18:02:01.421197-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11197597 2025-02-04T18:02:01.421198-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11208125 It doesn't happen in every test, presumably it just requires the out-of-memory situation to arise at a certain point in the execution, but I can make it happen pretty reliably by using up all the RAM while a process is reading from the lustre filesystem. Is this a known issue already? Thanks! -Lewis _______________________________________________ lustre-discuss mailing list [email protected] http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
