Hello, I have been trying to chase some errors on our machines that are leading to OOM killer killing some processes. Sometimes i see OOM errors and lustre errors around the same time but so far i have failed to find out if OOM killer is causing lustre errors or lustre errors (possibly caused by network errors) are the cause of OOM (the theory here is that there are dirty blocks in page cache that cannot be freed because there is a network delays in writing them to the server).
>From the userspace it is visible either as a process being killed by oom or erroring out with I/O error (-5) while reading or writing a file on Lustre file system. Below is the case when it failed because of I/O error. What would be the steps to diagnose problem? Lustre: Lustre: Build Version: 2.10.1 Linux hnod2004 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux 172.27.31.151: Dec 15 19:17:00 hnod2004 kernel: INFO: task awk:108160 blocked for more than 120 seconds. Dec 15 19:17:00 hnod2004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 15 19:17:00 hnod2004 kernel: awk D ffffffff816a76b0 0 108160 108148 0x00000080 Dec 15 19:17:00 hnod2004 kernel: ffff8802155f3800 0000000000000002 ffff8804c1b41fa0 ffff8802155f3fd8 Dec 15 19:17:00 hnod2004 kernel: ffff8802155f3fd8 ffff8802155f3fd8 ffff8804c1b41fa0 ffff881fbbb96cc0 Dec 15 19:17:00 hnod2004 kernel: 0000000000000000 7fffffffffffffff ffff88203ff91de8 ffffffff816a76b0 Dec 15 19:17:00 hnod2004 kernel: Call Trace: Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a76b0>] ? bit_wait+0x50/0x50 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a9589>] schedule+0x29/0x70 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0 Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc0a35318>] ? ldlm_lock_remove_from_lru_check+0x158/0x1a0 [ptlrpc] Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc0a36899>] ? ldlm_lock_addref_internal_nolock+0x19/0x100 [ptlrpc] Dec 15 19:17:00 hnod2004 kernel: [<ffffffff810e93ac>] ? ktime_get_ts64+0x4c/0xf0 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a76b0>] ? bit_wait+0x50/0x50 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a8c0d>] io_schedule_timeout+0xad/0x130 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a8ca8>] io_schedule+0x18/0x20 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a76c1>] bit_wait_io+0x11/0x50 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816a71e5>] __wait_on_bit+0x65/0x90 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff81181cc1>] wait_on_page_bit+0x81/0xa0 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff810b19e0>] ? wake_bit_function+0x40/0x40 Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1167bb2>] vvp_page_assume+0x32/0xa0 [lustre] Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc061d1f8>] cl_page_assume+0x68/0x240 [obdclass] Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1159df0>] ll_write_begin+0x100/0x830 [lustre] Dec 15 19:17:00 hnod2004 kernel: [<ffffffff811824de>] generic_file_buffered_write+0x11e/0x2a0 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff81184902>] __generic_file_aio_write+0x1e2/0x400 Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1168d5b>] __generic_file_write_iter+0xcb/0x340 [lustre] Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc116cc46>] vvp_io_write_start+0x4a6/0x810 [lustre] Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc0620217>] ? cl_lock_request+0x67/0x1f0 [obdclass] Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc0621e25>] cl_io_start+0x65/0x130 [obdclass] Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc06241ee>] cl_io_loop+0x12e/0xc90 [obdclass] Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1118258>] ll_file_io_generic+0x498/0xc40 [lustre] Dec 15 19:17:00 hnod2004 kernel: [<ffffffff811bc401>] ? page_add_file_rmap+0x1/0xc0 Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1118cdd>] ll_file_aio_write+0x12d/0x1f0 [lustre] Dec 15 19:17:00 hnod2004 kernel: [<ffffffffc1118e6e>] ll_file_write+0xce/0x1e0 [lustre] Dec 15 19:17:00 hnod2004 kernel: [<ffffffff81200cad>] vfs_write+0xbd/0x1e0 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff8111f394>] ? __audit_syscall_entry+0xb4/0x110 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff81201abf>] SyS_write+0x7f/0xe0 Dec 15 19:17:00 hnod2004 kernel: [<ffffffff816b5292>] tracesys+0xdd/0xe2 Dec 15 19:18:12 hnod2004 kernel: LustreError: 11-0: h4-OST002e-osc-ffff881f01e01800: operation obd_ping to node 172.27.0.64@tcp failed: rc = -107 Dec 15 19:18:12 hnod2004 kernel: Lustre: h4-OST002e-osc-ffff881f01e01800: Connection to h4-OST002e (at 172.27.0.64@tcp) was lost; in progress operations using this service will wait for recovery to complete Dec 15 19:18:12 hnod2004 kernel: LustreError: 167-0: h4-OST002e-osc-ffff881f01e01800: This client was evicted by h4-OST002e; in progress operations using this service will fail. OST logs(172.27.0.64): [Sat Dec 15 19:12:55 2018] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 172.27.31.151@tcp ns: filter-h4-OST002e_UUID lock: ffff88101d9cae00/0x7f97843f44bf7e83 lrc: 3/0,0 mode: PW/PW res: [0xd64612:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400000020 nid: 172.27.31.151@tcp remote: 0xa2b3e4825f0f14ba expref: 19 pid: 15538 timeout: 9310208750 lvb_type: 0 [Sat Dec 15 19:12:56 2018] LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) Skipped 1 previous similar message Regards. -- Jacek Tomaka Geophysical Software Developer DownUnder GeoSolutions 76 Kings Park Road West Perth 6005 WA, Australia tel +61 8 9287 4143 [email protected] www.dug.com _______________________________________________ lustre-discuss mailing list [email protected] http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
