Hello Jan,

More often than not, when I see stat() syscalls hanging, it's due to a communication issue with an OSS rather than the MDS (file size and timestamps live on the OSTs, so even a stat() has to talk to the OSSes). I think the message "Lustre: comind-MDT0000: haven't heard from client ..." may be a downstream effect of the client hanging (perhaps due to an OSS issue), which causes the client to stop responding to the MDS, rather than the root cause.
(This is just conjecture, but it's based on the fact that in my experience, when I see the symptoms you have here, it's generally an OSS issue.)

Here are some methods I commonly use to check that a client can communicate with each server:

client $ lfs df
(should return a line for each server)

# get the server NID with "lctl list_nids" on the server side,
# and then for each server, do:
client $ lctl ping $SERVER_NID

client $ lctl get_param osc.*.state | grep -B1 current
(normal states include FULL and IDLE, but it shouldn't say DISCONN or CONNECTING ...)

Do those commands reveal any communication issues between the client and any of the servers?

- Thomas Bertschinger

________________________________________
From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Jan Andersen via lustre-discuss <lustre-discuss@lists.lustre.org>
Sent: Thursday, February 22, 2024 1:42 AM
To: lustre-discuss
Subject: [EXTERNAL] [lustre-discuss] open() against files on lustre hangs

I have the beginnings of a lustre filesystem, with a server, mds, hosting the MGS and MDS, and a storage node, oss1. The disks, /mgt and /mdt on mds and /ost on oss1, mount apparently without error. I have set up a client, pxe, which mounts /lustre:

root@node080027eb24b8:~# mount -t lustre mds@tcp:/comind /lustre

This appears to be successful - from dmesg:

...
[Wed Feb 21 10:54:59 2024] libcfs: loading out-of-tree module taints kernel.
[Wed Feb 21 10:54:59 2024] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[Wed Feb 21 10:54:59 2024] LNet: HW NUMA nodes: 1, HW CPU cores: 1, npartitions: 1
[Wed Feb 21 10:54:59 2024] alg: No test for adler32 (adler32-zlib)
[Wed Feb 21 10:55:00 2024] Key type ._llcrypt registered
[Wed Feb 21 10:55:00 2024] Key type .llcrypt registered
[Wed Feb 21 10:55:00 2024] Lustre: Lustre: Build Version: 2.15.4
[Wed Feb 21 10:55:00 2024] LNet: Added LNI 192.168.50.13@tcp [8/256/0/180]
[Wed Feb 21 10:55:00 2024] LNet: Accept secure, port 988
[Wed Feb 21 10:55:02 2024] Lustre: Mounted comind-client

I have, after several attempts, managed to create a file (or at least a directory entry):

root@node080027eb24b8:~# ls /lustre
test

However, anything that tries to open anything in /lustre - e.g., 'ls -l' - just hangs indefinitely, which I suspect is because it is waiting for some sort of response on a network socket. An strace shows:

root@node080027eb24b8:~# strace -f /usr/bin/cat /lustre/test
...
fstat(3, {st_mode=S_IFREG|0644, st_size=346132, ...}) = 0
mmap(NULL, 346132, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fb3d0994000
close(3) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
openat(AT_FDCWD, "/lustre/test", O_RDONLY) = 3
fstat(3,

I see no change in dmesg on pxe and oss1, but this on mds:

...
[Wed Feb 21 10:50:06 2024] LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Wed Feb 21 10:50:44 2024] LDISKFS-fs (sda): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Wed Feb 21 10:50:44 2024] Lustre: comind-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[Wed Feb 21 10:51:15 2024] Lustre: comind-OST0000-osc-MDT0000: Connection restored to (at 192.168.50.130@tcp)
[Wed Feb 21 10:57:04 2024] Lustre: comind-MDT0000: haven't heard from client 83befb6d-7ee2-4acb-997c-b15520dcb70d (at 192.168.50.13@tcp) in 240 seconds. I think it's dead, and I am evicting it. exp 00000000ddc96899, cur 1708513026 expire 1708512876 last 1708512786

So, something isn't right somewhere in the communication from pxe to mds - but what?

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
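[Editor's note] The import-state check Thomas suggests can be scripted. This is only a sketch: it assumes the usual `lctl get_param osc.*.state` output shape (each import's parameter name on its own line, followed by a `current_state:` line), and the script name `check_imports.sh` is hypothetical:

```shell
#!/bin/sh
# check_imports.sh - flag any OSC import whose state looks unhealthy.
# Feed it the output of:  lctl get_param osc.*.state
# (output format assumed, not taken from Jan's system)
# Healthy imports report FULL or IDLE; anything else is worth a look.
awk '
    /^osc\./         { import = $0 }    # remember which import this stanza describes
    /^current_state/ {
        if ($2 != "FULL" && $2 != "IDLE")
            printf "PROBLEM: %s -> %s\n", import, $2
    }
' "$@"
```

Run on the client as `lctl get_param osc.*.state | sh check_imports.sh`; an import stuck in DISCONN or CONNECTING identifies the server the client cannot reach.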