Hello,

Occasionally when we put a client, typically a head node, under very heavy load, it freezes all operations on the Lustre mount and requires a hard reboot before the mount is usable again. The symptoms look similar to the statahead problem observed by others, but I was under the impression that this wouldn't be an issue in 1.6.5.1, the version that we're running.

On the client, the messages in the log file are:

Aug 19 12:42:47 herologin1 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_statfs operation failed with -107
Aug 19 12:42:47 herologin1 kernel: Lustre: circelfs-MDT0000-mdc-ffff81021eabdc00: Connection to service circelfs-MDT0000 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
Aug 19 12:42:47 herologin1 kernel: LustreError: 167-0: This client was evicted by circelfs-MDT0000; in progress operations using this service will fail.
Aug 19 12:42:47 herologin1 kernel: LustreError: 7067:0:(llite_lib.c:1549:ll_statfs_internal()) mdc_statfs fails: rc = -5

while on the MGS/MDS the messages are:

Aug 19 12:41:11 circe1 kernel: Lustre: MGS: haven't heard from client 1be0f382-ff65-f231-d348-9d2523654fbb (at [EMAIL PROTECTED]) in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:41:11 circe1 kernel: Lustre: Skipped 2 previous similar messages
Aug 19 12:41:50 circe1 kernel: Lustre: circelfs-OST001f: haven't heard from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at [EMAIL PROTECTED]) in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:41:51 circe1 kernel: Lustre: circelfs-OST0019: haven't heard from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at [EMAIL PROTECTED]) in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:41:52 circe1 kernel: Lustre: circelfs-OST001a: haven't heard from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at [EMAIL PROTECTED]) in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:42:15 circe1 kernel: Lustre: circelfs-MDT0000: haven't heard from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at [EMAIL PROTECTED]) in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:42:15 circe1 kernel: Lustre: Skipped 4 previous similar messages
Aug 19 12:42:47 circe1 kernel: LustreError: 7735:0:(handler.c:1515:mds_handle()) operation 41 on unconnected MDS from [EMAIL PROTECTED]
Aug 19 12:42:47 circe1 kernel: LustreError: 7735:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x12738961/t0 o41-><?>@<?>:0/0 lens 128/0 e 0 to 0 dl 1219164667 ref 1 fl Interpret:/0/0 rc -107/0
Aug 19 12:42:47 circe1 kernel: LustreError: 7735:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 41 previous similar messages

(In the logs above 10.242.40.14 = herologin1).

Should I try the

    echo 0 > /proc/fs/lustre/llite/*/statahead_max

solution that fixed the statahead problem?

Many thanks,
Chris
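
For reference, here is a minimal sketch of how the statahead_max workaround mentioned above might be applied when a client has more than one Lustre mount. In bash, a redirect to a glob that matches several files fails as an ambiguous redirect, so a loop over the matching files is safer. The function name disable_statahead and its optional directory argument are hypothetical and exist only so the loop can be tried against a scratch directory; only the /proc/fs/lustre/llite/*/statahead_max path comes from the post.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): write 0 to every statahead_max
# file found under the llite directory.
# $1 is an optional root directory; it defaults to the path quoted in the
# post and is only meant to be overridden for dry runs.
disable_statahead() {
    root="${1:-/proc/fs/lustre/llite}"
    found=0
    for f in "$root"/*/statahead_max; do
        # If the glob matched nothing, $f holds the literal pattern; skip it.
        [ -e "$f" ] || continue
        old=$(cat "$f")
        echo 0 > "$f" || return 1
        # Report the previous and the new value for each mount.
        echo "$f: $old -> $(cat "$f")"
        found=1
    done
    [ "$found" -eq 1 ] || { echo "no statahead_max files under $root" >&2; return 1; }
}
```

Run as root on the client, calling disable_statahead with no argument would touch every mounted Lustre filesystem. Whether disabling statahead actually addresses the evictions shown in the logs is a separate question that this sketch does not answer.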
