In my continuing quest (and self-education) on Lustre networking (lctl ping and obd_ping in particular):
My MGS/MDS box is losing the connection to one, and only one, particular OSS and then restoring it, all within the same wall-clock second:

MGS/MDS /var/log/messages:
Sep 22 11:04:58 mds1 kernel: LustreError: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: Lustre: crew8-OST0003-osc: Connection to service crew8-OST0003 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
Sep 22 11:04:58 mds1 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: LustreError: 167-0: This client was evicted by crew8-OST0003; in progress operations using this service will fail.
Sep 22 11:04:58 mds1 kernel: LustreError: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: Lustre: 931:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are active, abort quota recovery
Sep 22 11:04:58 mds1 kernel: Lustre: crew8-OST0003-osc: Connection restored to service crew8-OST0003 using nid [EMAIL PROTECTED]
Sep 22 11:04:58 mds1 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:04:59 mds1 kernel: Lustre: MDS crew8-MDT0000: crew8-OST0003_UUID now active, resetting orphans

My corresponding problem OSS has a "processing error" (???) and then resets its own connection:

OSS4 /var/log/messages:
Sep 22 11:00:16 oss4 kernel: LustreError: 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x1788392/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
Sep 22 11:00:16 oss4 kernel: LustreError: 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 12 previous similar messages
Sep 22 11:04:59 oss4 kernel: Lustre: crew8-OST0003: received MDS connection from [EMAIL PROTECTED]
Sep 22 11:04:59 oss4 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:07:20 oss4 kernel: Lustre: crew8-OST0001: haven't heard from client crew8-mdtlov_UUID (at [EMAIL PROTECTED]) in 391 seconds. I think it's dead, and I am evicting it.
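As a side note for reading these messages: the -107 looks like a negated Linux errno value, and on Linux errno 107 is ENOTCONN ("Transport endpoint is not connected"). The following is only a minimal sketch in Python, not anything shipped with Lustre, for pulling such codes out of a log line and naming them. The regular expression and the helper names `extract_rcs` and `decode_rc` are my own assumptions for illustration, and the decoding assumes the script runs on a Linux host so that the errno numbering matches the kernel's.

```python
import errno
import os
import re

# Assumed message shapes, taken from the excerpts in this post:
#   "processing error (-107)", "failed with -107", "rc -107/0"
RC_PATTERN = re.compile(r"(?:processing error \(|failed with |rc )(-\d+)")


def extract_rcs(line):
    """Return the distinct negative return codes found in one log line."""
    return sorted({int(m) for m in RC_PATTERN.findall(line)})


def decode_rc(rc):
    """Map a negated errno (e.g. -107) to (symbolic name, description).

    Uses the errno table of the machine running this script, so run it
    on Linux to match the values the Lustre kernel modules report.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


sample = ("Sep 22 11:00:16 oss4 kernel: LustreError: "
          "4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) "
          "@@@ processing error (-107)")

for rc in extract_rcs(sample):
    name, text = decode_rc(rc)
    print(rc, name, text)
```

Feeding it the obd_ping lines from a client log should report the same code, which at least confirms that both ends are complaining about a connection that is no longer established rather than about two unrelated failures.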
My client box here has the same connection error, but minutes later(!!). Odd. The boxes all use ntpd and sync from a common time server here. But the notable thing is that the obd_ping failure, lost connection, eviction and then restoration all occur within a wall-clock minute of one another.

crew01 /var/log/messages:
Sep 22 11:16:56 cn2 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The obd_ping operation failed with -107
Sep 22 11:16:56 cn2 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The obd_ping operation failed with -107
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection to service crew8-OST0000 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection to service crew8-OST0000 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
Sep 22 11:16:56 cn2 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: 167-0: This client was evicted by crew8-OST0000; in progress operations using this service will fail.
Sep 22 11:16:56 cn2 kernel: LustreError: 167-0: This client was evicted by crew8-OST0000; in progress operations using this service will fail.
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection restored to service crew8-OST0000 using nid [EMAIL PROTECTED]
Sep 22 11:16:56 cn2 kernel: Lustre: crew8-OST0000-osc-ffff81083ea5c400: Connection restored to service crew8-OST0000 using nid [EMAIL PROTECTED]

I have swapped IB network cables. The Linux (CentOS 5 on all systems) ping shows no dropped packets between any of the systems on the o2ib network. All lctl pings return normally. All systems are running the same OS code:

[EMAIL PROTECTED] ~]# uname -a
Linux oss4.crew.local 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

What is this "LustreError: 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)" error on my OSS? As the end users are not noticing anything, and all of the activity on this one OSS is "no communication, evicted, restored" inside of a minute, should I do anything other than clean my increasingly voluminous logfiles more frequently?

megan

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
