Hello list,

I have finally got my new Lustre disk online, having rescued as much as I could from the old, hardware-failed volume.
The new disk is mounted on new hardware, OSS4 (for Object Storage Server 4; no, I am not imaginative). The disk OSTs are called "crew8". They mount happily on OSS4, as shown:

/dev/sdb1  6.3T   897G  5.1T  15%  /srv/lustre/OST/crew8-OST0000
/dev/sdb2  6.3T   867G  5.1T  15%  /srv/lustre/OST/crew8-OST0001
/dev/sdc1  6.3T   892G  5.1T  15%  /srv/lustre/OST/crew8-OST0002
/dev/sdc2  6.3T  1003G  5.0T  17%  /srv/lustre/OST/crew8-OST0003
/dev/sdd1  6.3T   907G  5.1T  15%  /srv/lustre/OST/crew8-OST0004
/dev/sdd2  6.3T   877G  5.1T  15%  /srv/lustre/OST/crew8-OST0005
/dev/sdi1  6.3T   916G  5.1T  16%  /srv/lustre/OST/crew8-OST0006
/dev/sdi2  6.3T   920G  5.1T  16%  /srv/lustre/OST/crew8-OST0007
/dev/sdj1  6.3T   901G  5.1T  15%  /srv/lustre/OST/crew8-OST0008
/dev/sdj2  6.3T   895G  5.1T  15%  /srv/lustre/OST/crew8-OST0009
/dev/sdk1  6.3T   878G  5.1T  15%  /srv/lustre/OST/crew8-OST0010
/dev/sdk2  6.3T   891G  5.1T  15%  /srv/lustre/OST/crew8-OST0011

The MGS/MDS (which currently serves two other Lustre volumes for us) shows the following:

[root@mds1 ~]# lctl dl
  0 UP mgs MGS MGS 13
  1 UP mgc [EMAIL PROTECTED] 81039216-0261-c74d-3f2f-a504788ad8f8 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
  4 UP mds crew2-MDT0000 crew2mds_UUID 7
  5 UP osc crew2-OST0000-osc crew2-mdtlov_UUID 5
  6 UP osc crew2-OST0001-osc crew2-mdtlov_UUID 5
  7 UP osc crew2-OST0002-osc crew2-mdtlov_UUID 5
  8 UP lov crew3-mdtlov crew3-mdtlov_UUID 4
  9 UP mds crew3-MDT0000 crew3mds_UUID 7
 10 UP osc crew3-OST0000-osc crew3-mdtlov_UUID 5
 11 UP osc crew3-OST0001-osc crew3-mdtlov_UUID 5
 12 UP osc crew3-OST0002-osc crew3-mdtlov_UUID 5
 13 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
 14 UP mds crew8-MDT0000 crew8-MDT0000_UUID 9
 15 UP osc crew8-OST0000-osc crew8-mdtlov_UUID 5
 16 UP osc crew8-OST0001-osc crew8-mdtlov_UUID 5
 17 UP osc crew8-OST0002-osc crew8-mdtlov_UUID 5
 18 UP osc crew8-OST0003-osc crew8-mdtlov_UUID 5
 19 UP osc crew8-OST0004-osc crew8-mdtlov_UUID 5
 20 UP osc crew8-OST0005-osc crew8-mdtlov_UUID 5
 21 UP osc crew8-OST0006-osc crew8-mdtlov_UUID 5
 22 UP osc crew8-OST0007-osc crew8-mdtlov_UUID 5
 23 UP osc crew8-OST0008-osc crew8-mdtlov_UUID 5
 24 UP osc crew8-OST0009-osc crew8-mdtlov_UUID 5
 25 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
 26 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5

(NOTE: the last two OSTs came up as crew8-OST000a and crew8-OST000b rather than crew8-OST0010 and crew8-OST0011. As I understand it, Lustre prints OST indices in hexadecimal, so 000a and 000b would simply be indices 10 and 11; I don't know if that has anything at all to do with my issue.)

The clients are forever losing this one crew8 volume (mounted on the clients as /crewdat). From /var/log/messages on the clients:

[EMAIL PROTECTED] ~]$ tail /var/log/messages
Sep 18 13:53:10 crew01 kernel: Lustre: crew8-OST0002-osc-ffff8101edbff400: Connection restored to service crew8-OST0002 using nid [EMAIL PROTECTED]
Sep 18 13:53:10 crew01 kernel: Lustre: Skipped 4 previous similar messages
Sep 18 13:54:05 cn2 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The obd_ping operation failed with -107
Sep 18 13:54:05 cn2 kernel: LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The obd_ping operation failed with -107
Sep 18 13:54:05 cn2 kernel: LustreError: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: LustreError: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: Lustre: crew8-OST0003-osc-ffff81083ea5c400: Connection to service crew8-OST0003 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
Sep 18 13:54:05 cn2 kernel: Lustre: crew8-OST0003-osc-ffff81083ea5c400: Connection to service crew8-OST0003 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
Sep 18 13:54:05 cn2 kernel: Lustre: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: Lustre: Skipped 9 previous similar messages

The MGS/MDS /var/log/messages reads:

[root@mds1 ~]# tail /var/log/messages
Sep 18 13:50:58 mds1 kernel: LustreError: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection to service crew8-OST0005 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: LustreError: 167-0: This client was evicted by crew8-OST0005; in progress operations using this service will fail.
Sep 18 13:50:58 mds1 kernel: LustreError: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre: 568:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are active, abort quota recovery
Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection restored to service crew8-OST0005 using nid [EMAIL PROTECTED]
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre: MDS crew8-MDT0000: crew8-OST000b_UUID now active, resetting orphans
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages

The OSS4 box /var/log/messages:

[root@oss4 ~]# tail /var/log/messages
Sep 18 13:40:40 oss4 kernel: Lustre: crew8-OST0000: haven't heard from client 794ff121-dfec-3934-338e-6b7f861f69b6 (at [EMAIL PROTECTED]) in 195 seconds. I think it's dead, and I am evicting it.
Sep 18 13:40:40 oss4 kernel: Lustre: Skipped 25 previous similar messages
Sep 18 13:44:50 oss4 kernel: LustreError: 3954:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x8274144/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
Sep 18 13:44:50 oss4 kernel: LustreError: 3954:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 23 previous similar messages
Sep 18 13:50:58 oss4 kernel: Lustre: crew8-OST000b: received MDS connection from [EMAIL PROTECTED]
Sep 18 13:50:58 oss4 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:51:00 oss4 kernel: Lustre: crew8-OST0006: haven't heard from client crew8-mdtlov_UUID (at [EMAIL PROTECTED]) in 251 seconds. I think it's dead, and I am evicting it.
Sep 18 13:51:00 oss4 kernel: Lustre: Skipped 30 previous similar messages
Sep 18 13:55:08 oss4 kernel: LustreError: 3993:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x9085095/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
Sep 18 13:55:08 oss4 kernel: LustreError: 3993:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 35 previous similar messages

So I am seeing that OSS4 is repeatedly losing network contact with the MGS/MDS machine mds1. Each time this occurs, the clients of the disk mounted as /crewdat are told that it has been evicted and that in-progress operations will wait for recovery to complete.
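In case it is useful, this is how I was planning to sanity-check the o2ib path between oss4 and mds1 at the LNET level while the flapping is happening (only a sketch; the NID below is a placeholder, since I have left our real addresses out):

[root@oss4 ~]# lctl list_nids              # the NIDs this node advertises
[root@oss4 ~]# lctl ping <mds1-o2ib-nid>   # LNET-level ping of the MDS node

If lctl ping fails intermittently at the same times as the evictions, I would read that as a fabric problem rather than a timeout-tuning problem.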
A check on mds1 of the crew8-MDT0000 target shows that no recovery is occurring (and none has, AFAICT, since I mounted the disk on the clients post-recovery). On mds1:

cat /proc/fs/lustre/mds/crew8-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1221745009
recovery_end: 1221745185
recovered_clients: 1
unrecovered_clients: 0
last_transno: 33954534
replayed_requests: 0

I am guessing that I need to increase a Lustre client timeout value for our o2ib connections so that the new disk stops generating these messages (the /crewdat disk itself seems to be fine for user access). The other two Lustre volumes on the system seem content. Is my guess correct? If yes, what timeout value do I need to increase?

Thank you,
megan
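P.S. In case it helps to be concrete, this is the sort of change I had in mind; only a sketch, assuming the system-wide obd timeout is the right knob, and using 300 seconds purely as an illustrative value:

# check the current obd timeout (in seconds) on any server or client
cat /proc/sys/lustre/timeout

# raise it temporarily (would need doing on every server and client)
echo 300 > /proc/sys/lustre/timeout

# or set it persistently for this filesystem alone, run on the MGS node
lctl conf_param crew8.sys.timeout=300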
