Greetings. It's Monday (sigh). Over the weekend I lost one of the two dual-core Opteron 275 CPUs in my OSS box; /var/log/messages contained many "bus error on processor" messages. So Monday I rebooted the OSS with only the one remaining dual-core CPU. The box came up just fine and I mounted the three Lustre OST disks I have on that box. (CentOS 5, 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux)
The problem is that my MGS/MDS box now cannot access/use the disks on that box. The single MDT volume mounts without error, but I see the following messages in the MGS/MDS /var/log/messages file:

Sep 29 10:02:40 mds1 kernel: Lustre: MDT crew3-MDT0000 now serving dev (crew3-MDT0000/be7a58cd-e259-823f-486b-e974551d7ad6) with recovery enabled
Sep 29 10:02:40 mds1 kernel: Lustre: Server crew3-MDT0000 on device /dev/md0 has started
Sep 29 10:02:40 mds1 kernel: Lustre: MDS crew3-MDT0000: crew3d1_UUID now active, resetting orphans
Sep 29 10:02:40 mds1 kernel: Lustre: Skipped 2 previous similar messages
Sep 29 10:03:29 mds1 kernel: LustreError: 26914:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Sep 29 10:03:29 mds1 kernel: LustreError: 26914:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x17040407/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Sep 29 10:03:29 mds1 kernel: LustreError: 26915:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 501 on unconnected MGS
Sep 29 10:03:29 mds1 kernel: LustreError: 26915:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x17040408/t0 o501-><?>@<?>:-1 lens 200/0 ref 0 fl Interpret:/0/0 rc -107/0

The messages do not repeat. On the MGS/MDS I also have:

[EMAIL PROTECTED] ~]# cat /proc/fs/lustre/mds/crew3-MDT0000/recovery_status
status: INACTIVE

lctl can successfully ping the OSS, and the OSTs appear correctly in lctl dl. The peer list on the MGS/MDS (which is still successfully serving other disks) appears normal.
[EMAIL PROTECTED] ~]# cat /proc/sys/lnet/peers
nid               refs state  max rtr min  tx  min   queue
[EMAIL PROTECTED]  1   ~rtr   0   0   0    0   0     0
[EMAIL PROTECTED]  1   ~rtr   8   8   8    8   7     0
[EMAIL PROTECTED]  1   ~rtr   8   8   8    8   -527  0
[EMAIL PROTECTED]  1   ~rtr   8   8   8    8   -261  0
[EMAIL PROTECTED]  1   ~rtr   8   8   8    8   7     0
[EMAIL PROTECTED]  1   ~rtr   8   8   8    8   6     0
[EMAIL PROTECTED]  1   ~rtr   8   8   8    8   -239  0
[EMAIL PROTECTED]  1   ~rtr   8   8   8    8   -2    0
[EMAIL PROTECTED]  1   ~rtr   8   8   8    8   -4    0
[EMAIL PROTECTED]  1   ~rtr   8   8   8    8   -4    0
[EMAIL PROTECTED]  1   ~rtr   8   8   8    8   -42   0
[EMAIL PROTECTED]  1   ~rtr   8   8   8    8   7     0

With this information I searched on the error and found http://lustre.sev.net.ua/changeset/119/trunk/lustre. The page was timestamped 3/12/08 by author shadow, with the info below (from trunk/lustre/ChangeLog, r100 to r119):

Severity   : major
Frequency  : frequent on X2 node
Bugzilla   : 15010
Description: mdc_set_open_replay_data LBUG
Details    : Set replay data for requests that are eligible for replay.

Severity   : normal
Bugzilla   : 14321
Description: lustre_mgs: operation 101 on unconnected MGS
Details    : When MGC is disconnected from MGS long enough, MGS will evict the
             MGC, and later on MGC cannot successfully connect to MGS, and a lot
             of the error messages complaining that MGS is not connected.

Severity   : major
Frequency  : on start mds
Bugzilla   : 14884

Okay. I am still running 2.6.18-53.1.13.el5_lustre.1.6.4.3smp. Is there a way to get the MGS/MDS to once again access the OSTs associated with the MDT? The OSS box looks perfectly fine (minus one CPU); all the errors appear on the MGS/MDS box. The Lustre disk will not mount on any of my clients. The message

mount.lustre: mount [EMAIL PROTECTED]:/crew3 at /crew3 failed: Transport endpoint is not connected

is all that occurs. Suggestions and advice greatly appreciated. Do I just have to wait a long time to let the disk "find itself"? Using lctl device xx and activate did not help.
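In case it helps diagnose, the activate attempt I mentioned went roughly like this (a sketch only; the device index 11 below is a placeholder for whatever index lctl dl actually prints for the affected OSC on your MGS/MDS):

```shell
# On the MGS/MDS: list configured devices and note the index and
# state (UP/IN/AC) of the OSC entry for the affected OST.
lctl dl

# Try to (re)activate that OSC by its device index
# (11 is a placeholder taken from the lctl dl output above).
lctl --device 11 activate

# Re-check the device state afterwards.
lctl dl
```

In my case the device showed up in lctl dl but activating it did not change the behavior.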
megan

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
