After a network event (our switches bounced up and down rapidly over a few hours), it looks like our MDS got borked somewhere by all the random failovers.
Now we cannot mount the MDS. When we try, we get the following errors:

Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) r...@000001037c9db600 x85226/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: 7456:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) r...@00000104163a6000 x47117/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: Skipped 11 previous similar messages
Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) r...@0000010350a4d200 x81788/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 11 previous similar messages
Aug 19 12:37:40 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:40 mds2 kernel: LustreError: Skipped 18 previous similar messages
Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) r...@0000010414dc1850 x81855/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699960 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 18 previous similar messages
Aug 19 12:37:42 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:42 mds2 kernel: LustreError: Skipped 42 previous similar messages
Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) r...@000001037c9db600 x77144/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699962 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 42 previous similar messages
Aug 19 12:37:43 mds2 kernel: Lustre: Request x3 sent from mgc10.164.3....@tcp to NID 10.164.3....@tcp 5s ago has timed out (limit 5s).
Aug 19 12:37:43 mds2 kernel: Lustre: Changing connection for mgc10.164.3....@tcp to mgc10.164.3....@tcp_1/0...@lo
Aug 19 12:37:43 mds2 kernel: Lustre: Enabling user_xattr
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service nobackup-MDT0000, 439 recoverable clients, last_transno 3647966566
Aug 19 12:37:43 mds2 kernel: Lustre: MDT nobackup-MDT0000 now serving dev (nobackup-MDT0000/57dddb69-2475-b551-4100-e045f91ce38c), but will be in recovery for at least 5:00, or until 439 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/nobackup-MDT0000/recovery_status.
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) nobackup-MDT0000: group upcall set to /usr/sbin/l_getgroups
Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_lov.c:1070:mds_notify()) MDS nobackup-MDT0000: in recovery, not resetting orphans on nobackup-OST0000_UUID
Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000: temporarily refusing client connection from 10.164.1....@tcp
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xf150010
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed: -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3664:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3675:osc_llog_init()) osc 'nobackup-OST0000-osc' tgt 'nobackup-MDT0000' cnt 1 catid 00000101e1d979e8 rc=-2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3677:osc_llog_init()) logid 0xf150002:0x9642a0ac
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_log.c:230:lov_llog_init()) error osc_llog_init idx 0 osc 'nobackup-OST0000-osc' tgt 'nobackup-MDT0000' (rc=-2)
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(mds_log.c:220:mds_llog_init()) lov_llog_init err -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:417:llog_cat_initialize()) rc: -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_obd.c:727:lov_add_target()) add failed (-2), deleting nobackup-OST0000_UUID
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(obd_config.c:1093:class_config_llog_handler()) Err -2 on cfg command:
Aug 19 12:37:43 mds2 kernel: Lustre: cmd=cf00d 0:nobackup-mdtlov 1:nobackup-OST0000_UUID 2:0 3:1
Aug 19 12:37:43 mds2 kernel: LustreError: 15c-8: mgc10.164.3....@tcp: The configuration from log 'nobackup-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Aug 19 12:37:43 mds2 kernel: LustreError: 7438:0:(obd_mount.c:1113:server_start_targets()) failed to start server nobackup-MDT0000: -2
Aug 19 12:37:44 mds2 kernel: LustreError: 7438:0:(obd_mount.c:1623:server_fill_super()) Unable to start targets: -2
Aug 19 12:37:44 mds2 kernel: Lustre: Failing over nobackup-MDT0000
Aug 19 12:37:44 mds2 kernel: Lustre: *** setting obd nobackup-MDT0000 device 'unknown-block(8,16)' read-only ***

We have run e2fsck on the volume; it found a few errors and corrected them, but the problem persists. We also tried mounting with -o abort_recov, which resulted in an assertion (LBUG) and does not work.

Any thoughts? These lines caught my attention:

Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xf150010

We are running 1.6.6.

Thanks,

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[email protected]
(734)936-1985

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
