Andreas, sorry I missed your reply yesterday.

Here is how we fixed it. We deleted OBJECTS/* and CATALOGS, then shut down all the OSTs. At this point the MDS mounted correctly with -o abort_recov. We then remounted the OSTs (with recovery) and all worked well. I have since enabled (re)quotas and heartbeat, and bounced the servers a few times. All appears well.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[email protected]
(734)936-1985

On Aug 20, 2009, at 9:09 AM, Brock Palen wrote:

> Some additional details:
> I mounted the MDS as ldiskfs and deleted the files in OBJECTS/* and CATALOGS,
> then remounted as lustre. Same issue.
> I also did a writeconf and restarted all the servers. I saw messages on the
> MGS that new config logs were being created, but still got the same error on
> the MDS trying to start up.
> Is there a way to get Lustre to stop trying to open 0xf150010:80d24629
> and not go through recovery?
>
> If not, can I format a new MDS, and just untar ROOTS/ and apply the
> extended attributes to ROOTS from the old MDS filesystem?
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> [email protected]
> (734)936-1985
>
> On Aug 19, 2009, at 12:57 PM, Brock Palen wrote:
>
>> After a network event (switches bouncing), it looks like our MDS got
>> borked somewhere from all the random failovers (switches came up and
>> down rapidly over a few hours).
>>
>> Now we cannot mount the MDS; when we try, we get the following errors:
>>
>> Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
>> Aug 19 12:37:39 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) r...@000001037c9db600 x85226/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
>> Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
>> Aug 19 12:37:39 mds2 kernel: LustreError: 7456:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) r...@00000104163a6000 x47117/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
>> Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
>> Aug 19 12:37:39 mds2 kernel: LustreError: Skipped 11 previous similar messages
>> Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) r...@0000010350a4d200 x81788/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
>> Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 11 previous similar messages
>> Aug 19 12:37:40 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
>> Aug 19 12:37:40 mds2 kernel: LustreError: Skipped 18 previous similar messages
>> Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) r...@0000010414dc1850 x81855/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699960 ref 1 fl Interpret:/0/0 rc -19/0
>> Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 18 previous similar messages
>> Aug 19 12:37:42 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available for connect (no target)
>> Aug 19 12:37:42 mds2 kernel: LustreError: Skipped 42 previous similar messages
>> Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) r...@000001037c9db600 x77144/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699962 ref 1 fl Interpret:/0/0 rc -19/0
>> Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 42 previous similar messages
>> Aug 19 12:37:43 mds2 kernel: Lustre: Request x3 sent from mgc10.164.3....@tcp to NID 10.164.3....@tcp 5s ago has timed out (limit 5s).
>> Aug 19 12:37:43 mds2 kernel: Lustre: Changing connection for mgc10.164.3....@tcp to mgc10.164.3....@tcp_1/0...@lo
>> Aug 19 12:37:43 mds2 kernel: Lustre: Enabling user_xattr
>> Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service nobackup-MDT0000, 439 recoverable clients, last_transno 3647966566
>> Aug 19 12:37:43 mds2 kernel: Lustre: MDT nobackup-MDT0000 now serving dev (nobackup-MDT0000/57dddb69-2475-b551-4100-e045f91ce38c), but will be in recovery for at least 5:00, or until 439 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/nobackup-MDT0000/recovery_status.
>> Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) nobackup-MDT0000: group upcall set to /usr/sbin/l_getgroups
>> Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
>> Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_lov.c:1070:mds_notify()) MDS nobackup-MDT0000: in recovery, not resetting orphans on nobackup-OST0000_UUID
>> Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000: temporarily refusing client connection from 10.164.1....@tcp
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629: rc -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xf150010
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed: -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3664:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3675:osc_llog_init()) osc 'nobackup-OST0000-osc' tgt 'nobackup-MDT0000' cnt 1 catid 00000101e1d979e8 rc=-2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3677:osc_llog_init()) logid 0xf150002:0x9642a0ac
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_log.c:230:lov_llog_init()) error osc_llog_init idx 0 osc 'nobackup-OST0000-osc' tgt 'nobackup-MDT0000' (rc=-2)
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(mds_log.c:220:mds_llog_init()) lov_llog_init err -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:417:llog_cat_initialize()) rc: -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_obd.c:727:lov_add_target()) add failed (-2), deleting nobackup-OST0000_UUID
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(obd_config.c:1093:class_config_llog_handler()) Err -2 on cfg command:
>> Aug 19 12:37:43 mds2 kernel: Lustre: cmd=cf00d 0:nobackup-mdtlov 1:nobackup-OST0000_UUID 2:0 3:1
>> Aug 19 12:37:43 mds2 kernel: LustreError: 15c-8: mgc10.164.3....@tcp: The configuration from log 'nobackup-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7438:0:(obd_mount.c:1113:server_start_targets()) failed to start server nobackup-MDT0000: -2
>> Aug 19 12:37:44 mds2 kernel: LustreError: 7438:0:(obd_mount.c:1623:server_fill_super()) Unable to start targets: -2
>> Aug 19 12:37:44 mds2 kernel: Lustre: Failing over nobackup-MDT0000
>> Aug 19 12:37:44 mds2 kernel: Lustre: *** setting obd nobackup-MDT0000 device 'unknown-block(8,16)' read-only ***
>>
>> We ran e2fsck on the volume, which found a few errors and corrected them,
>> but the problem persists. We also tried mounting with -o abort_recov;
>> this resulted in an assertion (LBUG) and does not work.
>> Any thoughts? The lines:
>>
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629: rc -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xf150010
>>
>> catch my attention.
>> Thanks, we are running 1.6.6.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> [email protected]
>> (734)936-1985

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
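
A note for anyone reading the log excerpts above: the negative rc values in them appear to be negated Linux errno numbers, so rc -2 would be ENOENT (the llog file could not be found) and -19 would be ENODEV (no target to connect to). Below is a rough Python sketch for decoding them from a syslog line. The helper name decode_rc and the regular expression are my own, not part of Lustre, and the pattern only covers the "rc -2", "rc=-2", "rc: -2" and "(-19)" spellings; forms such as "err -2" or "failed: -2" are not matched.

```python
import errno
import os
import re

# Hypothetical pattern: a negative number that follows "rc" (with a space,
# "=" or ":") or that opens a parenthesis, as in "processing error (-19)".
RC_PATTERN = re.compile(r"(?:\brc[ =:]+|\()(-\d+)")


def decode_rc(line):
    """Return (code, errno name, message) for each distinct negative rc in a line."""
    codes = {int(match) for match in RC_PATTERN.findall(line)}
    return [
        (code, errno.errorcode.get(-code, "UNKNOWN"), os.strerror(-code))
        for code in sorted(codes, reverse=True)
    ]


if __name__ == "__main__":
    # Two lines copied from the excerpts in this thread.
    samples = [
        "LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) "
        "error looking up logfile 0xf150010:0x80d24629: rc -2",
        "LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) "
        "@@@ processing error (-19) rc -19/0",
    ]
    for sample in samples:
        for code, name, message in decode_rc(sample):
            print(f"{code}: {name} ({message})")
```

Run against the llog lines, this points at a missing file rather than a corrupt one, which seems consistent with the fix of removing the stale OBJECTS/* and CATALOGS entries.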
