Well, I figured someone would ask that. :) The last messages that make it to syslog prior to the crash are....
Jun  2 10:29:54 hpcmds kernel: LDISKFS FS on md2, internal journal
Jun  2 10:29:54 hpcmds kernel: LDISKFS-fs: recovery complete.
Jun  2 10:29:54 hpcmds kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun  2 10:29:54 hpcmds kernel: kjournald starting.  Commit interval 5 seconds
Jun  2 10:29:54 hpcmds kernel: LDISKFS FS on md2, internal journal
Jun  2 10:29:54 hpcmds kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun  2 10:29:54 hpcmds kernel: Lustre: MGS MGS started
Jun  2 10:29:54 hpcmds kernel: Lustre: Enabling user_xattr
Jun  2 10:29:54 hpcmds kernel: Lustre: 4540:0:(mds_fs.c:446:mds_init_server_data()) RECOVERY: service ufhpc-MDT0000, 100 recoverable clients, last_transno 9412464331
Jun  2 10:29:54 hpcmds kernel: Lustre: MDT ufhpc-MDT0000 now serving dev (ufhpc-MDT0000/cac99db5-a66a-a6ac-4649-6ec8cc2dc0e7), but will be in recovery until 100 clients reconnect, or if no clients reconnect for 4:10; during that time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/ufhpc-MDT0000/recovery_status.
Jun  2 10:29:55 hpcmds kernel: Lustre: 4540:0:(mds_lov.c:858:mds_notify()) MDS ufhpc-MDT0000: in recovery, not resetting orphans on ufhpc-OST0004_UUID
Jun  2 10:29:55 hpcmds kernel: Lustre: 4540:0:(mds_lov.c:858:mds_notify()) MDS ufhpc-MDT0000: in recovery, not resetting orphans on ufhpc-OST0005_UUID

Note that all of the clients are powered off and the OSSes are currently unmounted (though they appear to be fine). Unfortunately, getting the messages off the console (in the machine room) means using pencil and paper (you'd think we'd have something as fancy as an IP-KVM console server, but alas, we do things, ahem, "inexpensively" here).

I'm going to let the md mirrors resync before I try it again (although I don't think that should be an issue). If it crashes a third time, and I suspect it will, I'll include some of the stack trace.
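If transcribing the oops by hand gets old, one cheap workaround is netconsole, which streams kernel messages over UDP to another box on the same LAN, so even a trace that scrolls off screen (or a hard lockup) leaves a complete log on the receiver. A minimal sketch, assuming the MDS kernel has the netconsole module available and a second machine to catch the stream; all IP/MAC addresses and interface names below are placeholders:

```shell
# On the receiving machine (say 192.168.1.10), capture the stream:
#   nc -u -l 6666 | tee mds-console.log

# On the MDS, before attempting the mount, load netconsole with
# src-port@src-ip/dev,dst-port@dst-ip/dst-mac (placeholders here):
modprobe netconsole netconsole=6665@192.168.1.20/eth0,6666@192.168.1.10/00:11:22:33:44:55

# Raise the console log level so the full oops is emitted:
dmesg -n 8
```

Since netconsole sends each message as it is printed, everything up to the instant of the crash should already be safe on the receiver.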
Of course, part of the problem is that the trace is deep enough that it scrolls off screen and we can't see the top of it (which is the useful part). :) I was hoping for a silver bullet, but...

Thanks,

Charlie Taylor
UF HPC Center

On Jun 2, 2008, at 11:16 AM, Johann Lombardi wrote:

> On Mon, Jun 02, 2008 at 11:02:11AM -0400, Charles Taylor wrote:
>> We lost our MDS/MGS to a power failure yesterday evening. Just to
>> be safe, we ran e2fsck on the combined MDT/MGT and there were only a
>> couple of minor complaints about HTREE issues that it fixed. The
>> MDT/MGT now fsck's cleanly. The problem is that, despite the clean
>> e2fsck, the MGS is crashing in the lustre mount code when attempting
>> to mount the MDT.
>
> Where is it crashing exactly? Any stack traces, assertion failures ...
> on the console?
>
> Johann

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
