Hello,

On 14/09/16 20:58, Bernd Schubert wrote:
> Hi Cédric,
>
> I'm by no means familiar with Lustre code anymore, but based on the stack
> trace and function names, it seems to be a problem with the journal. Maybe
> try to do an 'e2fsck -f', which would replay the journal and possibly clean
> up the file it has a problem with.

Thanks for the tip. Unfortunately, I had already run a filesystem check as part of my recovery attempts (and even ran a dry run afterwards, to make sure no errors were left dangling).

Cédric

>
> Cheers,
> Bernd
>
>
> On Wednesday, September 14, 2016 9:28:38 AM CEST Cédric Dufour - Idiap
> Research Institute wrote:
>> Hello,
>>
>> Last Friday, during normal operations, our MDS froze with the following
>> LBUG, which happens again as soon as one mounts the MDT again:
>>
>> Sep 13 15:10:28 n00a kernel: [ 8414.600584] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:
>> Sep 13 15:10:28 n00a kernel: [ 8414.612825] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) LBUG
>> Sep 13 15:10:28 n00a kernel: [ 8414.619833] Pid: 11696, comm: lfsck
>> Sep 13 15:10:28 n00a kernel: [ 8414.619835]
>> Sep 13 15:10:28 n00a kernel: [ 8414.619835] Call Trace:
>> Sep 13 15:10:28 n00a kernel: [ 8414.619850] [<ffffffffa0224822>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
>> Sep 13 15:10:28 n00a kernel: [ 8414.619857] [<ffffffffa0224db2>] lbug_with_loc+0x42/0xa0 [libcfs]
>> Sep 13 15:10:28 n00a kernel: [ 8414.619864] [<ffffffffa0b11890>] osd_trans_start+0x250/0x630 [osd_ldiskfs]
>> Sep 13 15:10:28 n00a kernel: [ 8414.619870] [<ffffffffa0b0e748>] ? osd_declare_xattr_set+0x58/0x230 [osd_ldiskfs]
>> Sep 13 15:10:28 n00a kernel: [ 8414.619876] [<ffffffffa0c6ffc7>] lod_trans_start+0x177/0x200 [lod]
>> Sep 13 15:10:28 n00a kernel: [ 8414.619881] [<ffffffffa0cbd752>] lfsck_namespace_double_scan+0x1122/0x1e50 [lfsck]
>> Sep 13 15:10:28 n00a kernel: [ 8414.619888] [<ffffffff8136741b>] ? thread_return+0x3e/0x10c
>> Sep 13 15:10:28 n00a kernel: [ 8414.619894] [<ffffffff81038b87>] ? enqueue_task_fair+0x58/0x5d
>> Sep 13 15:10:28 n00a kernel: [ 8414.619899] [<ffffffffa0cb68ea>] lfsck_double_scan+0x5a/0x70 [lfsck]
>> Sep 13 15:10:28 n00a kernel: [ 8414.619904] [<ffffffffa0cb7dfd>] lfsck_master_engine+0x50d/0x650 [lfsck]
>> Sep 13 15:10:28 n00a kernel: [ 8414.619909] [<ffffffffa0cb78f0>] ? lfsck_master_engine+0x0/0x650 [lfsck]
>> Sep 13 15:10:28 n00a kernel: [ 8414.619915] [<ffffffff810534c4>] kthread+0x7b/0x83
>> Sep 13 15:10:28 n00a kernel: [ 8414.619918] [<ffffffff810369d3>] ? finish_task_switch+0x48/0xb9
>> Sep 13 15:10:28 n00a kernel: [ 8414.619924] [<ffffffff8101092a>] child_rip+0xa/0x20
>> Sep 13 15:10:28 n00a kernel: [ 8414.619928] [<ffffffff81053449>] ? kthread+0x0/0x83
>> Sep 13 15:10:28 n00a kernel: [ 8414.619931] [<ffffffff81010920>] ? child_rip+0x0/0x20
>>
>>
>> I originally had the LFSCK launched in "dry-run" mode:
>>
>> lctl lfsck_start --device lustre-1-MDT0000 --dryrun on --type namespace
>>
>> The LFSCK was reported as completed (I was 'watch[ing] -n 1' on a terminal)
>> before the LBUG popped up; now, I can't even get any output:
>>
>> cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace # just hangs there indefinitely
>>
>>
>> I remember seeing a lfsck_namespace file in the MDT's underlying LDISKFS;
>> is there anything sensible I can do with it (e.g. would deleting it
>> solve the situation)?
>> What else could I do?
>>
>>
>> Thanks for your answers and best regards,
>>
>> Cédric D.
>>
>>
>> PS: I originally posted this message on the HPDD-discuss mailing list
>> and just realized it was the wrong place; sorry for any cross-posting.
>> _______________________________________________
>> lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
