Hello,

After looking at the LFSCK source code - specifically lustre/lfsck/lfsck_namespace.c - I was led to believe that deleting the "lfsck_namespace" file from the MDT's underlying LDISKFS should be safe enough (NB: I had a backup of the MDT on a disconnected DRBD peer for the worst-case scenario):

# mount -t ldiskfs -o rw /dev/mdt /mnt/mdt.ldiskfs
# rm -f /mnt/mdt.ldiskfs/lfsck_namespace
# umount /mnt/mdt.ldiskfs
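For anyone tempted to do the same: a read-only look at the file beforehand is probably wise, e.g. with debugfs (here, /dev/mdt stands for the same MDT block device as in the mount commands above):

# debugfs -c -R 'stat lfsck_namespace' /dev/mdt 2>/dev/null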
This allowed me to start the MDT again. The "lfsck_namespace" file reappeared with a "blank" payload:

# mount.lustre -o rw /dev/mdt /lustre/mdt
# debugfs -c -R 'stat lfsck_namespace' /dev/drbd3 2>/dev/null
Inode: 156   Type: regular    Mode:  0644   Flags: 0x0
Generation: 1420451772    Version: 0x00000000:00000000
User:     0   Group:     0   Size: 8192
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 16
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
 atime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
 mtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
crtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
Size of extra inode fields: 28
Extended attributes stored in inode body:
  lma = "00 00 00 00 00 00 00 00 03 00 00 00 02 00 00 00 0a 00 00 00 00 00 00 00 " (24)
  lma: fid=[0x200000003:0xa:0x0] compat=0 incompat=0
  lfsck_namespace = "03 9d 62 a0 00 00 00 00 [... 248 further 00 bytes ...]" (256)
BLOCKS:
(0-1):178240-178241
TOTAL: 2

# cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: init
flags:
param: <NULL>
time_since_last_completed: N/A
time_since_latest_start: N/A
time_since_last_checkpoint: N/A
latest_start_position: N/A, N/A, N/A
last_checkpoint_position: N/A, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 0
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 0
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 0
run_time_phase1: 0 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 0 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A
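Side note: the on-disk xattr and the proc output above do agree - the first four payload bytes, "03 9d 62 a0", are just the magic 0xa0629d03 in little-endian byte order:

# echo '03 9d 62 a0' | awk '{ print "0x" $4 $3 $2 $1 }'
0xa0629d03

So the freshly recreated file is at least self-consistent.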
Rather ungraceful - I don't know where that would have led me if I hadn't used the "--dryrun on" flag on the LFSCK command (or was that flag actually ignored? It did not appear as a "param: ..." value in the lfsck_namespace output while LFSCK was running) - but it got me out of the present deadlock.
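Should I ever dare re-run it, I suppose the way to confirm whether the flag is honoured would be to check the proc file right after starting - assuming an active dry run shows up there as "param: dryrun", which I have not verified on 2.5.2:

# lctl lfsck_start --device lustre-1-MDT0000 --dryrun on --type namespace
# grep '^param:' /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace
param: dryrun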
Forgot to mention: (still...) running Lustre 2.5.2.

Best,

Cédric

On 15/09/16 08:05, Cédric Dufour - Idiap Research Institute wrote:
> Hello,
>
> On 14/09/16 20:58, Bernd Schubert wrote:
>> Hi Cédric,
>>
>> I'm by no means familiar with Lustre code anymore, but based on the stack
>> trace and function names, it seems to be a problem with the journal. Maybe
>> try to do an 'e2fsck -f', which would replay the journal and possibly clean
>> up the file it has a problem with.
> Thanks for the tip.
>
> Unfortunately, I did perform a filesystem check as part of my attempts at
> recovery (and even ran a dry run afterwards, to make sure no errors were
> dangling).
>
> Cédric
>
>> Cheers,
>> Bernd
>>
>> On Wednesday, September 14, 2016 9:28:38 AM CEST Cédric Dufour - Idiap
>> Research Institute wrote:
>>> Hello,
>>>
>>> Last Friday, during normal operations, our MDS froze with the following
>>> LBUG, which now happens again as soon as one mounts the MDT:
>>>
>>> Sep 13 15:10:28 n00a kernel: [ 8414.600584] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:
>>> Sep 13 15:10:28 n00a kernel: [ 8414.612825] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) LBUG
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619833] Pid: 11696, comm: lfsck
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619835]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619835] Call Trace:
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619850] [<ffffffffa0224822>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619857] [<ffffffffa0224db2>] lbug_with_loc+0x42/0xa0 [libcfs]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619864] [<ffffffffa0b11890>] osd_trans_start+0x250/0x630 [osd_ldiskfs]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619870] [<ffffffffa0b0e748>] ? osd_declare_xattr_set+0x58/0x230 [osd_ldiskfs]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619876] [<ffffffffa0c6ffc7>] lod_trans_start+0x177/0x200 [lod]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619881] [<ffffffffa0cbd752>] lfsck_namespace_double_scan+0x1122/0x1e50 [lfsck]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619888] [<ffffffff8136741b>] ? thread_return+0x3e/0x10c
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619894] [<ffffffff81038b87>] ? enqueue_task_fair+0x58/0x5d
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619899] [<ffffffffa0cb68ea>] lfsck_double_scan+0x5a/0x70 [lfsck]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619904] [<ffffffffa0cb7dfd>] lfsck_master_engine+0x50d/0x650 [lfsck]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619909] [<ffffffffa0cb78f0>] ? lfsck_master_engine+0x0/0x650 [lfsck]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619915] [<ffffffff810534c4>] kthread+0x7b/0x83
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619918] [<ffffffff810369d3>] ? finish_task_switch+0x48/0xb9
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619924] [<ffffffff8101092a>] child_rip+0xa/0x20
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619928] [<ffffffff81053449>] ? kthread+0x0/0x83
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619931] [<ffffffff81010920>] ? child_rip+0x0/0x20
>>>
>>> I originally had the LFSCK launched in "dry-run" mode:
>>>
>>>   lctl lfsck_start --device lustre-1-MDT0000 --dryrun on --type namespace
>>>
>>> The LFSCK was reported completed (I was 'watch[ing] -n 1' on a terminal)
>>> before the LBUG popped up; now, I can't even get any output:
>>>
>>>   cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace  # just hangs
>>>   there indefinitely
>>>
>>> I remember seeing a lfsck_namespace file in the MDT's underlying LDISKFS;
>>> is there anything sensible I can do with it (e.g. would deleting it solve
>>> the situation)? What else could I do?
>>>
>>> Thanks for your answers and best regards,
>>>
>>> Cédric D.
>>>
>>> PS: I had this message originally posted on the HPDD-discuss mailing list
>>> and just realized it was the wrong place; sorry for any cross-posting.

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
