> Rather ungraceful - I don't know where that would have led me if I hadn't
> used the "--dryrun on" flag on the LFSCK command (or was that flag actually
> ignored? for it did not appear as a "param: ..." in the lfsck_namespace
> output when LFSCK was running) - but it got me out of the present deadlock.

The "-n" option is NOT ignored, but unfortunately it was not properly
displayed via the lproc interface. That issue has been resolved by the patch
http://review.whamcloud.com/#/c/7146/25/lustre/lfsck/lfsck_lib.c
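For anyone on an unpatched tree, the quickest check is to look at the "param:" line of the proc file directly. A minimal sketch, run here against a saved copy of the output (the sample lines are reproduced from the transcript in this thread; the /tmp path is only illustrative):

```shell
# Save a copy of the lfsck_namespace output and check whether the
# dry-run parameter is displayed. On an unpatched 2.5.x MDS "param:"
# reads "<NULL>" even though the LFSCK was started with "-n on".
cat > /tmp/lfsck_namespace.out <<'EOF'
name: lfsck_namespace
status: scanning-phase1
param: <NULL>
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
EOF
grep '^param:' /tmp/lfsck_namespace.out
```

On a live MDS one would of course `cat /proc/fs/lustre/mdd/<device>/lfsck_namespace` instead of the saved copy.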
> After looking at the LFSCK source code - specifically
> lustre/lfsck/lfsck_namespace.c - I was led to believe that suppressing the
> "lfsck_namespace" file in the MDT LDISKFS should be safe enough (NB:
> I had a backup of the MDT on a disconnected DRBD peer for the worst-case
> scenario):

You are right. It is safe to remove the "lfsck_namespace" file under
"ldiskfs" mode. The side effects will be:

1) The namespace LFSCK will scan the system from the beginning when it is
   triggered again, not from the last checkpoint, since the related
   information has been removed.

2) There will be one invalid OI mapping left in the ldiskfs OI file.
   Usually, nobody will access such an OI mapping anymore.

> first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877

That means some issue (the first one) has been detected under the "dryrun"
mode. It is under the directory [0x135d:0x395d90f2:0x0]. You can get the
path name via "lfs fid2path".

As for the original LBUG(), I cannot point out the root cause. But to be
clear, LFSCK can only repair Lustre-level inconsistencies; that does NOT
include backend on-disk consistency, meaning it can NOT replace the local
e2fsck.

--
Cheers,
Nasf

-----Original Message-----
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Cédric Dufour - Idiap Research Institute
Sent: Monday, September 19, 2016 8:53 PM
To: Yong, Fan <fan.y...@intel.com>; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] (LFSCK) LBUG: ASSERTION( get_current()->journal_info == ((void *)0) ) failed - (ungracefully) SOLVED

Hello Nasf,

Below the entire output of LFSCK start/stop, using the '--dryrun' option and its '-n' shortcut:

root@n00b:~ # lctl lfsck_start --device lustre-2-MDT0000 --dryrun on --type namespace
Started LFSCK on the device lustre-2-MDT0000: namespace.
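As a side note on the "lfs fid2path" step Nasf mentions above, the FID can be cut out of the first_failure_position line mechanically. A hedged sketch (the client mountpoint /mnt/lustre is an assumption; the command is only printed here, not executed, since it requires a mounted Lustre client):

```shell
# Pull the FID out of the first_failure_position line (value taken from
# this thread) and print the "lfs fid2path" command one would run on a
# mounted Lustre client. /mnt/lustre is a hypothetical mountpoint.
line='first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877'
fid=$(echo "$line" | sed -n 's/.*\(\[0x[^]]*\]\).*/\1/p')
echo "lfs fid2path /mnt/lustre $fid"
```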
# 2016-09-19 14:47:14
root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937135 seconds
time_since_latest_start: 11 seconds
time_since_last_checkpoint: N/A
latest_start_position: 16, N/A, N/A
last_checkpoint_position: N/A, N/A, N/A
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 147222
checked_phase2: 0
updated_phase1: 1585
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 2923
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 11 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 13383 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 12842 items/sec
real-time_speed_phase2: N/A
current_position: 81972, [0x200014ba1:0xc1a:0x0], 3037912543715234557

# 2016-09-19 14:47:25
root@n00b:~ # lctl lfsck_stop --device lustre-2-MDT0000
Stopped LFSCK on the device lustre-2-MDT0000.
# 2016-09-19 14:47:34
root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: stopped
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937147 seconds
time_since_latest_start: 23 seconds
time_since_last_checkpoint: 3 seconds
latest_start_position: 16, N/A, N/A
last_checkpoint_position: 562059, [0x2000014f0:0x12ea5:0x0], 955442147668871112
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 305529
checked_phase2: 0
updated_phase1: 9172
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5339
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 20 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 15276 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A

# 2016-09-19 14:47:37
root@n00b:~ # lctl lfsck_start --device lustre-2-MDT0000 -n on --type namespace
Started LFSCK on the device lustre-2-MDT0000: namespace.
# 2016-09-19 14:47:50
root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937163 seconds
time_since_latest_start: 3 seconds
time_since_last_checkpoint: 19 seconds
latest_start_position: 562060, [0x2000014f0:0x12ea5:0x0], 955476103740388659
last_checkpoint_position: 562059, [0x2000014f0:0x12ea5:0x0], 955442147668871112
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 419903
checked_phase2: 0
updated_phase1: 59322
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5506
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 23 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 18256 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 44125 items/sec
real-time_speed_phase2: N/A
current_position: 622857, N/A, N/A

# 2016-09-19 14:47:53
root@n00b:~ # lctl lfsck_stop --device lustre-2-MDT0000
Stopped LFSCK on the device lustre-2-MDT0000.
# 2016-09-19 14:47:56
root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: stopped
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937169 seconds
time_since_latest_start: 9 seconds
time_since_last_checkpoint: 3 seconds
latest_start_position: 562060, [0x2000014f0:0x12ea5:0x0], 955476103740388659
last_checkpoint_position: 646716, N/A, N/A
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 456954
checked_phase2: 0
updated_phase1: 59322
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5510
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 26 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 17575 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A

# 2016-09-19 14:47:59

Given my previous mileage, I'd rather not let LFSCK run to completion again.

Best,

Cédric

On 18/09/16 03:05, Yong, Fan wrote:
> To be clear, what are the console and proc output if "-n on" is specified
> when starting the namespace LFSCK?
>
> --
> Regards,
> Nasf
>
> -----Original Message-----
> From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org]
> On Behalf Of Cédric Dufour - Idiap Research Institute
> Sent: Thursday, September 15, 2016 10:09 PM
> To: lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] (LFSCK) LBUG: ASSERTION(
> get_current()->journal_info == ((void *)0) ) failed - (ungracefully)
> SOLVED
>
> Hello,
>
> After looking at the LFSCK source code - specifically
> lustre/lfsck/lfsck_namespace.c - I was led to believe that suppressing
> the "lfsck_namespace" file in the MDT LDISKFS should be safe enough
> (NB: I had a backup of the MDT on a disconnected DRBD peer for the
> worst-case scenario):
>
> # mount -t ldiskfs -o rw /dev/mdt /mnt/mdt.ldiskfs
> # rm -f /mnt/mdt.ldiskfs/lfsck_namespace
> # umount /mnt/mdt.ldiskfs
>
> This allowed the MDT to be started again. The "lfsck_namespace" file
> reappeared with a "blank" payload:
>
> # mount.lustre -o rw /dev/mdt /lustre/mdt
>
> # debugfs -c -R 'stat lfsck_namespace' /dev/drbd3 2>/dev/null
> Inode: 156   Type: regular   Mode: 0644   Flags: 0x0
> Generation: 1420451772   Version: 0x00000000:00000000
> User: 0   Group: 0   Size: 8192
> File ACL: 0   Directory ACL: 0
> Links: 1   Blockcount: 16
> Fragment: Address: 0   Number: 0   Size: 0
>  ctime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
>  atime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
>  mtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
> crtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
> Size of extra inode fields: 28
> Extended attributes stored in inode body:
>   lma = "00 00 00 00 00 00 00 00 03 00 00 00 02 00 00 00 0a 00 00 00 00 00 00 00 " (24)
>   lma: fid=[0x200000003:0xa:0x0] compat=0 incompat=0
>   lfsck_namespace = "03 9d 62 a0 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (256)
> BLOCKS:
> (0-1):178240-178241
> TOTAL: 2
>
> # cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace
> name: lfsck_namespace
> magic: 0xa0629d03
> version: 2
> status: init
> flags:
> param: <NULL>
> time_since_last_completed: N/A
> time_since_latest_start: N/A
> time_since_last_checkpoint: N/A
> latest_start_position: N/A, N/A, N/A
> last_checkpoint_position: N/A, N/A, N/A
> first_failure_position: N/A, N/A, N/A
> checked_phase1: 0
> checked_phase2: 0
> updated_phase1: 0
> updated_phase2: 0
> failed_phase1: 0
> failed_phase2: 0
> dirs: 0
> M-linked: 0
> nlinks_repaired: 0
> lost_found: 0
> success_count: 0
> run_time_phase1: 0 seconds
> run_time_phase2: 0 seconds
> average_speed_phase1: 0 items/sec
> average_speed_phase2: 0 objs/sec
> real-time_speed_phase1: N/A
> real-time_speed_phase2: N/A
> current_position: N/A
>
> Rather ungraceful - I don't know where that would have led me if I
> hadn't used the "--dryrun on" flag on the LFSCK command (or was that
> flag actually ignored? for it did not appear as a "param: ..." in the
> lfsck_namespace output when LFSCK was running) - but it got me out of
> the present deadlock.
>
> Forgot to mention: (still...)
> running Lustre 2.5.2
>
> Best,
>
> Cédric
>
> On 15/09/16 08:05, Cédric Dufour - Idiap Research Institute wrote:
>> Hello,
>>
>> On 14/09/16 20:58, Bernd Schubert wrote:
>>> Hi Cédric,
>>>
>>> I'm by no means familiar with Lustre code anymore, but based on the
>>> stack trace and function names, it seems to be a problem with the
>>> journal. Maybe try to do an 'e2fsck -f', which would replay the
>>> journal and possibly clean up the file it has problems with.
>> Thanks for the tip.
>>
>> Unfortunately, I did perform a filesystem check as part of my recovery
>> attempts (and even ran a dry run afterwards, to make sure no errors
>> were dangling).
>>
>> Cédric
>>
>>> Cheers,
>>> Bernd
>>>
>>> On Wednesday, September 14, 2016 9:28:38 AM CEST Cédric Dufour -
>>> Idiap Research Institute wrote:
>>>> Hello,
>>>>
>>>> Last Friday, during normal operations, our MDS froze with the
>>>> following LBUG, which happens again as soon as one mounts the MDT
>>>> again:
>>>>
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.600584] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.612825] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) LBUG
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619833] Pid: 11696, comm: lfsck
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619835] Call Trace:
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619850]  [<ffffffffa0224822>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619857]  [<ffffffffa0224db2>] lbug_with_loc+0x42/0xa0 [libcfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619864]  [<ffffffffa0b11890>] osd_trans_start+0x250/0x630 [osd_ldiskfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619870]  [<ffffffffa0b0e748>] ? osd_declare_xattr_set+0x58/0x230 [osd_ldiskfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619876]  [<ffffffffa0c6ffc7>] lod_trans_start+0x177/0x200 [lod]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619881]  [<ffffffffa0cbd752>] lfsck_namespace_double_scan+0x1122/0x1e50 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619888]  [<ffffffff8136741b>] ? thread_return+0x3e/0x10c
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619894]  [<ffffffff81038b87>] ? enqueue_task_fair+0x58/0x5d
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619899]  [<ffffffffa0cb68ea>] lfsck_double_scan+0x5a/0x70 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619904]  [<ffffffffa0cb7dfd>] lfsck_master_engine+0x50d/0x650 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619909]  [<ffffffffa0cb78f0>] ? lfsck_master_engine+0x0/0x650 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619915]  [<ffffffff810534c4>] kthread+0x7b/0x83
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619918]  [<ffffffff810369d3>] ? finish_task_switch+0x48/0xb9
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619924]  [<ffffffff8101092a>] child_rip+0xa/0x20
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619928]  [<ffffffff81053449>] ? kthread+0x0/0x83
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619931]  [<ffffffff81010920>] ? child_rip+0x0/0x20
>>>>
>>>> I originally had the LFSCK launched in "dry-run" mode:
>>>>
>>>>   lctl lfsck_start --device lustre-1-MDT0000 --dryrun on --type namespace
>>>>
>>>> The LFSCK was reported completed (I was 'watch[ing] -n 1' on a
>>>> terminal) before the LBUG popped up; now, I can't even get any output:
>>>>
>>>>   cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace  # just hangs there indefinitely
>>>>
>>>> I remember seeing a lfsck_namespace file in the MDT's underlying
>>>> LDISKFS; is there anything sensible I can do with it (e.g. would
>>>> deleting it solve the situation)?
>>>> What else could I do?
>>>>
>>>> Thanks for your answers and best regards,
>>>>
>>>> Cédric D.
>>>>
>>>> PS: I had this message originally posted on the HPDD-discuss mailing
>>>> list and just realized it was the wrong place; sorry for any
>>>> cross-posting.
>>>> _______________________________________________
>>>> lustre-discuss mailing list
>>>> lustre-discuss@lists.lustre.org
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org