Hello Nasf,

Below is the full output of LFSCK start/stop, using the '--dryrun' option and its '-n' shortcut:
root@n00b:~ # lctl lfsck_start --device lustre-2-MDT0000 --dryrun on --type namespace
Started LFSCK on the device lustre-2-MDT0000: namespace.

# 2016-09-19 14:47:14
root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937135 seconds
time_since_latest_start: 11 seconds
time_since_last_checkpoint: N/A
latest_start_position: 16, N/A, N/A
last_checkpoint_position: N/A, N/A, N/A
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 147222
checked_phase2: 0
updated_phase1: 1585
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 2923
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 11 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 13383 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 12842 items/sec
real-time_speed_phase2: N/A
current_position: 81972, [0x200014ba1:0xc1a:0x0], 3037912543715234557

# 2016-09-19 14:47:25
root@n00b:~ # lctl lfsck_stop --device lustre-2-MDT0000
Stopped LFSCK on the device lustre-2-MDT0000.
# 2016-09-19 14:47:34
root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: stopped
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937147 seconds
time_since_latest_start: 23 seconds
time_since_last_checkpoint: 3 seconds
latest_start_position: 16, N/A, N/A
last_checkpoint_position: 562059, [0x2000014f0:0x12ea5:0x0], 955442147668871112
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 305529
checked_phase2: 0
updated_phase1: 9172
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5339
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 20 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 15276 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A

# 2016-09-19 14:47:37
root@n00b:~ # lctl lfsck_start --device lustre-2-MDT0000 -n on --type namespace
Started LFSCK on the device lustre-2-MDT0000: namespace.
# 2016-09-19 14:47:50
root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937163 seconds
time_since_latest_start: 3 seconds
time_since_last_checkpoint: 19 seconds
latest_start_position: 562060, [0x2000014f0:0x12ea5:0x0], 955476103740388659
last_checkpoint_position: 562059, [0x2000014f0:0x12ea5:0x0], 955442147668871112
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 419903
checked_phase2: 0
updated_phase1: 59322
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5506
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 23 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 18256 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 44125 items/sec
real-time_speed_phase2: N/A
current_position: 622857, N/A, N/A

# 2016-09-19 14:47:53
root@n00b:~ # lctl lfsck_stop --device lustre-2-MDT0000
Stopped LFSCK on the device lustre-2-MDT0000.
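As an aside, I was polling the proc file by hand between these commands; the same check can be scripted. A minimal sketch (nothing Lustre-specific, just text parsing of the "status:" field as it appears in the dumps above; here fed from a saved sample instead of the live proc file):

```shell
# Saved sample of the proc output above (only the fields we need); in real
# use one would read /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace.
sample='name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: stopped
flags: inconsistent,upgrade'

# Extract the status field; a polling loop would repeat this until the
# status leaves the "scanning-phase*" states.
status=$(printf '%s\n' "$sample" | awk '/^status:/ { print $2 }')
echo "$status"
```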
# 2016-09-19 14:47:56
root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: stopped
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937169 seconds
time_since_latest_start: 9 seconds
time_since_last_checkpoint: 3 seconds
latest_start_position: 562060, [0x2000014f0:0x12ea5:0x0], 955476103740388659
last_checkpoint_position: 646716, N/A, N/A
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 456954
checked_phase2: 0
updated_phase1: 59322
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5510
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 26 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 17575 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A

# 2016-09-19 14:47:59

Given my previous mileage, I'd rather not let LFSCK run to completion again.

Best,

Cédric


On 18/09/16 03:05, Yong, Fan wrote:
> To be clear, what are the console and proc output if "-n on" is specified
> when starting the namespace LFSCK?
>
> --
> Regards,
> Nasf
>
> -----Original Message-----
> From: lustre-discuss [mailto:[email protected]] On
> Behalf Of Cédric Dufour - Idiap Research Institute
> Sent: Thursday, September 15, 2016 10:09 PM
> To: [email protected]
> Subject: Re: [lustre-discuss] (LFSCK) LBUG: ASSERTION(
> get_current()->journal_info == ((void *)0) ) failed - (ungracefully) SOLVED
>
> Hello,
>
> After looking at the LFSCK source code - specifically
> lustre/lfsck/lfsck_namespace.c - I was led to believe that removing the
> "lfsck_namespace" file from the MDT's underlying LDISKFS should be safe
> enough (NB: I had a backup of the MDT on a disconnected DRBD peer for the
> worst-case scenario):
>
> # mount -t ldiskfs -o rw /dev/mdt /mnt/mdt.ldiskfs
> # rm -f /mnt/mdt.ldiskfs/lfsck_namespace
> # umount /mnt/mdt.ldiskfs
>
> This allowed me to start the MDT again.
> The "lfsck_namespace" file reappeared with a "blank" payload:
>
> # mount.lustre -o rw /dev/mdt /lustre/mdt
>
> # debugfs -c -R 'stat lfsck_namespace' /dev/drbd3 2>/dev/null
> Inode: 156   Type: regular   Mode: 0644   Flags: 0x0
> Generation: 1420451772   Version: 0x00000000:00000000
> User: 0   Group: 0   Size: 8192
> File ACL: 0   Directory ACL: 0
> Links: 1   Blockcount: 16
> Fragment: Address: 0   Number: 0   Size: 0
> ctime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
> atime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
> mtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
> crtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
> Size of extra inode fields: 28
> Extended attributes stored in inode body:
>   lma = "00 00 00 00 00 00 00 00 03 00 00 00 02 00 00 00 0a 00 00 00 00 00 00 00 " (24)
>   lma: fid=[0x200000003:0xa:0x0] compat=0 incompat=0
>   lfsck_namespace = "03 9d 62 a0 00 00 00 00 [... all remaining bytes 00 ...] 00 " (256)
> BLOCKS:
> (0-1):178240-178241
> TOTAL: 2
>
> # cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace
> name: lfsck_namespace
> magic: 0xa0629d03
> version: 2
> status: init
> flags:
> param: <NULL>
> time_since_last_completed: N/A
> time_since_latest_start: N/A
> time_since_last_checkpoint: N/A
> latest_start_position: N/A, N/A, N/A
> last_checkpoint_position: N/A, N/A, N/A
> first_failure_position: N/A, N/A, N/A
> checked_phase1: 0
> checked_phase2: 0
> updated_phase1: 0
> updated_phase2: 0
> failed_phase1: 0
> failed_phase2: 0
> dirs: 0
> M-linked: 0
> nlinks_repaired: 0
> lost_found: 0
> success_count: 0
> run_time_phase1: 0 seconds
> run_time_phase2: 0 seconds
> average_speed_phase1: 0 items/sec
> average_speed_phase2: 0 objs/sec
> real-time_speed_phase1: N/A
> real-time_speed_phase2: N/A
> current_position: N/A
>
> Rather ungraceful - I don't know where that would have led me if I hadn't
> used the "--dryrun on" flag on the LFSCK command (or was that flag actually
> ignored? for it did not appear as a "param: ..." entry in the
> lfsck_namespace output while LFSCK was running) - but it got me out of the
> present deadlock.
>
> Forgot to mention: (still...) running Lustre 2.5.2
>
> Best,
>
> Cédric
>
>
> On 15/09/16 08:05, Cédric Dufour - Idiap Research Institute wrote:
>> Hello,
>>
>> On 14/09/16 20:58, Bernd Schubert wrote:
>>> Hi Cédric,
>>>
>>> I'm by no means familiar with Lustre code anymore but, based on the
>>> stack trace and function names, it seems to be a problem with the
>>> journal. Maybe try an 'e2fsck -f', which would replay the journal
>>> and possibly clean up the file it has problems with.
>> Thanks for the tip.
>>
>> Unfortunately, I did perform a filesystem check as part of my recovery
>> attempts (and even ran a dry-run afterwards, to make sure no errors were
>> left dangling).
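Side note: the raw lfsck_namespace xattr quoted above begins with the bytes 03 9d 62 a0, which is simply the magic 0xa0629d03 from the proc output, stored least-significant byte first. A quick arithmetic check (the variable names are mine, just for illustration):

```shell
# First four bytes of the lfsck_namespace xattr as shown by debugfs above.
b0=0x03 b1=0x9d b2=0x62 b3=0xa0

# Reassemble them little-endian; the result should match the "magic:" line
# in the proc output.
magic=$(( (b3 << 24) | (b2 << 16) | (b1 << 8) | b0 ))
printf '0x%08x\n' "$magic"
```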
>> Cédric
>>
>>> Cheers,
>>> Bernd
>>>
>>> On Wednesday, September 14, 2016 9:28:38 AM CEST Cédric Dufour -
>>> Idiap Research Institute wrote:
>>>> Hello,
>>>>
>>>> Last Friday, during normal operations, our MDS froze with the
>>>> following LBUG, which happens again as soon as one mounts the MDT again:
>>>>
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.600584] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.612825] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) LBUG
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619833] Pid: 11696, comm: lfsck
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619835]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619835] Call Trace:
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619850] [<ffffffffa0224822>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619857] [<ffffffffa0224db2>] lbug_with_loc+0x42/0xa0 [libcfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619864] [<ffffffffa0b11890>] osd_trans_start+0x250/0x630 [osd_ldiskfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619870] [<ffffffffa0b0e748>] ? osd_declare_xattr_set+0x58/0x230 [osd_ldiskfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619876] [<ffffffffa0c6ffc7>] lod_trans_start+0x177/0x200 [lod]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619881] [<ffffffffa0cbd752>] lfsck_namespace_double_scan+0x1122/0x1e50 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619888] [<ffffffff8136741b>] ? thread_return+0x3e/0x10c
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619894] [<ffffffff81038b87>] ? enqueue_task_fair+0x58/0x5d
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619899] [<ffffffffa0cb68ea>] lfsck_double_scan+0x5a/0x70 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619904] [<ffffffffa0cb7dfd>] lfsck_master_engine+0x50d/0x650 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619909] [<ffffffffa0cb78f0>] ? lfsck_master_engine+0x0/0x650 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619915] [<ffffffff810534c4>] kthread+0x7b/0x83
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619918] [<ffffffff810369d3>] ? finish_task_switch+0x48/0xb9
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619924] [<ffffffff8101092a>] child_rip+0xa/0x20
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619928] [<ffffffff81053449>] ? kthread+0x0/0x83
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619931] [<ffffffff81010920>] ? child_rip+0x0/0x20
>>>>
>>>> I originally had the LFSCK launched in "dry-run" mode:
>>>>
>>>> lctl lfsck_start --device lustre-1-MDT0000 --dryrun on --type namespace
>>>>
>>>> The LFSCK was reported completed (I was watching with 'watch -n 1' on a
>>>> terminal) before the LBUG popped up; now, I can't even get any output:
>>>>
>>>> cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace  # just
>>>> hangs there indefinitely
>>>>
>>>> I remember seeing a lfsck_namespace file in the MDT's underlying
>>>> LDISKFS; is there anything sensible I can do with it (e.g. would
>>>> deleting it solve the situation)?
>>>> What else could I do?
>>>>
>>>> Thanks for your answers and best regards,
>>>>
>>>> Cédric D.
>>>>
>>>> PS: I had this message originally posted on the HPDD-discuss mailing
>>>> list and just realized it was the wrong place; sorry for any
>>>> cross-posting.
>>>> _______________________________________________
>>>> lustre-discuss mailing list
>>>> [email protected]
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
