> Rather ungraceful - I don't know where that would have led me if I hadn't used 
> the "--dryrun on" flag on the LFSCK command (or was that flag actually 
> ignored? for it did not appear as a "param: ..." in the lfsck_namespace 
> output when LFSCK was running) - but got me out of the present deadlock.
The "-n" option is NOT ignored, but unfortunately, such option was not properly 
displayed via the lproc interface. Such issue has been resolved via the patch 
http://review.whamcloud.com/#/c/7146/25/lustre/lfsck/lfsck_lib.c
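
With that patch applied, the dryrun parameter should also become visible in the 
proc output, along the lines of the following (illustrative only; the exact 
formatting may differ between versions):

# the "param:" line should then show the dryrun mode instead of <NULL>
cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace | grep param
param: dryrun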

> After looking at the LFSCK source code - specifically 
> lustre/lfsck/lfsck_namespace.c - I was led to believe that removing the 
> "lfsck_namespace" file in the MDT LDISKFS should be safe enough (NB: I had 
> a backup of the MDT on a disconnected DRBD peer for the worst case scenario):

You are right. It is safe to remove the "lfsck_namespace" file under "ldiskfs" 
mode. The side effects will be:
1) The namespace LFSCK will scan the system from the beginning when it is 
triggered again, not from the last checkpoint, since the related information 
has been removed (see also the note after this list).
2) One invalid OI mapping will be left in the ldiskfs OI file. Usually, nobody 
will access that OI mapping anymore.

> first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877

That means an issue (the first one) has been detected under "dryrun" mode. 
It is under the directory [0x135d:0x395d90f2:0x0]. You can get the path name 
via "lfs fid2path".

As for the original LBUG(), I cannot point out the root cause. But to be 
clear, LFSCK can only repair Lustre-level inconsistencies; that does NOT 
include backend on-disk consistency, which means it can NOT replace the 
local e2fsck.
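
For the backend, the standard tool is e2fsck against the unmounted MDT device, 
for example (a sketch; the device name is taken from your earlier commands, 
and the MDT must be stopped first):

# read-only check first: "-f" forces the check, "-n" answers "no" to all
# repair prompts, so nothing on disk is modified
e2fsck -fn /dev/mdt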

--
Cheers,
Nasf

-----Original Message-----
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Cédric Dufour - Idiap Research Institute
Sent: Monday, September 19, 2016 8:53 PM
To: Yong, Fan <fan.y...@intel.com>; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] (LFSCK) LBUG: ASSERTION( 
get_current()->journal_info == ((void *)0) ) failed - (ungracefully) SOLVED

Hello Nasf,

Below is the entire output of LFSCK start/stop, using the '--dryrun' option 
and its '-n' shortcut:

root@n00b:~ # lctl lfsck_start --device lustre-2-MDT0000 --dryrun on --type namespace
Started LFSCK on the device lustre-2-MDT0000: namespace.
# 2016-09-19 14:47:14

root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937135 seconds
time_since_latest_start: 11 seconds
time_since_last_checkpoint: N/A
latest_start_position: 16, N/A, N/A
last_checkpoint_position: N/A, N/A, N/A
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 147222
checked_phase2: 0
updated_phase1: 1585
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 2923
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 11 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 13383 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 12842 items/sec
real-time_speed_phase2: N/A
current_position: 81972, [0x200014ba1:0xc1a:0x0], 3037912543715234557
# 2016-09-19 14:47:25

root@n00b:~ # lctl lfsck_stop --device lustre-2-MDT0000
Stopped LFSCK on the device lustre-2-MDT0000.
# 2016-09-19 14:47:34

root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: stopped
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937147 seconds
time_since_latest_start: 23 seconds
time_since_last_checkpoint: 3 seconds
latest_start_position: 16, N/A, N/A
last_checkpoint_position: 562059, [0x2000014f0:0x12ea5:0x0], 955442147668871112
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 305529
checked_phase2: 0
updated_phase1: 9172
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5339
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 20 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 15276 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A
# 2016-09-19 14:47:37

root@n00b:~ # lctl lfsck_start --device lustre-2-MDT0000 -n on --type namespace 
Started LFSCK on the device lustre-2-MDT0000: namespace.
# 2016-09-19 14:47:50

root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937163 seconds
time_since_latest_start: 3 seconds
time_since_last_checkpoint: 19 seconds
latest_start_position: 562060, [0x2000014f0:0x12ea5:0x0], 955476103740388659
last_checkpoint_position: 562059, [0x2000014f0:0x12ea5:0x0], 955442147668871112
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 419903
checked_phase2: 0
updated_phase1: 59322
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5506
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 23 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 18256 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 44125 items/sec
real-time_speed_phase2: N/A
current_position: 622857, N/A, N/A
# 2016-09-19 14:47:53

root@n00b:~ # lctl lfsck_stop --device lustre-2-MDT0000
Stopped LFSCK on the device lustre-2-MDT0000.
# 2016-09-19 14:47:56

root@n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: stopped
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937169 seconds
time_since_latest_start: 9 seconds
time_since_last_checkpoint: 3 seconds
latest_start_position: 562060, [0x2000014f0:0x12ea5:0x0], 955476103740388659
last_checkpoint_position: 646716, N/A, N/A
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 456954
checked_phase2: 0
updated_phase1: 59322
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5510
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 26 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 17575 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A
# 2016-09-19 14:47:59


Given my previous mileage, I'd rather not let LFSCK run to completion again.

Best,

Cédric


On 18/09/16 03:05, Yong, Fan wrote:
> To be clear, what are the console and proc outputs if "-n on" is specified 
> when starting the namespace LFSCK?
>
> --
> Regards,
> Nasf
>
> -----Original Message-----
> From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] 
> On Behalf Of Cédric Dufour - Idiap Research Institute
> Sent: Thursday, September 15, 2016 10:09 PM
> To: lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] (LFSCK) LBUG: ASSERTION( 
> get_current()->journal_info == ((void *)0) ) failed - (ungracefully) 
> SOLVED
>
> Hello,
>
>
> After looking at the LFSCK source code - specifically 
> lustre/lfsck/lfsck_namespace.c - I was led to believe that removing the 
> "lfsck_namespace" file in the MDT LDISKFS should be safe enough (NB: I had 
> a backup of the MDT on a disconnected DRBD peer for the worst case scenario):
>
> # mount -t ldiskfs -o rw /dev/mdt /mnt/mdt.ldiskfs
> # rm -f /mnt/mdt.ldiskfs/lfsck_namespace
> # umount /mnt/mdt.ldiskfs
>
>
> This allowed the MDT to be started again. The "lfsck_namespace" file 
> reappeared with a "blank" payload:
>
> # mount.lustre -o rw /dev/mdt /lustre/mdt
>
> # debugfs -c -R 'stat lfsck_namespace' /dev/drbd3 2>/dev/null
> Inode: 156   Type: regular    Mode:  0644   Flags: 0x0
> Generation: 1420451772    Version: 0x00000000:00000000
> User:     0   Group:     0   Size: 8192
> File ACL: 0    Directory ACL: 0
> Links: 1   Blockcount: 16
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
>  atime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
>  mtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
> crtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
> Size of extra inode fields: 28
> Extended attributes stored in inode body:
>   lma = "00 00 00 00 00 00 00 00 03 00 00 00 02 00 00 00 0a 00 00 00 00 00 00 00 " (24)
>   lma: fid=[0x200000003:0xa:0x0] compat=0 incompat=0
>   lfsck_namespace = "03 9d 62 a0 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (256)
> BLOCKS:
> (0-1):178240-178241
> TOTAL: 2
>
> # cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace
> name: lfsck_namespace
> magic: 0xa0629d03
> version: 2
> status: init
> flags:
> param: <NULL>
> time_since_last_completed: N/A
> time_since_latest_start: N/A
> time_since_last_checkpoint: N/A
> latest_start_position: N/A, N/A, N/A
> last_checkpoint_position: N/A, N/A, N/A
> first_failure_position: N/A, N/A, N/A
> checked_phase1: 0
> checked_phase2: 0
> updated_phase1: 0
> updated_phase2: 0
> failed_phase1: 0
> failed_phase2: 0
> dirs: 0
> M-linked: 0
> nlinks_repaired: 0
> lost_found: 0
> success_count: 0
> run_time_phase1: 0 seconds
> run_time_phase2: 0 seconds
> average_speed_phase1: 0 items/sec
> average_speed_phase2: 0 objs/sec
> real-time_speed_phase1: N/A
> real-time_speed_phase2: N/A
> current_position: N/A
>
>
> Rather ungraceful - I don't know where that would have led me if I hadn't used 
> the "--dryrun on" flag on the LFSCK command (or was that flag actually 
> ignored? for it did not appear as a "param: ..." in the lfsck_namespace 
> output when LFSCK was running) - but got me out of the present deadlock.
>
> Forgot to mention: (still...) running Lustre 2.5.2
>
>
> Best,
>
>
> Cédric
>
>
>
> On 15/09/16 08:05, Cédric Dufour - Idiap Research Institute wrote:
>> Hello,
>>
>> On 14/09/16 20:58, Bernd Schubert wrote:
>>> Hi Cédric,
>>>
>>> I'm by no means familiar with Lustre code anymore, but based on the 
>>> stack trace and function names, it seems to be a problem with the 
>>> journal. Maybe try to do an 'e2fsck -f', which would replay the 
>>> journal and possibly clean up the file it has a problem with.
>> Thanks for the tip.
>>
>> Unfortunately, I had already performed a filesystem check as part of my 
>> recovery attempts (and even ran a dry run afterwards, to make sure no 
>> errors were left dangling).
>>
>> Cédric
>>
>>
>>> Cheers,
>>> Bernd
>>>
>>>
>>> On Wednesday, September 14, 2016 9:28:38 AM CEST Cédric Dufour - 
>>> Idiap Research Institute wrote:
>>>> Hello,
>>>>
>>>> Last Friday, during normal operations, our MDS froze with the 
>>>> following LBUG, which recurs as soon as the MDT is mounted again:
>>>>
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.600584] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.612825] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) LBUG
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619833] Pid: 11696, comm: lfsck
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619835]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619835] Call Trace:
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619850]  [<ffffffffa0224822>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619857]  [<ffffffffa0224db2>] lbug_with_loc+0x42/0xa0 [libcfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619864]  [<ffffffffa0b11890>] osd_trans_start+0x250/0x630 [osd_ldiskfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619870]  [<ffffffffa0b0e748>] ? osd_declare_xattr_set+0x58/0x230 [osd_ldiskfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619876]  [<ffffffffa0c6ffc7>] lod_trans_start+0x177/0x200 [lod]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619881]  [<ffffffffa0cbd752>] lfsck_namespace_double_scan+0x1122/0x1e50 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619888]  [<ffffffff8136741b>] ? thread_return+0x3e/0x10c
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619894]  [<ffffffff81038b87>] ? enqueue_task_fair+0x58/0x5d
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619899]  [<ffffffffa0cb68ea>] lfsck_double_scan+0x5a/0x70 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619904]  [<ffffffffa0cb7dfd>] lfsck_master_engine+0x50d/0x650 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619909]  [<ffffffffa0cb78f0>] ? lfsck_master_engine+0x0/0x650 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619915]  [<ffffffff810534c4>] kthread+0x7b/0x83
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619918]  [<ffffffff810369d3>] ? finish_task_switch+0x48/0xb9
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619924]  [<ffffffff8101092a>] child_rip+0xa/0x20
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619928]  [<ffffffff81053449>] ? kthread+0x0/0x83
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619931]  [<ffffffff81010920>] ? child_rip+0x0/0x20
>>>>
>>>>
>>>> I originally had the LFSCK launched in "dry-run" mode:
>>>>
>>>> lctl lfsck_start --device lustre-1-MDT0000 --dryrun on --type namespace
>>>>
>>>> The LFSCK was reported completed (I was 'watch[ing] -n 1' on a 
>>>> terminal) before the LBUG popped up; now, I can't even get any output:
>>>>
>>>> cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace  # just 
>>>> hangs there indefinitely
>>>>
>>>>
>>>> I remember seeing a lfsck_namespace file in the MDT's underlying 
>>>> LDISKFS; is there anything sensible I can do with it (e.g. would 
>>>> deleting it solve the situation)?
>>>> What else could I do ?
>>>>
>>>>
>>>> Thanks for your answers and best regards,
>>>>
>>>> Cédric D.
>>>>
>>>>
>>>> PS: I had originally posted this message on the HPDD-discuss mailing 
>>>> list and just realized it was the wrong place; sorry for any 
>>>> cross-posting.

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org