Wondering if others may have seen something or know of a remedy.
Late last week we had a room lose power, so the filesystem took a hard
crash. When power was restored it looked like the JBODs made it through, and
all of the LUNs appeared to be healthy (after a little bit of rebuilding).
The servers were also able to see the LUNs successfully, so things looked to
be going better than anticipated.
The system (both server and clients) is CentOS 7.9 with Lustre 2.12.7.
Bringing up the filesystem is when things went sideways. The MGT mounted
with no issue (the standard recovery messages), and the MDT mounted as well.
While we were mounting the OSTs, the MDS suddenly rebooted with a kernel
panic. Looking at dmesg (after it was brought back up) we found the
following message:
[ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread())
lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0] record
for index 0/2
[ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1
previous similar message
[ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread())
lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues failed:
-22
[ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) Skipped
1 previous similar message
After a few attempts (hoping it was a fluke) the same message would lead to
an assert; we noticed this occurred with two specific OSTs. Leaving those
two OSTs down we were able to bring up the rest of the filesystem
successfully, but whenever either of them is mounted something is triggered
and the MDT crashes. On the OSS side there are no messages other than
losing the connection to the MGS (due to the crash).
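For anyone triaging the same thing, the affected targets and llog FIDs can be
pulled out of dmesg mechanically. A minimal sketch (the here-doc holds sample
lines copied from the log below; on a live system you would pipe `dmesg` in
instead):

```shell
# List the OSP target and llog FID named in each "invalid length" error.
# The here-doc holds two sample lines from the dmesg output below;
# in practice, replace it with:  dmesg | sed -n '...'
sed -n 's/.*\(lustre01-OST[0-9a-f]*\)-osc-MDT0000: invalid length [0-9]* in llog \(\[[^]]*\]\).*/\1 \2/p' <<'EOF'
[ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread()) lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0] record for index 0/2
[ 7201.690060] Lustre: 79892:0:(llog.c:615:llog_process_thread()) lustre01-OST0033-osc-MDT0000: invalid length 0 in llog [0x52ad:0x1:0x0] record for index 0/1
EOF
# prints:
#   lustre01-OST0032 [0x52ab:0x1:0x0]
#   lustre01-OST0033 [0x52ad:0x1:0x0]
```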
We've tried clearing the updatelog and changelog with no change in behavior.
So, any other ideas would be appreciated.
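In case someone suggests inspecting or removing the bad llog by stopping the
MDT and mounting it read-only as ldiskfs: I'm assuming the usual local object
layout of O/<seq>/d<oid mod 32>/<oid> (directory names in decimal), in which
case the path can be computed from the FID in the error. Both the layout and
the resulting path are assumptions to verify with llog_reader/debugfs before
touching anything:

```shell
# Compute the likely on-disk path of an llog object from its FID.
# ASSUMPTION: local llog objects live at O/<seq>/d<oid % 32>/<oid>
# with decimal directory names -- verify with debugfs/llog_reader first.
fid='[0x52ab:0x1:0x0]'                 # FID from the first error above
seq=${fid#\[}; seq=${seq%%:*}          # -> 0x52ab
oid=${fid#*:}; oid=${oid%%:*}          # -> 0x1
printf 'O/%d/d%d/%d\n' "$((seq))" "$((oid % 32))" "$((oid))"
# prints: O/21163/d1/1
```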
Below is the full dmesg from the start of mounting the MGT:
[ 4881.624345] LDISKFS-fs (scinia): mounted filesystem with ordered data mode.
Opts: (null)
[ 6844.490777] LDISKFS-fs (scinib): mounted filesystem with ordered data mode.
Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 6845.003014] Lustre: MGS: Connection restored to MGC192.168.240.7@tcp1_0 (at
0@lo)
[ 6845.003021] Lustre: Skipped 1 previous similar message
[ 6853.385804] Lustre: MGS: Connection restored to
b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.9.30@tcp4)
[ 6865.882492] LDISKFS-fs (scinia): mounted filesystem with ordered data mode.
Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread())
lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0] record
for index 0/2
[ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1
previous similar message
[ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread())
lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues failed:
-22
[ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) Skipped
1 previous similar message
[ 6867.221923] Lustre: lustre01-MDT0000: Imperative Recovery not enabled,
recovery window 300-900
[ 6867.234362] Lustre: lustre01-MDT0000: in recovery but waiting for the first
client to connect
[ 6872.207528] Lustre: lustre01-MDT0000: Connection restored to
MGC192.168.240.7@tcp1_0 (at 0@lo)
[ 6872.207536] Lustre: Skipped 1 previous similar message
[ 6902.340582] Lustre: lustre01-MDT0000: Will be in recovery for at least 5:00,
or until 7 clients reconnect
[ 6908.270425] Lustre: lustre01-MDT0000: Connection restored to
b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.249.30@tcp2)
[ 6908.270429] Lustre: Skipped 4 previous similar messages
[ 6908.446460] Lustre: lustre01-MDT0000: Recovery over after 0:06, of 7 clients
7 recovered and 0 were evicted.
[ 6977.979707] perf: interrupt took too long (2501 > 2500), lowering
kernel.perf_event_max_sample_rate to 79000
[ 6984.953509] Lustre: MGS: Connection restored to
83b8e6bc-7407-a532-4b8e-0ae1a4982885 (at 192.168.240.8@tcp1)
[ 6984.953517] Lustre: Skipped 2 previous similar messages
[ 7115.328345] Lustre: MGS: Connection restored to
1ad84e77-29b8-8d86-73e4-7dcd263c303b (at 192.168.240.9@tcp1)
[ 7115.328352] Lustre: Skipped 16 previous similar messages
[ 7201.690060] Lustre: 79892:0:(llog.c:615:llog_process_thread())
lustre01-OST0033-osc-MDT0000: invalid length 0 in llog [0x52ad:0x1:0x0] record
for index 0/1
[ 7201.690069] Lustre: 79892:0:(llog.c:615:llog_process_thread()) Skipped 1
previous similar message
[ 7201.690086] LustreError: 79892:0:(osp_sync.c:1272:osp_sync_thread())
lustre01-OST0033-osc-MDT0000: llog process with osp_sync_process_queues failed:
-22
[ 7201.695902] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread())
ASSERTION( atomic_read(&d->opd_sync_rpcs_in_progress) == 0 ) failed:
lustre01-OST0033-osc-MDT0000: 1 0 !empty
[ 7201.701242] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread()) LBUG
[ 7201.703862] Pid: 79892, comm: osp-syn-51-0 3.10.0-1160.21.1.el7.x86_64 #1
SMP Tue Mar 16 18:28:22 UTC 2021
[ 7201.703865] Call Trace:
[ 7201.703877] [<ffffffffc0f007cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 7201.703896] [<ffffffffc0f0087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 7201.703909] [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]
[ 7201.703926] [<ffffffffbd2c5da1>] kthread+0xd1/0xe0
[ 7201.703937] [<ffffffffbd995df7>] ret_from_fork_nospec_end+0x0/0x39
[ 7201.703945] [<ffffffffffffffff>] 0xffffffffffffffff
[ 7201.703984] Kernel panic - not syncing: LBUG
[ 7201.706561] CPU: 37 PID: 79892 Comm: osp-syn-51-0 Kdump: loaded Tainted: P
OE ------------ 3.10.0-1160.21.1.el7.x86_64 #1
[ 7201.711716] Hardware name: Dell Inc. VxFlex integrated rack R640 S/0H28RR,
BIOS 2.9.4 11/06/2020
[ 7201.714311] Call Trace:
[ 7201.716865] [<ffffffffbd98305a>] dump_stack+0x19/0x1b
[ 7201.719418] [<ffffffffbd97c5b2>] panic+0xe8/0x21f
[ 7201.721938] [<ffffffffc0f008cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 7201.724425] [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]
[ 7201.726873] [<ffffffffbd98899f>] ? __schedule+0x3af/0x860
[ 7201.729286] [<ffffffffc1c8da50>] ? osp_sync_process_committed+0x700/0x700
[osp]
[ 7201.731672] [<ffffffffbd2c5da1>] kthread+0xd1/0xe0
[ 7201.734016] [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40
[ 7201.736329] [<ffffffffbd995df7>] ret_from_fork_nospec_begin+0x21/0x21
[ 7201.738617] [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org