Wondering if others may have seen something or know of a remedy.
Late last week we had a room lose power, so the filesystem took a hard
crash. When power was restored it looked like the JBODs made it through, and
all of the LUNs appeared to be healthy (after a little bit of rebuilding).
The servers were also able to see the LUNs successfully, so things looked to
be going better than anticipated.
The system (both server and clients) is CentOS 7.9 with Lustre 2.12.7.
Bringing up the filesystem is when things went sideways. The MGT mounted
with no issue (the standard recovery messages), and the MDT mounted as well.
While we were mounting the OSTs, the MDS suddenly rebooted with a kernel
panic. Looking at dmesg (after it was brought back up) we found the
following message:
[ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread())
lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0] record
for index 0/2
[ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1
previous similar message
[ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread())
lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues failed:
-22
[ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) Skipped
1 previous similar message
After a few attempts (hoping it was a fluke) the same message would lead to
an assert; we noticed this occurred with two specific OSTs. Leaving those
two OSTs down we were able to bring up the rest of the filesystem
successfully, but whenever either of them is mounted something is triggered
and the MDT crashes. On the OSS side there are no messages other than
losing the connection to the MGS (due to the crash).
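For anyone triaging the same thing, the affected targets and llog FIDs can be
pulled out of dmesg mechanically. A minimal sketch (the here-doc holds sample
lines copied from the log below; on a live system you would pipe `dmesg` in
instead):

```shell
# List the OSP target and llog FID named in each "invalid length" error.
# The here-doc holds two sample lines from the dmesg output below;
# in practice, replace it with:  dmesg | sed -n '...'
sed -n 's/.*\(lustre01-OST[0-9a-f]*\)-osc-MDT0000: invalid length [0-9]* in llog \(\[[^]]*\]\).*/\1 \2/p' <<'EOF'
[ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread()) lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0] record for index 0/2
[ 7201.690060] Lustre: 79892:0:(llog.c:615:llog_process_thread()) lustre01-OST0033-osc-MDT0000: invalid length 0 in llog [0x52ad:0x1:0x0] record for index 0/1
EOF
# prints:
#   lustre01-OST0032 [0x52ab:0x1:0x0]
#   lustre01-OST0033 [0x52ad:0x1:0x0]
```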
We've tried clearing the updatelog and changelog with no change in behavior.
So, any other ideas would be appreciated.
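In case someone suggests inspecting or removing the bad llog by stopping the
MDT and mounting it read-only as ldiskfs: I'm assuming the usual local object
layout of O/<seq>/d<oid mod 32>/<oid> (directory names in decimal), in which
case the path can be computed from the FID in the error. Both the layout and
the resulting path are assumptions to verify with llog_reader/debugfs before
touching anything:

```shell
# Compute the likely on-disk path of an llog object from its FID.
# ASSUMPTION: local llog objects live at O/<seq>/d<oid % 32>/<oid>
# with decimal directory names -- verify with debugfs/llog_reader first.
fid='[0x52ab:0x1:0x0]'                 # FID from the first error above
seq=${fid#\[}; seq=${seq%%:*}          # -> 0x52ab
oid=${fid#*:}; oid=${oid%%:*}          # -> 0x1
printf 'O/%d/d%d/%d\n' "$((seq))" "$((oid % 32))" "$((oid))"
# prints: O/21163/d1/1
```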
Below is the full dmesg from the start of mounting the MGT:
[ 4881.624345] LDISKFS-fs (scinia): mounted filesystem with ordered data mode.
Opts: (null)
[ 6844.490777] LDISKFS-fs (scinib): mounted filesystem with ordered data mode.
Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 6845.003014] Lustre: MGS: Connection restored to MGC192.168.240.7@tcp1_0 (at
0@lo)
[ 6845.003021] Lustre: Skipped 1 previous similar message
[ 6853.385804] Lustre: MGS: Connection restored to
b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.9.30@tcp4)
[ 6865.882492] LDISKFS-fs (scinia): mounted filesystem with ordered data mode.
Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread())
lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0] record
for index 0/2
[ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1
previous similar message
[ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread())
lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues failed:
-22
[ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) Skipped
1 previous similar message
[ 6867.221923] Lustre: lustre01-MDT0000: Imperative Recovery not enabled,
recovery window 300-900
[ 6867.234362] Lustre: lustre01-MDT0000: in recovery but waiting for the first
client to connect
[ 6872.207528] Lustre: lustre01-MDT0000: Connection restored to
MGC192.168.240.7@tcp1_0 (at 0@lo)
[ 6872.207536] Lustre: Skipped 1 previous similar message
[ 6902.340582] Lustre: lustre01-MDT0000: Will be in recovery for at least 5:00,
or until 7 clients reconnect
[ 6908.270425] Lustre: lustre01-MDT0000: Connection restored to
b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.249.30@tcp2)
[ 6908.270429] Lustre: Skipped 4 previous similar messages
[ 6908.446460] Lustre: lustre01-MDT0000: Recovery over after 0:06, of 7 clients
7 recovered and 0 were evicted.
[ 6977.979707] perf: interrupt took too long (2501 > 2500), lowering
kernel.perf_event_max_sample_rate to 79000
[ 6984.953509] Lustre: MGS: Connection restored to
83b8e6bc-7407-a532-4b8e-0ae1a4982885 (at 192.168.240.8@tcp1)
[ 6984.953517] Lustre: Skipped 2 previous similar messages
[ 7115.328345] Lustre: MGS: Connection restored to
1ad84e77-29b8-8d86-73e4-7dcd263c303b (at 192.168.240.9@tcp1)
[ 7115.328352] Lustre: Skipped 16 previous similar messages
[ 7201.690060] Lustre: 79892:0:(llog.c:615:llog_process_thread())
lustre01-OST0033-osc-MDT0000: invalid length 0 in llog [0x52ad:0x1:0x0] record
for index 0/1
[ 7201.690069] Lustre: 79892:0:(llog.c:615:llog_process_thread()) Skipped 1
previous similar message
[ 7201.690086] LustreError: 79892:0:(osp_sync.c:1272:osp_sync_thread())
lustre01-OST0033-osc-MDT0000: llog process with osp_sync_process_queues failed:
-22
[ 7201.695902] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread())
ASSERTION( atomic_read(&d->opd_sync_rpcs_in_progress) == 0 ) failed:
lustre01-OST0033-osc-MDT0000: 1 0 !empty
[ 7201.701242] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread()) LBUG
[ 7201.703862] Pid: 79892, comm: osp-syn-51-0 3.10.0-1160.21.1.el7.x86_64 #1
SMP Tue Mar 16 18:28:22 UTC 2021
[ 7201.703865] Call Trace:
[ 7201.703877] [<ffffffffc0f007cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 7201.703896] [<ffffffffc0f0087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 7201.703909] [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]
[ 7201.703926] [<ffffffffbd2c5da1>] kthread+0xd1/0xe0
[ 7201.703937] [<ffffffffbd995df7>] ret_from_fork_nospec_end+0x0/0x39
[ 7201.703945] [<ffffffffffffffff>] 0xffffffffffffffff
[ 7201.703984] Kernel panic - not syncing: LBUG
[ 7201.706561] CPU: 37 PID: 79892 Comm: osp-syn-51-0 Kdump: loaded Tainted: P
OE ------------ 3.10.0-1160.21.1.el7.x86_64 #1
[ 7201.711716] Hardware name: Dell Inc. VxFlex integrated rack R640 S/0H28RR,
BIOS 2.9.4 11/06/2020
[ 7201.714311] Call Trace:
[ 7201.716865] [<ffffffffbd98305a>] dump_stack+0x19/0x1b
[ 7201.719418] [<ffffffffbd97c5b2>] panic+0xe8/0x21f
[ 7201.721938] [<ffffffffc0f008cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 7201.724425] [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]
[ 7201.726873] [<ffffffffbd98899f>] ? __schedule+0x3af/0x860
[ 7201.729286] [<ffffffffc1c8da50>] ? osp_sync_process_committed+0x700/0x700
[osp]
[ 7201.731672] [<ffffffffbd2c5da1>] kthread+0xd1/0xe0
[ 7201.734016] [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40
[ 7201.736329] [<ffffffffbd995df7>] ret_from_fork_nospec_begin+0x21/0x21
[ 7201.738617] [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org