I tried running tunefs.lustre --erase-params --writeconf on the targets. I
guess that is not ideal because the clients were not unmounted, but I made
sure they were not trying to connect.
This made it possible to mount the MDT, but when the first OST mount
starts, the MDT logs a lot of errors. After starting the second OST, the
MDS crashes (syslog attached).
Cheers,
Hans Henrik
On 10.03.2022 15.48, Hans Henrik Happe via lustre-discuss wrote:
Sorry for all the mail load, but I hope this info can help in figuring
out what's wrong and in determining whether this was caused by a bug. I
think I read the CONFIGS on the MDT with llog_reader. See attachments.
Cheers,
Hans Henrik
On 10.03.2022 12.23, Hans Henrik Happe via lustre-discuss wrote:
After upgrading to Lustre 2.12.8 I found that the first mount after a
reboot behaves differently:
Mounting mds02/astro0 on /mnt/lustre/local/astro-MDT0000
mount.lustre: mount mds02/astro0 at /mnt/lustre/local/astro-MDT0000
failed: No space left on device
And a different syslog output (attached syslog-0).
Doing the mount again has this error:
Mounting mds02/astro0 on /mnt/lustre/local/astro-MDT0000
mount.lustre: mount mds02/astro0 at /mnt/lustre/local/astro-MDT0000
failed: File exists
And a syslog like the one first posted. Attached the new output in
syslog-1.
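As a side note, the two mount failures above ("No space left on device" and "File exists") are standard Linux errno values, and the kernel log reports the same kind of error as a negative return code (for example rc = -28). A minimal Python sketch for decoding such codes, assuming a Linux host; the helper name decode_rc is made up for illustration and is not part of any Lustre tool:

```python
import errno
import os


def decode_rc(rc):
    """Return (symbolic name, message) for a kernel-style return code.

    Kernel messages print errors as negative errno values, so the sign
    is dropped before the lookup. decode_rc is a hypothetical helper.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# -28 and -17 are the codes behind the two mount failures quoted above.
for rc in (-28, -17):
    name, message = decode_rc(rc)
    print(f"rc = {rc}: {name} ({message})")
```

This only translates the numbers; it says nothing about why the MDT returned them.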
Finally, stopping Lustre (only the MGS in this case) and the lnet service
does not free all resources, which makes lustre_rmmod fail:
# lustre_rmmod
rmmod: ERROR: Module osp is in use
Cheers,
Hans Henrik
On 10.03.2022 11.15, Hans Henrik Happe via lustre-discuss wrote:
Forgot to say this is Lustre 2.12.6 and CentOS 7.9
(3.10.0-1160.6.1.el7.x86_64).
On 10.03.2022 10.27, Hans Henrik Happe via lustre-discuss wrote:
Hi,
A reboot of the MDS stalled and the machine was forcibly reset. After
that the MDS would not start. The syslog is attached.
I'm not sure what the "class_register_device())
astro-OST0002-osc-MDT0000" part is supposed to do but astro-OST0002
is not mounted at this time. I guess this comes from the MGS.
Cheers,
Hans Henrik
Mar 11 12:42:04 mds02 kernel: Lustre: MGS: Logs for fs astro were removed by
user request. All servers must be restarted in order to regenerate the logs:
rc = 0
Mar 11 12:42:04 mds02 kernel: Lustre: astro-MDT0000: nosquash_nids set to
172.20.1.10@tcp1
Mar 11 12:42:04 mds02 kernel: Lustre: astro-MDT0000: Imperative Recovery not
enabled, recovery window 300-900
Mar 11 12:42:29 mds02 kernel: Lustre: astro-MDT0000: Connection restored to
0d2c198e-514c-3ae5-fc31-48e0424f131d (at 0@lo)
Mar 11 12:42:46 mds02 systemd: Started Session c4 of user root.
Mar 11 12:42:51 mds02 kernel: Lustre: MGS: Connection restored to
b11aa8af-1dd3-d728-0e81-6f595456b689 (at 10.21.10.114@o2ib)
Mar 11 12:42:51 mds02 kernel: Lustre: MGS: Regenerating astro-OST0000 log by
user request: rc = 0
Mar 11 12:42:58 mds02 kernel: Lustre:
10971:0:(llog_cat.c:93:llog_cat_new_log()) astro-OST0000-osc-MDT0000: there are
no more free slots in catalog [0x186:0x1:0x0]:0
Mar 11 12:42:58 mds02 kernel: LustreError:
10971:0:(osp_sync.c:1524:osp_sync_init()) astro-OST0000-osc-MDT0000: can't
initialize llog: rc = -28
Mar 11 12:42:58 mds02 kernel: LustreError:
10971:0:(obd_config.c:559:class_setup()) setup astro-OST0000-osc-MDT0000 failed
(-28)
Mar 11 12:42:58 mds02 kernel: LustreError:
10971:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.21.10.102@o2ib:
cfg command failed: rc = -28
Mar 11 12:42:58 mds02 kernel: Lustre: cmd=cf003 0:astro-OST0000-osc-MDT0000
1:astro-OST0000_UUID 2:10.21.10.114@o2ib
Mar 11 12:42:58 mds02 kernel: LustreError:
9282:0:(mgc_request.c:599:do_requeue()) failed processing log: -28
Mar 11 12:44:16 mds02 kernel: Lustre: MGS: Connection restored to
9842fe3a-0ff5-afc6-292f-cff60a4897ba (at 10.21.10.115@o2ib)
Mar 11 12:44:16 mds02 kernel: Lustre: Skipped 1 previous similar message
Mar 11 12:44:16 mds02 kernel: Lustre: MGS: Regenerating astro-OST0001 log by
user request: rc = 0
Mar 11 12:44:25 mds02 kernel: LustreError:
11466:0:(obd_config.c:764:class_add_conn()) try to add conn on immature client
dev
Message from syslogd@mds02 at Mar 11 12:44:25 ...
kernel:LustreError: 11466:0:(lod_lov.c:244:lod_add_device()) ASSERTION(
obd->obd_lu_dev->ld_site == lod->lod_dt_dev.dd_lu_dev.ld_site ) failed:
Mar 11 12:44:25 mds02 kernel: LustreError:
11466:0:(lod_lov.c:244:lod_add_device()) ASSERTION( obd->obd_lu_dev->ld_site ==
lod->lod_dt_dev.dd_lu_dev.ld_site ) failed:
Message from syslogd@mds02 at Mar 11 12:44:25 ...
kernel:LustreError: 11466:0:(lod_lov.c:244:lod_add_device()) LBUG
Mar 11 12:44:25 mds02 kernel: LustreError:
11466:0:(lod_lov.c:244:lod_add_device()) LBUG
Mar 11 12:44:25 mds02 kernel: Pid: 11466, comm: llog_process_th
3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021
Mar 11 12:44:25 mds02 kernel: Call Trace:
Mar 11 12:44:25 mds02 kernel: [<ffffffffc095a7cc>] libcfs_call_trace+0x8c/0xc0
[libcfs]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc095a87c>] lbug_with_loc+0x4c/0xa0
[libcfs]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc0ec0f1a>] lod_add_device+0x195a/0x19a0
[lod]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc0ebb895>]
lod_process_config+0x13b5/0x1510 [lod]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc13eeaf2>]
class_process_config+0x2142/0x2830 [obdclass]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc13f0db9>]
class_config_llog_handler+0x819/0x1520 [obdclass]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc13b37d4>]
llog_process_thread+0x8e4/0x19c0 [obdclass]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc13b52c4>]
llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
Mar 11 12:44:25 mds02 kernel: [<ffffffff820c5e61>] kthread+0xd1/0xe0
Mar 11 12:44:25 mds02 kernel: [<ffffffff82795ddd>]
ret_from_fork_nospec_begin+0x7/0x21
Mar 11 12:44:25 mds02 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
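For anyone going through the attached syslogs, the return codes in the LustreError records can be tallied with a short script. The following is a rough Python sketch under a few assumptions: the records are wrapped across lines as in the excerpt above, a new record starts with a syslog timestamp or a "Message from syslogd@" banner, and the return code is the last token of the record. The function names and the SAMPLE list are illustrative only.

```python
import re
from collections import Counter

# A new record starts with a timestamp such as "Mar 11 12:42:58 " or with
# a console broadcast banner ("Message from syslogd@...").
RECORD_START = re.compile(
    r"^(?:[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} |Message from syslogd@)"
)
# Negative return codes as they appear at the end of a record in the
# excerpt: "rc = -28", "(-28)" and ": -28".
RETURN_CODE = re.compile(r"(?:rc = |\(|: )(-\d+)\)?\s*$")


def join_wrapped(lines):
    """Merge mail-wrapped continuation lines back into whole records."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def count_return_codes(lines):
    """Count trailing negative return codes in LustreError records."""
    counts = Counter()
    for record in join_wrapped(lines):
        if "LustreError" not in record:
            continue
        match = RETURN_CODE.search(record)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Three wrapped records copied from the excerpt above.
SAMPLE = [
    "Mar 11 12:42:58 mds02 kernel: LustreError:",
    "10971:0:(osp_sync.c:1524:osp_sync_init()) astro-OST0000-osc-MDT0000: can't",
    "initialize llog: rc = -28",
    "Mar 11 12:42:58 mds02 kernel: LustreError:",
    "10971:0:(obd_config.c:559:class_setup()) setup astro-OST0000-osc-MDT0000 failed",
    "(-28)",
    "Mar 11 12:42:58 mds02 kernel: LustreError:",
    "9282:0:(mgc_request.c:599:do_requeue()) failed processing log: -28",
]

print(count_return_codes(SAMPLE))
```

Records without a trailing code, such as the ASSERTION and LBUG lines, are simply skipped by this sketch.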