I tried running tunefs.lustre --erase-params --writeconf on the targets. I guess that is not ideal because the clients were not unmounted, but I made sure they were not trying to connect.

This made it possible to mount the MDT, but when the first OST mount starts, the MDT logs a lot of errors. After starting the second OST, the MDS crashes (syslog attached).

Cheers,
Hans Henrik

On 10.03.2022 15.48, Hans Henrik Happe via lustre-discuss wrote:
Sorry for all the mail, but I hope this info can help figure out what's wrong and determine whether this was caused by a bug.

I read the CONFIGS on the MDT with llog_reader. See attachments.

Cheers,
Hans Henrik

On 10.03.2022 12.23, Hans Henrik Happe via lustre-discuss wrote:
After upgrading to Lustre 2.12.8 I found that the first mount after a reboot behaves differently:

Mounting mds02/astro0 on /mnt/lustre/local/astro-MDT0000
mount.lustre: mount mds02/astro0 at /mnt/lustre/local/astro-MDT0000 failed: No space left on device

And a different syslog output (attached syslog-0).

Doing the mount again has this error:

Mounting mds02/astro0 on /mnt/lustre/local/astro-MDT0000
mount.lustre: mount mds02/astro0 at /mnt/lustre/local/astro-MDT0000 failed: File exists

And a syslog like the one first posted. Attached the new output in syslog-1.
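
For reference, the two mount failures above are standard Linux errno messages, and the negative rc values in Lustre kernel log lines are the same codes negated. The sketch below only illustrates that mapping; the helper name decode_rc is made up for this example and is not part of Lustre.

```python
import errno
import os


def decode_rc(rc):
    """Return (symbolic name, message) for a kernel-style return code such as -28."""
    code = abs(rc)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# -28 and -17 are the codes behind the two mount errors quoted above.
for rc in (-28, -17):
    name, message = decode_rc(rc)
    print(f"rc = {rc}: {name} ({message})")
```

On Linux this reports ENOSPC for -28 and EEXIST for -17, which matches the "No space left on device" and "File exists" messages from mount.lustre.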

Finally, stopping Lustre (only the MGS in this case) and the lnet service does not free all resources, which makes lustre_rmmod fail:

# lustre_rmmod
rmmod: ERROR: Module osp is in use


Cheers,
Hans Henrik

On 10.03.2022 11.15, Hans Henrik Happe via lustre-discuss wrote:
Forgot to say this is Lustre 2.12.6 and CentOS 7.9 (3.10.0-1160.6.1.el7.x86_64).

On 10.03.2022 10.27, Hans Henrik Happe via lustre-discuss wrote:
Hi,

A reboot of the MDS stalled and the machine was forcibly reset. After that the MDS would not start. The syslog is attached.

I'm not sure what the "class_register_device()) astro-OST0002-osc-MDT0000" part is supposed to do, but astro-OST0002 is not mounted at this time. I guess this comes from the MGS.

Cheers,
Hans Henrik



Mar 11 12:42:04 mds02 kernel: Lustre: MGS: Logs for fs astro were removed by user request.  All servers must be restarted in order to regenerate the logs: rc = 0
Mar 11 12:42:04 mds02 kernel: Lustre: astro-MDT0000: nosquash_nids set to 172.20.1.10@tcp1
Mar 11 12:42:04 mds02 kernel: Lustre: astro-MDT0000: Imperative Recovery not enabled, recovery window 300-900
Mar 11 12:42:29 mds02 kernel: Lustre: astro-MDT0000: Connection restored to 0d2c198e-514c-3ae5-fc31-48e0424f131d (at 0@lo)
Mar 11 12:42:46 mds02 systemd: Started Session c4 of user root.
Mar 11 12:42:51 mds02 kernel: Lustre: MGS: Connection restored to b11aa8af-1dd3-d728-0e81-6f595456b689 (at 10.21.10.114@o2ib)
Mar 11 12:42:51 mds02 kernel: Lustre: MGS: Regenerating astro-OST0000 log by user request: rc = 0
Mar 11 12:42:58 mds02 kernel: Lustre: 10971:0:(llog_cat.c:93:llog_cat_new_log()) astro-OST0000-osc-MDT0000: there are no more free slots in catalog [0x186:0x1:0x0]:0
Mar 11 12:42:58 mds02 kernel: LustreError: 10971:0:(osp_sync.c:1524:osp_sync_init()) astro-OST0000-osc-MDT0000: can't initialize llog: rc = -28
Mar 11 12:42:58 mds02 kernel: LustreError: 10971:0:(obd_config.c:559:class_setup()) setup astro-OST0000-osc-MDT0000 failed (-28)
Mar 11 12:42:58 mds02 kernel: LustreError: 10971:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.21.10.102@o2ib: cfg command failed: rc = -28
Mar 11 12:42:58 mds02 kernel: Lustre:    cmd=cf003 0:astro-OST0000-osc-MDT0000  1:astro-OST0000_UUID  2:10.21.10.114@o2ib
Mar 11 12:42:58 mds02 kernel: LustreError: 9282:0:(mgc_request.c:599:do_requeue()) failed processing log: -28
Mar 11 12:44:16 mds02 kernel: Lustre: MGS: Connection restored to 9842fe3a-0ff5-afc6-292f-cff60a4897ba (at 10.21.10.115@o2ib)
Mar 11 12:44:16 mds02 kernel: Lustre: Skipped 1 previous similar message
Mar 11 12:44:16 mds02 kernel: Lustre: MGS: Regenerating astro-OST0001 log by user request: rc = 0
Mar 11 12:44:25 mds02 kernel: LustreError: 11466:0:(obd_config.c:764:class_add_conn()) try to add conn on immature client dev

Message from syslogd@mds02 at Mar 11 12:44:25 ...
 kernel:LustreError: 11466:0:(lod_lov.c:244:lod_add_device()) ASSERTION( obd->obd_lu_dev->ld_site == lod->lod_dt_dev.dd_lu_dev.ld_site ) failed:
Mar 11 12:44:25 mds02 kernel: LustreError: 11466:0:(lod_lov.c:244:lod_add_device()) ASSERTION( obd->obd_lu_dev->ld_site == lod->lod_dt_dev.dd_lu_dev.ld_site ) failed:

Message from syslogd@mds02 at Mar 11 12:44:25 ...
 kernel:LustreError: 11466:0:(lod_lov.c:244:lod_add_device()) LBUG
Mar 11 12:44:25 mds02 kernel: LustreError: 11466:0:(lod_lov.c:244:lod_add_device()) LBUG
Mar 11 12:44:25 mds02 kernel: Pid: 11466, comm: llog_process_th 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021
Mar 11 12:44:25 mds02 kernel: Call Trace:
Mar 11 12:44:25 mds02 kernel: [<ffffffffc095a7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc095a87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc0ec0f1a>] lod_add_device+0x195a/0x19a0 [lod]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc0ebb895>] lod_process_config+0x13b5/0x1510 [lod]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc13eeaf2>] class_process_config+0x2142/0x2830 [obdclass]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc13f0db9>] class_config_llog_handler+0x819/0x1520 [obdclass]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc13b37d4>] llog_process_thread+0x8e4/0x19c0 [obdclass]
Mar 11 12:44:25 mds02 kernel: [<ffffffffc13b52c4>] llog_process_thread_daemonize+0xa4/0xe0 [obdclass]
Mar 11 12:44:25 mds02 kernel: [<ffffffff820c5e61>] kthread+0xd1/0xe0
Mar 11 12:44:25 mds02 kernel: [<ffffffff82795ddd>] ret_from_fork_nospec_begin+0x7/0x21
Mar 11 12:44:25 mds02 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
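
To make it easier to follow which thread and source location produced each error in the log above, the entries can be split into fields once the mail client's line wrapping is undone. This is only a rough sketch based on the prefix format visible in these lines (pid, a second numeric field, then file:line:function); the regular expressions and the parse_entry helper are my own illustration, not a Lustre tool.

```python
import re

# Location prefix as it appears in the log above, e.g.
# "10971:0:(osp_sync.c:1524:osp_sync_init()) <message>".
# The second numeric field is matched but not interpreted here.
LOCATION = re.compile(
    r"(?P<pid>\d+):\d+:\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)"
)
RC = re.compile(r"rc = (-?\d+)")


def parse_entry(entry):
    """Split one unwrapped log entry into pid, source location, message and rc.

    Returns None for entries without a Lustre location prefix.
    """
    match = LOCATION.search(entry)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    rc = RC.search(fields["msg"])
    fields["rc"] = int(rc.group(1)) if rc else None
    return fields


# One entry from the log above, with the wrapped lines joined.
sample = (
    "Mar 11 12:42:58 mds02 kernel: LustreError: "
    "10971:0:(osp_sync.c:1524:osp_sync_init()) astro-OST0000-osc-MDT0000: "
    "can't initialize llog: rc = -28"
)
print(parse_entry(sample))
```

Grouping the parsed entries by pid shows that the -28 failures all come from thread 10971, while the assertion and LBUG come from thread 11466.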

_______________________________________________
lustre-discuss mailing list
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
