I'm happy to report that the problem seems to be solved by deleting the CATALOGS file on the underlying MDT ZFS file system. As I gather from the manual [1], this should not be a problem, because it will be handled by LFSCK.

If I'm wrong about this, please let me know. Also, I'm happy to provide any information from this MDT to help assess whether there is a bug somewhere.

LFSCK is running as we speak.
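
For anyone following along, here is a minimal sketch of how the progress of such a run could be checked from the MDS. It assumes the target is named astro-MDT0000 (inferred from the mount point in this thread) and that the namespace component is the one of interest; the helper names lfsck_status and show_lfsck_status are made up for illustration.

```shell
#!/bin/sh
# Sketch only: report the state of an LFSCK component on an MDT.
# Assumptions: target name astro-MDT0000 (from this thread's mount point),
# component lfsck_namespace (lfsck_layout is queried the same way).

# Extract the value of the "status:" line from LFSCK parameter output on stdin.
lfsck_status() {
    awk '$1 == "status:" { print $2; exit }'
}

# Query the MDT through lctl. Must be run on the MDS; nothing calls this
# automatically, so sourcing the file has no side effects.
show_lfsck_status() {
    target="${1:-astro-MDT0000}"
    component="${2:-lfsck_namespace}"
    lctl get_param -n "mdd.${target}.${component}" | lfsck_status
}
```

Calling `show_lfsck_status astro-MDT0000` on the MDS should print a single word such as scanning-phase1 while the scan is running, or completed once it has finished.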

Cheers,
Hans Henrik

[1] https://doc.lustre.org/lustre_manual.xhtml#backup_fs_level.restore

On 11.03.2022 12.49, Hans Henrik Happe via lustre-discuss wrote:
I tried running tunefs.lustre --erase-params --writeconf on the targets. I guess this is not ideal because the clients were not unmounted, but I made sure they were not trying to connect.

This makes it possible to mount the MDT, but when the first OST mount starts, the MDT reports a lot of errors. After starting the second OST, the MDS crashes (syslog attached).

Cheers,
Hans Henrik

On 10.03.2022 15.48, Hans Henrik Happe via lustre-discuss wrote:
Sorry for all the mail load, but I hope this info can help figure out what's wrong and determine whether this was caused by a bug.

I read the CONFIGS on the MDT with llog_reader. See attachments.

Cheers,
Hans Henrik

On 10.03.2022 12.23, Hans Henrik Happe via lustre-discuss wrote:
After upgrading to Lustre 2.12.8 I found that the first mount after a reboot behaves differently:

Mounting mds02/astro0 on /mnt/lustre/local/astro-MDT0000
mount.lustre: mount mds02/astro0 at /mnt/lustre/local/astro-MDT0000 failed: No space left on device

And a different syslog output (attached syslog-0).

Doing the mount again has this error:

Mounting mds02/astro0 on /mnt/lustre/local/astro-MDT0000
mount.lustre: mount mds02/astro0 at /mnt/lustre/local/astro-MDT0000 failed: File exists

And a syslog like the one first posted. Attached the new output in syslog-1.

Finally, stopping Lustre (only the MGS in this case) and the lnet service does not free all resources, which makes lustre_rmmod fail:

# lustre_rmmod
rmmod: ERROR: Module osp is in use


Cheers,
Hans Henrik

On 10.03.2022 11.15, Hans Henrik Happe via lustre-discuss wrote:
Forgot to say this is Lustre 2.12.6 and CentOS 7.9 (3.10.0-1160.6.1.el7.x86_64).

On 10.03.2022 10.27, Hans Henrik Happe via lustre-discuss wrote:
Hi,

A reboot of the MDS stalled and the machine had to be forcibly reset. After that, the MDS would not start. The syslog is attached.

I'm not sure what the "class_register_device()) astro-OST0002-osc-MDT0000" part is supposed to do, but astro-OST0002 is not mounted at this time. I guess this comes from the MGS.

Cheers,
Hans Henrik





_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org