Hello,

I think you hit the following bug:
https://jira.whamcloud.com/browse/LU-15000
MDS crashes with (osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 ) failed

Stephane Thiell reported this issue and fixed it by patching his 2.12.7 version with https://review.whamcloud.com/46552 (2.15 backport: https://review.whamcloud.com/47515).
A backport for the b2_15 branch has been submitted but has not yet landed:
https://review.whamcloud.com/c/fs/lustre-release/+/48898

You could also check his LAD presentation about removing OSTs (lctl del_ost):
"A filesystem coming of age: live hardware upgrade practices at Stanford Research Computing"
(https://www.eofs.eu/_media/events/lad22/2.5-stanfordrc_s_thiell.pdf)

Etienne AUJAMES

On Tue, 2022-10-25 at 10:12 +0000, Redl, Robert wrote:
> Dear Lustre Experts,
>
> Some time ago we removed an OST. We followed the instructions from
> the documentation
> (https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost),
> including cleaning up the logs from all related entries using
> llog_cancel. After the removal the system worked normally.
>
> Now we are trying to add a new OST reusing the same index. If the OST
> is created with mkfs.lustre --replace, then it is possible to mount
> the OST, but it is no longer possible to mount the whole filesystem.
> A client would see the following error messages:
>
> kernel: LustreError: 70451:0:(obd_config.c:1499:class_process_config()) no device for: project-OST0007-osc-ffff914108c2e800
> kernel: LustreError: 70451:0:(obd_config.c:2001:class_config_llog_handler()) MGC10.163.52.14@tcp: cfg command failed: rc = -22
> kernel: Lustre: cmd=cf00b 0:project-OST0007-osc 1: 10.163.52.20@tcp
> kernel: LustreError: 1760:0:(mgc_request.c:612:do_requeue()) failed processing log: -22
>
> In order to make the filesystem mountable again, all log entries
> created by mounting the OST must be removed using llog_cancel.
>
> If the OST is created using mkfs.lustre without --replace, then the
> OST itself is not mountable. The following error messages are shown:
>
> kernel: LustreError: 140-5: Server project-OST0007 requested index 7, but that index is already in use. Use --writeconf to force
> kernel: LustreError: 7302:0:(mgs_handler.c:503:mgs_target_reg()) Failed to write project-OST0007 log (-98)
>
> The --writeconf suggested in the error message requires a full
> shutdown of the system, so we would like to avoid it.
>
> I wonder if we overlooked something when the OST was removed.
> The logs for project-client, project-MDT0000, and project-MDT0001 no
> longer show any traces of the old OST. Is there anything more
> that needs to be done to make Lustre forget that an OST with a given
> index existed at some point?
>
> Lustre Version: 2.15.1, ZFS backend.
>
> Thanks a lot!
> Robert
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
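
As a side note on reading the quoted kernel messages: the trailing numbers (rc = -22 and -98) are negated Linux errno values, i.e. EINVAL and EADDRINUSE. The following is only a minimal sketch for decoding them; the helper name decode_rc, the regular expression, and the two-entry table are illustrative assumptions and not part of any Lustre tool.

```python
import re

# Linux errno numbers seen in the quoted messages (assumes Linux numbering;
# the table is deliberately limited to those two values).
LINUX_ERRNO = {
    22: "EINVAL",       # Invalid argument
    98: "EADDRINUSE",   # Address already in use
}

# Matches the three spellings in the quoted lines:
# "rc = -22", "log: -22" and "log (-98)".
RC_PATTERN = re.compile(r"(?:rc = |log: |log \()-(\d+)")


def decode_rc(line):
    """Return the errno name for a log line, or None if no code is found."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return LINUX_ERRNO.get(code, "UNKNOWN")


if __name__ == "__main__":
    samples = [
        "MGC10.163.52.14@tcp: cfg command failed: rc = -22",
        "(mgs_handler.c:503:mgs_target_reg()) Failed to write project-OST0007 log (-98)",
    ]
    for sample in samples:
        print(decode_rc(sample), "<-", sample)
```

Read this way, the client-side failure is an invalid-argument error while processing the configuration log, and the MGS-side failure says the requested index is still registered as in use.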
