Dear Etienne, thanks a lot! We do not actually have the MDS crashes described in LU-15000, but we do of course have several index gaps caused by llog_cancel.
Is it necessary to have this patch on all servers, or is only the MGS affected?

About mkfs.lustre --replace: why is --replace required if all traces of the old OST have been removed from the config log? Are indices that have been used before stored somewhere else?

Best regards,
Robert

> On 25.10.2022 at 14:15, Etienne Aujames <[email protected]> wrote:
>
> Hello,
>
> I think you hit the following bug:
> https://jira.whamcloud.com/browse/LU-15000 MDS crashes with
> (osp_dev.c:1404:osp_obd_connect()) ASSERTION( osp->opd_connects == 1 )
> failed
>
> Stephane Thiell reported this issue and fixed it by patching his 2.12.7
> version with https://review.whamcloud.com/46552 (2.15 backport:
> https://review.whamcloud.com/47515).
>
> A backport is issued for b2_15 branch but not yet landed:
> https://review.whamcloud.com/c/fs/lustre-release/+/48898
>
> You could also check his LAD's presentation about removing OSTs (lctl
> del_ost):
> "A filesystem coming of age: live hardware upgrade practices at
> Stanford Research Computing" (
> https://www.eofs.eu/_media/events/lad22/2.5-stanfordrc_s_thiell.pdf)
>
> Etienne AUJAMES
>
> On Tue, 2022-10-25 at 10:12 +0000, Redl, Robert wrote:
>> Dear Lustre Experts,
>>
>> some time ago we removed an OST. We followed the instructions from
>> the documentation (
>> https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost
>> ) including cleaning up the logs from all related entries using
>> llog_cancel. After the removal the system worked normally.
>>
>> Now we are trying to add a new OST reusing the same index. If the OST
>> is created with mkfs.lustre --replace, then it is possible to mount
>> the OST, but it is not possible to mount the whole filesystem
>> anymore. A client would see the following error message:
>>
>> kernel: LustreError:
>> 70451:0:(obd_config.c:1499:class_process_config()) no device for:
>> project-OST0007-osc-ffff914108c2e800
>> kernel: LustreError:
>> 70451:0:(obd_config.c:2001:class_config_llog_handler())
>> MGC10.163.52.14@tcp: cfg command failed: rc = -22
>> kernel: Lustre: cmd=cf00b 0:project-OST0007-osc 1:
>> 10.163.52.20@tcp
>> kernel: LustreError: 1760:0:(mgc_request.c:612:do_requeue()) failed
>> processing log: -22
>>
>> In order to make the filesystem mountable again, all log entries
>> created by mounting the OST must be removed using llog_cancel.
>>
>> If the OST is created using mkfs.lustre without --replace, then the
>> OST itself is not mountable. The following error message is shown:
>>
>> kernel: LustreError: 140-5: Server project-OST0007 requested index 7,
>> but that index is already in use. Use --writeconf to force
>> kernel: LustreError: 7302:0:(mgs_handler.c:503:mgs_target_reg())
>> Failed to write project-OST0007 log (-98)
>>
>> Given that the --writeconf suggested in the error message requires a
>> full shutdown of the system, we would like to avoid that.
>>
>> I wonder if we maybe overlooked something when the OST was removed.
>> The logs for project-client, project-MDT0000, and project-MDT0001 are
>> not showing any traces of the old OST anymore. Is there anything more
>> that needs to be done to make Lustre forget that an OST with a given
>> index existed at some point?
>>
>> Lustre Version: 2.15.1, ZFS-backend.
>>
>> Thanks a lot!
>> Robert
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
