Hi! Yes, thank you very much. Finally, after recreating it and putting data on it, we realized we never rebooted the IO nodes!!! This is the answer, or at least a calming, plausible explanation 🙂

Have a nice weekend
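For the archive: a minimal sketch of restarting the IO nodes one at a time, so the recovery group keeps an active server during the maintenance. This is a hedged outline, not a documented ESS procedure; the node names are the ones used later in this thread:

  mmshutdown -N ess-n1-hs       # stop GPFS on the first canister
  # reboot ess-n1-hs at the OS level and wait for it to come back
  mmstartup -N ess-n1-hs        # rejoin the cluster
  mmgetstate -N ess-n1-hs       # confirm the node reports "active" before touching the peer
  # then repeat the same steps for ess-n2-hs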
From: gpfsug-discuss <[email protected]> On Behalf Of Jan-Frode Myklebust
Sent: Thursday, 24 August 2023 21:56
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently

mmvdisk rg change --active is a very common operation. It should be perfectly safe. mmvdisk rg change --restart is an option I didn't know about, so likely not something that's commonly used.

I wouldn't be too worried about losing the RGs. I don't think that's something that can happen without support being able to help getting it back online. I once had a situation similar to yours, with the RG not wanting to become active again during an upgrade (around 5 years ago), and I believe we solved it by rebooting the io-nodes; must have been some stuck process I was unable to understand... or was it a CCR issue caused by some nodes being way back-level? Don't remember.

  -jf

On Thu, 24 Aug 2023 at 20:22, Walter Sklenka <[email protected]<mailto:[email protected]>> wrote:

Hi Jan-Frode!
We did the "switch" with

  mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --active ess-n2-hs

Both nodes were up and we did not see any anomalies, and the rg had been created successfully with its log groups.
Maybe this method of switching the rg (with --active) is a bad idea? The manual says (https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=command-mmvdisk-recoverygroup):

"For a shared recovery group, the mmvdisk recoverygroup change --active Node command means to make the specified node the server for all four user log groups and the root log group. The specified node therefore temporarily becomes the sole active server for the entire shared recovery group, leaving the other server idle. This should only be done in unusual maintenance situations, since it is normally considered an error condition for one of the servers of a shared recovery group to be idle. If the keyword DEFAULT is used instead of a server name, it restores the normal default balance of log groups, making each of the two servers responsible for two user log groups."

This was the state before we tried the restart; nothing is seen in the logs, and we got "unable to reset server list":

~]$ sudo mmvdisk server list --rg ess3500_ess_n1_hs_ess_n2_hs

  node
 number  server      active  remarks
 ------  ----------  ------  ----------
     98  ess-n1-hs   yes     configured
     99  ess-n2-hs   yes     configured

~]$ sudo mmvdisk recoverygroup list --rg ess3500_ess_n1_hs_ess_n2_hs

                                                                                                       needs    user
 recovery group               node class                            active  current or master server  service  vdisks  remarks
 ---------------------------  ------------------------------------  ------  ------------------------  -------  ------  -------
 ess3500_ess_n1_hs_ess_n2_hs  ess3500_mmvdisk_ess_n1_hs_ess_n2_hs   no      -                          unknown       0

~]$ sudo mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --restart
mmvdisk:
mmvdisk:
mmvdisk: Unable to reset server list for recovery group 'ess3500_ess_n1_hs_ess_n2_hs'.
mmvdisk: Command failed. Examine previous error messages to determine cause.

Well, in the logs we did not find anything. And finally we had to delete the rg, because we urgently needed the new space. With the new one we tested again: we did mmshutdown/mmstartup, and also used the --active flag, and everything went OK.
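A side note for later readers: the way back from a one-sided --active assignment is the DEFAULT keyword described in the doc text quoted above. A minimal sketch, assuming the recovery group and node names from this thread:

  # restore the normal balance of log groups across both canisters
  sudo mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --active DEFAULT

  # verify that the recovery group is active and both servers are configured again
  sudo mmvdisk recoverygroup list --rg ess3500_ess_n1_hs_ess_n2_hs
  sudo mmvdisk server list --rg ess3500_ess_n1_hs_ess_n2_hs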
And now we have data on the rg. But we are concerned that this might happen again sometime and that we might not be able to re-enable the rg, leading to a disaster. So if you have any idea I would appreciate it very much 🙂

Best regards
Walter

From: gpfsug-discuss <[email protected]<mailto:[email protected]>> On Behalf Of Jan-Frode Myklebust
Sent: Thursday, 24 August 2023 14:51
To: gpfsug main discussion list <[email protected]<mailto:[email protected]>>
Subject: Re: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently

It does sound like "mmvdisk rg change --restart" is the "varyon" command you're looking for... but it's not clear why it's failing.

I would start by looking at whether there are any lower-level issues with your cluster. Are your nodes healthy on a GPFS level? Does "mmnetverify -N all" say the network is OK? Is "mmhealth node show -N all" not indicating any issues? Check mmfs.log.latest?

On Thu, Aug 24, 2023 at 1:41 PM Walter Sklenka <[email protected]<mailto:[email protected]>> wrote:

Hi!
Does someone perhaps have experience with ESS 3500 (no hybrid config, only NL-SAS with 5 enclosures)?
We have issues with a shared recovery group. After creating it, we made a test of setting only one node active (maybe not an optimal idea). But since then the recovery group is down. We have opened a PMR but have not received any response so far. The rg has no vdisks of any filesystem.

[gpfsadmin@hgess02-m ~]$ sudo mmvdisk rg change --rg ess3500_hgess02_n1_hs_hgess02_n2_hs --restart
mmvdisk:
mmvdisk:
mmvdisk: Unable to reset server list for recovery group 'ess3500_hgess02_n1_hs_hgess02_n2_hs'.
mmvdisk: Command failed. Examine previous error messages to determine cause.

We also tried:

2023-08-21_16:57:26.174+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid
2023-08-21_16:57:26.201+0200: [I] Recovery group ess3500_hgess02_n1_hs_hgess02_n2_hs has resigned permanently
2023-08-21_16:57:26.201+0200: [E] Command: err 2: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid
2023-08-21_16:57:26.201+0200: Specified entity, such as a disk or file system, does not exist.
2023-08-21_16:57:26.207+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid
2023-08-21_16:57:26.207+0200: [E] Command: err 212: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid
2023-08-21_16:57:26.207+0200: The current file system manager failed and no new manager will be appointed. This may cause nodes mounting the file system to experience mount failures.
2023-08-21_16:57:26.213+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid
2023-08-21_16:57:26.213+0200: [E] Command: err 212: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid
2023-08-21_16:57:26.213+0200: The current file system manager failed and no new manager will be appointed. This may cause nodes mounting the file system to experience mount failures.

For us it is crucial to know what we can do if this happens again (it has no vdisks yet, so it is not critical). Do you know: is there a non-documented way to "vary on", or activate, a recovery group again?
The doc (https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=rgi-recovery-group-issues-shared-recovery-groups-in-ess) says to mmshutdown and mmstartup, but the RGCM says nothing. When trying to execute any vdisk command it only says "rg down"; we have no idea how we could recover from that without deleting the rg (I hope it will never happen once we have vdisks on it).

Have a nice day
Walter

Kind regards

Walter Sklenka
Technical Consultant

EDV-Design Informationstechnologie GmbH
Giefinggasse 6/1/2, A-1210 Wien
Tel: +43 1 29 22 165-31
Fax: +43 1 29 22 165-90
E-Mail: [email protected]
Internet: www.edv-design.at
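A minimal sketch of the checks Jan-Frode suggests above, plus the retry path via mmshutdown/mmstartup that the linked troubleshooting page mentions. The node names are the hgess02 ones from this thread (hgess02-n1-hs is inferred from the recovery group name), and actual recovery of a downed shared RG should still be driven by IBM support:

  # basic cluster health before retrying anything on the recovery group
  mmgetstate -a                              # all nodes should report "active"
  mmnetverify -N all                         # network connectivity between the nodes
  mmhealth node show -N all                  # component health on every node
  tail -n 200 /var/adm/ras/mmfs.log.latest   # on both IO canisters: look for RG / log group errors

  # restart GPFS on the RG server nodes, then retry the recovery group restart
  mmshutdown -N hgess02-n1-hs,hgess02-n2-hs
  mmstartup -N hgess02-n1-hs,hgess02-n2-hs
  sudo mmvdisk rg change --rg ess3500_hgess02_n1_hs_hgess02_n2_hs --restart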
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
