Hi! Yes, thank you very much. Finally, after recreating it and putting data on it, we realized we never rebooted the IO nodes!!! This is the answer, or at least a calming, plausible explanation 🙂

Have a nice weekend
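For the archive: a minimal sketch of restarting the IO nodes one at a time, so the recovery group keeps an active server during the maintenance. This is a hedged outline, not a documented ESS procedure; the node names are the ones used later in this thread:

  mmshutdown -N ess-n1-hs       # stop GPFS on the first canister
  # reboot ess-n1-hs at the OS level and wait for it to come back
  mmstartup -N ess-n1-hs        # rejoin the cluster
  mmgetstate -N ess-n1-hs       # confirm the node reports "active" before touching the peer
  # then repeat the same steps for ess-n2-hs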
From: gpfsug-discuss <[email protected]> On Behalf Of Jan-Frode Myklebust
Sent: Thursday, 24 August 2023 21:56
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently

mmvdisk rg change --active is a very common operation. It should be perfectly safe. mmvdisk rg change --restart is an option I didn't know about, so likely not something that's commonly used.

I wouldn't be too worried about losing the RGs. I don't think that's something that can happen without support being able to help getting it back online. I once had a situation similar to yours, with the RG not wanting to become active again during an upgrade (around 5 years ago), and I believe we solved it by rebooting the io-nodes; must have been some stuck process I was unable to understand... or was it a CCR issue caused by some nodes being way back-level? Don't remember.

  -jf

On Thu, 24 Aug 2023 at 20:22, Walter Sklenka <[email protected]<mailto:[email protected]>> wrote:

Hi Jan-Frode!
We did the "switch" with

  mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --active ess-n2-hs

Both nodes were up and we did not see any anomalies, and the rg had been created successfully with its log groups.
Maybe this method of switching the rg (with --active) is a bad idea? The manual says (https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=command-mmvdisk-recoverygroup):

"For a shared recovery group, the mmvdisk recoverygroup change --active Node command means to make the specified node the server for all four user log groups and the root log group. The specified node therefore temporarily becomes the sole active server for the entire shared recovery group, leaving the other server idle. This should only be done in unusual maintenance situations, since it is normally considered an error condition for one of the servers of a shared recovery group to be idle. If the keyword DEFAULT is used instead of a server name, it restores the normal default balance of log groups, making each of the two servers responsible for two user log groups."

This was the state before we tried the restart; nothing is seen in the logs, and we got "unable to reset server list":

~]$ sudo mmvdisk server list --rg ess3500_ess_n1_hs_ess_n2_hs

  node
 number  server      active  remarks
 ------  ----------  ------  ----------
     98  ess-n1-hs   yes     configured
     99  ess-n2-hs   yes     configured

~]$ sudo mmvdisk recoverygroup list --rg ess3500_ess_n1_hs_ess_n2_hs

                                                                                                       needs    user
 recovery group               node class                            active  current or master server  service  vdisks  remarks
 ---------------------------  ------------------------------------  ------  ------------------------  -------  ------  -------
 ess3500_ess_n1_hs_ess_n2_hs  ess3500_mmvdisk_ess_n1_hs_ess_n2_hs   no      -                          unknown       0

~]$ sudo mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --restart
mmvdisk:
mmvdisk:
mmvdisk: Unable to reset server list for recovery group 'ess3500_ess_n1_hs_ess_n2_hs'.
mmvdisk: Command failed. Examine previous error messages to determine cause.

Well, in the logs we did not find anything. And finally we had to delete the rg, because we urgently needed the new space. With the new one we tested again: we did mmshutdown/mmstartup, and also used the --active flag, and everything went OK.
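A side note for later readers: the way back from a one-sided --active assignment is the DEFAULT keyword described in the doc text quoted above. A minimal sketch, assuming the recovery group and node names from this thread:

  # restore the normal balance of log groups across both canisters
  sudo mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --active DEFAULT

  # verify that the recovery group is active and both servers are configured again
  sudo mmvdisk recoverygroup list --rg ess3500_ess_n1_hs_ess_n2_hs
  sudo mmvdisk server list --rg ess3500_ess_n1_hs_ess_n2_hs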
And now we have data on the rg. But we are concerned that this might happen again sometime and that we might not be able to re-enable the rg, leading to a disaster. So if you have any idea I would appreciate it very much 🙂

Best regards
Walter

From: gpfsug-discuss <[email protected]<mailto:[email protected]>> On Behalf Of Jan-Frode Myklebust
Sent: Thursday, 24 August 2023 14:51
To: gpfsug main discussion list <[email protected]<mailto:[email protected]>>
Subject: Re: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently

It does sound like "mmvdisk rg change --restart" is the "varyon" command you're looking for... but it's not clear why it's failing.

I would start by looking at whether there are any lower-level issues with your cluster. Are your nodes healthy on a GPFS level? Does "mmnetverify -N all" say the network is OK? Is "mmhealth node show -N all" not indicating any issues? Check mmfs.log.latest?

On Thu, Aug 24, 2023 at 1:41 PM Walter Sklenka <[email protected]<mailto:[email protected]>> wrote:

Hi!
Does someone perhaps have experience with ESS 3500 (no hybrid config, only NL-SAS with 5 enclosures)?
We have issues with a shared recovery group. After creating it, we made a test of setting only one node active (maybe not an optimal idea). But since then the recovery group is down. We have opened a PMR but have not received any response so far. The rg has no vdisks of any filesystem.

[gpfsadmin@hgess02-m ~]$ sudo mmvdisk rg change --rg ess3500_hgess02_n1_hs_hgess02_n2_hs --restart
mmvdisk:
mmvdisk:
mmvdisk: Unable to reset server list for recovery group 'ess3500_hgess02_n1_hs_hgess02_n2_hs'.
mmvdisk: Command failed. Examine previous error messages to determine cause.

We also tried:

2023-08-21_16:57:26.174+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid
2023-08-21_16:57:26.201+0200: [I] Recovery group ess3500_hgess02_n1_hs_hgess02_n2_hs has resigned permanently
2023-08-21_16:57:26.201+0200: [E] Command: err 2: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid
2023-08-21_16:57:26.201+0200: Specified entity, such as a disk or file system, does not exist.
2023-08-21_16:57:26.207+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid
2023-08-21_16:57:26.207+0200: [E] Command: err 212: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid
2023-08-21_16:57:26.207+0200: The current file system manager failed and no new manager will be appointed. This may cause nodes mounting the file system to experience mount failures.
2023-08-21_16:57:26.213+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid
2023-08-21_16:57:26.213+0200: [E] Command: err 212: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid
2023-08-21_16:57:26.213+0200: The current file system manager failed and no new manager will be appointed. This may cause nodes mounting the file system to experience mount failures.

For us it is crucial to know what we can do if this happens again (it has no vdisks yet, so it is not critical). Do you know: is there a non-documented way to "vary on", or activate, a recovery group again?
The doc (https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=rgi-recovery-group-issues-shared-recovery-groups-in-ess) says to mmshutdown and mmstartup, but the RGCM says nothing. When trying to execute any vdisk command it only says "rg down"; we have no idea how we could recover from that without deleting the rg (I hope it will never happen once we have vdisks on it).

Have a nice day
Walter

Kind regards

Walter Sklenka
Technical Consultant

EDV-Design Informationstechnologie GmbH
Giefinggasse 6/1/2, A-1210 Wien
Tel: +43 1 29 22 165-31
Fax: +43 1 29 22 165-90
E-Mail: [email protected]
Internet: www.edv-design.at
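A minimal sketch of the checks Jan-Frode suggests above, plus the retry path via mmshutdown/mmstartup that the linked troubleshooting page mentions. The node names are the hgess02 ones from this thread (hgess02-n1-hs is inferred from the recovery group name), and actual recovery of a downed shared RG should still be driven by IBM support:

  # basic cluster health before retrying anything on the recovery group
  mmgetstate -a                              # all nodes should report "active"
  mmnetverify -N all                         # network connectivity between the nodes
  mmhealth node show -N all                  # component health on every node
  tail -n 200 /var/adm/ras/mmfs.log.latest   # on both IO canisters: look for RG / log group errors

  # restart GPFS on the RG server nodes, then retry the recovery group restart
  mmshutdown -N hgess02-n1-hs,hgess02-n2-hs
  mmstartup -N hgess02-n1-hs,hgess02-n2-hs
  sudo mmvdisk rg change --rg ess3500_hgess02_n1_hs_hgess02_n2_hs --restart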
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
