Hello Simon,
Welcome to the club. This behaviour is a bug in tsctl that changes the DNS names. We hit the same thing about 4 weeks ago. The fix was to update to 5.0.2.1.
Regards, Renar


Sent from my iPhone


Renar Grunenberg
IT Department - Operations

HUK-COBURG
Bahnhofsplatz
96444 Coburg
Phone:        09561 96-44110
Fax:        09561 96-44104
E-Mail: [email protected]
Internet:       www.huk.de

On 11.01.2019 at 15:19, Simon Thompson <[email protected]> wrote:


I’ll start by saying this is our experience; maybe we did something stupid 
along the way, but just in case others see similar issues …

We have a cluster which contains protocol nodes; these were all happily running 
GPFS 5.0.1-2 code. But the cluster was only 4 nodes + 1 quorum node – manager 
and quorum functions were handled by the 4 protocol nodes.

Then one day we needed to reboot a protocol node. We did so, and its disk 
controller appeared to have failed. Oh well, we thought, we’ll fix that another 
day; we still have three other quorum nodes.

As they were all getting a little long in the tooth and starting to struggle, 
we thought: well, we have DME, let’s add some new nodes for quorum and token 
functions. Being shiny and new, they were all installed with GPFS 5.0.2-1 code.

All was well.

Then some time later, we needed to restart another of the CES nodes. When we 
started GPFS on the node, it caused havoc in our cluster – CES IPs were 
constantly being assigned, then removed from the remaining nodes in the 
cluster. Crap, we thought, and disabled the node in the cluster. This made things 
stabilise, and as we’d been having other GPFS issues, we didn’t want service to 
be interrupted whilst we dug into this. Besides, it was nearly Christmas and we 
had conferences and other work to contend with.

More time passes and we’re about to cut over all our backend storage to some 
shiny new DSS-G kit, so we plan a whole-system maintenance window. We finish 
all our data syncs and then try to start our protocol nodes to test them. No 
dice … we can’t get any of the nodes to bring up IPs; the logs look like they 
start the assignment process but then give up.
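
For reference, the assignment state is easy to watch while this is going on. A 
rough sketch, assuming the standard /usr/lpp/mmfs/bin path is on $PATH:

  # which CES IPs are currently assigned, and to which node
  mmces address list

  # per-node state of the CES services
  mmces state show -a

  # the GPFS log is one place to watch the assignment attempts
  tail -f /var/adm/ras/mmfs.log.latest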

After a lot of digging in the mm Korn shell scripts, and some studious use of 
DEBUG=1 when testing, we find that mmcesnetmvaddress is calling “tsctl shownodes 
up”. On our protocol nodes, this gives output of the form:
bear-er-dtn01.bb2.cluster.cluster,rds-aw-ctdb01-data.bb2.cluster.cluster,rds-er-ctdb01-data.bb2.cluster.cluster,bber-irods-ires01-data.bb2.cluster.cluster,bber-irods-icat01-data.bb2.cluster.cluster,bbaw-irods-icat01-data.bb2.cluster.cluster,proto-pg-mgr01.bear.cluster.cluster,proto-pg-pf01.bear.cluster.cluster,proto-pg-dtn01.bear.cluster.cluster,proto-er-mgr01.bear.cluster.cluster,proto-er-pf01.bear.cluster.cluster,proto-aw-mgr01.bear.cluster.cluster,proto-aw-pf01.bear.cluster.cluster

Now our DNS name for these nodes is bb2.cluster … something is repeating the 
DNS name.

So we dig around: resolv.conf, /etc/hosts etc. all look good and name resolution 
seems fine.
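
Roughly, the checks boil down to something like this. A sketch, using one of the 
hostnames from the output above and assuming the standard /usr/lpp/mmfs/bin path:

  # what the GPFS layer thinks the up nodes are called
  /usr/lpp/mmfs/bin/tsctl shownodes up | tr ',' '\n' | sort

  # what the OS resolver returns for one of those nodes
  getent hosts bear-er-dtn01.bb2.cluster

  # and what the cluster configuration itself says
  /usr/lpp/mmfs/bin/mmlscluster | grep -i bear-er-dtn01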

We look around on the manager/quorum nodes and they don’t do this 
cluster.cluster thing. We can’t find anything else Linux-config-wise that looks 
bad. In fact the only difference is that our CES nodes are running 5.0.1-2 and 
the manager nodes 5.0.2-1. Given we’re changing the whole storage hardware, we 
didn’t want to change the GPFS/NFS/SMB code on the CES nodes (we’ve been 
bitten before with SMB packages not working properly in our environment), but 
we go ahead and update the GPFS and NFS packages.

Suddenly, magically all is working again. CES starts fine and IPs get assigned 
OK. And tsctl gives the correct output.

So, my supposition is that there is some incompatibility between 5.0.1-2 and 
5.0.2-1 when running CES while the cluster manager is running on 5.0.2-1. As I 
said before, I don’t have hard evidence, and maybe we did something stupid, but 
it certainly is fishy. We’re guessing this same “feature” was the cause of the 
CES issues we saw when we rebooted a CES node and the IPs kept being deassigned… 
It looks like all was well because we added the manager nodes after CES was 
started, but when a CES node restarted, things broke.
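
If you’re about to mix versions like this, it’s worth checking what each node is 
actually running first. A rough sketch, assuming RPM-based nodes and passwordless 
ssh (the node names are just examples from our cluster):

  for n in proto-er-mgr01 proto-er-pf01 bear-er-dtn01; do
    echo "== $n =="
    ssh "$n" 'rpm -q gpfs.base; /usr/lpp/mmfs/bin/mmdiag --version'
  done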

We got everything working again in-house, so we didn’t raise a PMR, but if you 
find yourself on this upgrade path, beware!

Simon


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
