----- Forwarded by Eric Agar/Poughkeepsie/IBM on 09/13/2017 05:32 PM -----

From: IBM Spectrum Scale/Poughkeepsie/IBM
To: Michal Zacek <[email protected]>
Date: 09/13/2017 05:29 PM
Subject: Re: [gpfsug-discuss] Wrong nodename after server restart
Sent by: Eric Agar
Hello Michal,

It should not be necessary to delete whale.img.cas.cz and rename it, but that is an option you can take if you prefer it. If you decide to take that option, please see the last paragraph of this response.

The confusion starts the moment a node is added to the active cluster that does not have the same common domain suffix as the nodes already in the cluster. The confusion increases when the GPFS daemons on some nodes, but not all nodes, are recycled. Doing mmshutdown -a, followed by mmstartup -a, once after the new node has been added allows all GPFS daemons on all nodes to come up at the same time and arrive at the same answer to the question, "What is the common domain suffix for all the nodes in the cluster now?" In the case of your cluster, the answer will be "the common domain suffix is the empty string" or, put another way, "there is no common domain suffix"; that is okay, as long as all the GPFS daemons come to the same conclusion. After you recycle the cluster, you can check that all is well by running "tsctl shownodes up" on every node and making sure the answer is correct on each node.

If the mmshutdown -a / mmstartup -a recycle works, the problem should not recur with the current set of nodes in the cluster. Even as individual GPFS daemons are recycled going forward, they should still understand that the cluster's nodes have no common domain suffix.

However, I can imagine sequences of events that would cause the issue to occur again after nodes are deleted from or added to the cluster while the cluster is active. For example, if whale.img.cas.cz were deleted from the current cluster, that action would restore the cluster to having a common domain suffix of ".img.local", but already-running GPFS daemons would not realize it. If the delete of whale occurred while the cluster was active, subsequently recycling the GPFS daemon on just a subset of the nodes would cause the recycled daemons to understand the common domain suffix to now be ".img.local", while daemons that had not been recycled would still think there is no common domain suffix. The confusion would occur again. On the other hand, adding and deleting nodes to/from the cluster should not cause the issue to recur as long as the cluster continues to have the same (in this case, no) common domain suffix.

If you decide to delete whale.img.cas.cz, rename it to have the ".img.local" domain suffix, and add it back to the cluster, it would be best to do so after all the GPFS daemons are shut down with mmshutdown -a, but before any of the daemons are restarted with mmstartup. This would allow all the subsequently running daemons to come to the conclusion that ".img.local" is now the common domain suffix.

I hope this helps.

Regards, Eric Agar

Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------
If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract, please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.
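Put concretely, the recycle and optional rename that Eric describes might look like the following sketch. It is illustrative only, not an official procedure: it assumes the GPFS commands are on the PATH (they typically live in /usr/lpp/mmfs/bin), that passwordless ssh to each node is configured, and that a short full-cluster outage is acceptable; "nodes.txt" is a hypothetical file listing one node name per line, and whale.img.local is a hypothetical new name.

    # 1. Stop the GPFS daemon on every node at once.
    mmshutdown -a

    # Optional rename path (per Eric's last paragraph): while all daemons
    # are down, remove whale, rename the host, and add it back.
    #   mmdelnode -N whale.img.cas.cz
    #   ... change the host's name to the .img.local suffix in DNS/hosts ...
    #   mmaddnode -N whale.img.local      # hypothetical new name

    # 2. Start all daemons together so each one computes the same
    #    common domain suffix, then confirm they are active.
    mmstartup -a
    mmgetstate -a

    # 3. Verify that every node reports the same, correct node list.
    while read -r node; do
        echo "== $node =="
        ssh "$node" '/usr/lpp/mmfs/bin/tsctl shownodes up' | tr ',' '\n'
    done < nodes.txt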
From: Michal Zacek <[email protected]>
To: IBM Spectrum Scale <[email protected]>
Date: 09/13/2017 03:42 AM
Subject: Re: [gpfsug-discuss] Wrong nodename after server restart

Hello,

Yes, you are correct, whale was added two days ago. Is it necessary to delete whale.img.cas.cz from the cluster before the mmshutdown/mmstartup? If the two domains may cause problems in the future, I can rename whale (and all planned nodes) to the img.local suffix. Many thanks for the prompt reply.

Regards,
Michal

On 12.9.2017 at 17:01, IBM Spectrum Scale wrote:

Michal,

When a node is added to a cluster that has a different domain than the rest of the nodes in the cluster, the GPFS daemons running on the various nodes can develop an inconsistent understanding of what the common suffix of all the domain names is. The symptoms you show with the "tsctl shownodes up" output, and in particular the incorrect node names of the two nodes you restarted as seen from a node you did not restart, are consistent with this problem. I also note that your cluster appears to have the necessary pre-condition to trip on this problem: whale.img.cas.cz does not share a common suffix with the other nodes in the cluster, whose common suffix is ".img.local". Was whale.img.cas.cz recently added to the cluster?

Unfortunately, the general work-around is to recycle all the nodes at once: mmshutdown -a, followed by mmstartup -a.

I hope this helps.

Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------
If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract, please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.

From: Michal Zacek <[email protected]>
To: [email protected]
Date: 09/12/2017 05:41 AM
Subject: [gpfsug-discuss] Wrong nodename after server restart
Sent by: [email protected]

Hi,

I had to restart two of my GPFS servers (gpfs-n4 and gpfs-quorum), and after that I was unable to move the CES IP address back, getting the strange error "mmces address move: GPFS is down on this node". After I double-checked that the GPFS state is active on all nodes, I dug deeper and I think I found the problem, but I don't really know how this could have happened.
Look at the names of the nodes:

[root@gpfs-n2 ~]# mmlscluster    # Looks good

GPFS cluster information
========================
  GPFS cluster name:         gpfscl1.img.local
  GPFS cluster id:           17792677515884116443
  GPFS UID domain:           img.local
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
  Repository type:           CCR

 Node  Daemon node name       IP address       Admin node name        Designation
----------------------------------------------------------------------------------
   1   gpfs-n4.img.local      192.168.20.64    gpfs-n4.img.local      quorum-manager
   2   gpfs-quorum.img.local  192.168.20.60    gpfs-quorum.img.local  quorum
   3   gpfs-n3.img.local      192.168.20.63    gpfs-n3.img.local      quorum-manager
   4   tau.img.local          192.168.1.248    tau.img.local
   5   gpfs-n1.img.local      192.168.20.61    gpfs-n1.img.local      quorum-manager
   6   gpfs-n2.img.local      192.168.20.62    gpfs-n2.img.local      quorum-manager
   8   whale.img.cas.cz       147.231.150.108  whale.img.cas.cz

[root@gpfs-n2 ~]# mmlsmount gpfs01 -L    # not so good

File system gpfs01 is mounted on 7 nodes:
  192.168.20.63    gpfs-n3
  192.168.20.61    gpfs-n1
  192.168.20.62    gpfs-n2
  192.168.1.248    tau
  192.168.20.64    gpfs-n4.img.local
  192.168.20.60    gpfs-quorum.img.local
  147.231.150.108  whale.img.cas.cz

[root@gpfs-n2 ~]# tsctl shownodes up | tr ',' '\n'    # very wrong
whale.img.cas.cz.img.local
tau.img.local
gpfs-quorum.img.local.img.local
gpfs-n1.img.local
gpfs-n2.img.local
gpfs-n3.img.local
gpfs-n4.img.local.img.local

The "tsctl shownodes up" output is the reason I'm not able to move the CES address back to the gpfs-n4 node, but the real problem is the different node names. I think the OS is configured correctly:

[root@gpfs-n4 /]# hostname
gpfs-n4
[root@gpfs-n4 /]# hostname -f
gpfs-n4.img.local
[root@gpfs-n4 /]# cat /etc/resolv.conf
nameserver 192.168.20.30
nameserver 147.231.150.2
search img.local
domain img.local
[root@gpfs-n4 /]# cat /etc/hosts | grep gpfs-n4
192.168.20.64   gpfs-n4.img.local   gpfs-n4
[root@gpfs-n4 /]# host gpfs-n4
gpfs-n4.img.local has address 192.168.20.64
[root@gpfs-n4 /]# host 192.168.20.64
64.20.168.192.in-addr.arpa domain name pointer gpfs-n4.img.local.

Can someone help me with this?

Thanks,
Michal

p.s. GPFS version: 4.2.3-2 (CentOS 7)

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

--
Michal Žáček | Information Technologies
+420 296 443 128
+420 296 443 333
[email protected]
www.img.cas.cz

Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20 Prague 4, Czech Republic
ID: 68378050 | VAT ID: CZ68378050
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
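For anyone hitting the same symptoms as the diagnostics above, a quick way to compare the OS-level view and the GPFS daemon's view on every node is a loop like the sketch below. It is a rough, hedged example: the NODES list is hypothetical and should be adjusted to your cluster, and it assumes passwordless ssh and the usual /usr/lpp/mmfs/bin location for tsctl.

    # Hypothetical node list; adjust for your cluster.
    NODES="gpfs-n1 gpfs-n2 gpfs-n3 gpfs-n4 gpfs-quorum tau whale.img.cas.cz"

    for n in $NODES; do
        echo "===== $n ====="
        # OS-level naming: short name, FQDN, and forward DNS lookup.
        ssh "$n" 'hostname; hostname -f; host "$(hostname -f)"'
        # Daemon-level view: should match mmlscluster, with no doubled
        # suffixes such as gpfs-n4.img.local.img.local.
        ssh "$n" '/usr/lpp/mmfs/bin/tsctl shownodes up' | tr ',' '\n'
    done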
