I think the idea is that you should not need to know the details of how ccr and sdrserv are implemented nor how they work. At this moment, I don't!
Literally, I just installed GPFS and defined my system with mmcrcluster and so forth and "it just works". As I wrote, just running mmlscluster or mmlsconfig or similar configuration create, list, change, delete commands should start up ccr and sdrserv under the covers.

Okay, now "I hear you" -- it ain't working for you today. Presumably it did a while ago? Let's think about that...

Troubleshooting 0,1,2 in order of suspicion...

0. Check that you can ping and ssh from each quorum node to every other quorum node. With Q quorum nodes that is Q*(Q-1) tests.

1. Check that you have plenty of free space in /var on each quorum node. Hmmm... we're not talking huge, but see if /var/mmfs/tmp is filled with junk.... Before and after clearing most of that out I had and have:

[root@bog-wifi ~]# du -shk /var/mmfs
84532 /var/mmfs

## clean all big and old files out of /var/mmfs/tmp

[root@bog-wifi ~]# du -shk /var/mmfs
9004 /var/mmfs

Because we know that /var/mmfs is where GPFS stores configuration "stuff".

2. Check that we have GPFS software correctly installed on each quorum node:

rpm -qa 'gpfs.*' | xargs rpm --verify

From: Bryan Banister <[email protected]>
To: gpfsug main discussion list <[email protected]>
Date: 07/28/2016 01:58 PM
Subject: Re: [gpfsug-discuss] CCR troubles - CCR and mmXXconfig commands fine with mmshutdown
Sent by: [email protected]

I now see that these mmccrmonitor and mmsdrserv daemons are required for the CCR operations to work. This is just not clear in the error output. Even the GPFS 4.2 Problem Determination Guide doesn’t have anything explaining the “Not enough CCR quorum nodes available” or “Unexpected error from ccr fget mmsdrfs” error messages. Thus there is no clear direction on how to fix this issue from the command output, the man pages, or the Admin Guides.

[root@fpia-gpfs-jcsdr01 ~]# man -E ascii mmccr
No manual entry for mmccr

There isn’t a help for mmccr either, but at least it does print some usage info:

[root@fpia-gpfs-jcsdr01 ~]# mmccr -h
Unknown subcommand: '-h'
Usage: mmccr subcommand common-options subcommand-options...

Subcommands:

Setup and Initialization:
[snip]

I’m still not sure how to start these mmccrmonitor and mmsdrserv daemons without starting GPFS… could you tell me how it would be possible?

Thanks for sharing details about how this all works Marc, I do appreciate your response!
-Bryan

From: [email protected] [ mailto:[email protected]] On Behalf Of Marc A Kaplan
Sent: Thursday, July 28, 2016 12:25 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] CCR troubles - CCR and mmXXconfig commands fine with mmshutdown

Based on experiments on my test cluster, I can assure you that you can list and change GPFS configuration parameters with CCR enabled while GPFS is down.

I understand you are having a problem with your cluster, but you are incorrectly disparaging the CCR.

In fact you can mmshutdown -a AND kill all GPFS related processes, including mmsdrserv and mmccrmonitor, and then issue commands like:

mmlscluster, mmlsconfig, mmchconfig

Those will work correctly and, by the way, re-start mmsdrserv and mmccrmonitor... (Use a command like `ps auxw | grep mm` to find the relevant processes.)

But that will not start the main GPFS file manager process mmfsd. GPFS "proper" remains down...

For the following commands Linux was "up" on all nodes, but GPFS was shut down.

[root@n2 gpfs-git]# mmgetstate -a

 Node number  Node name  GPFS state
------------------------------------------
      1       n2         down
      3       n4         down
      4       n5         down
      6       n3         down

However, if a majority of the quorum nodes cannot be obtained, you WILL see a sequence of messages like this, after a noticeable "timeout":
(For the following test I had three quorum nodes and did a Linux shutdown on two of them...)
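
The majority rule in that test can be checked with a little arithmetic. The following is only a sketch: it assumes "a majority of the quorum nodes" means a strict majority, floor(Q/2) + 1, and the function names majority_needed and have_quorum are made up for illustration; they are not GPFS commands.

```shell
#!/bin/sh
# Strict majority of Q quorum nodes: floor(Q/2) + 1.
# (Assumption: CCR needs a strict majority, as the message above implies.)
majority_needed() {
    echo $(( $1 / 2 + 1 ))
}

# have_quorum TOTAL REACHABLE
# Succeeds when REACHABLE nodes form a strict majority of TOTAL quorum nodes.
have_quorum() {
    [ "$2" -ge "$(majority_needed "$1")" ]
}

# The two situations in this thread, with three quorum nodes (n2, n3, n4):
# first only one node is up, then two are up after n3 is rebooted.
for reachable in 1 2; do
    if have_quorum 3 "$reachable"; then
        echo "3 quorum nodes, $reachable reachable: majority available"
    else
        echo "3 quorum nodes, $reachable reachable: no majority"
    fi
done
```

With one of three quorum nodes up there is no majority, which matches the "Not enough CCR quorum nodes available" failures; with two of three up the configuration commands work again.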

[root@n2 gpfs-git]# mmlsconfig
get file failed: Not enough CCR quorum nodes available (err 809)
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158
mmlsconfig: Command failed. Examine previous error messages to determine cause.

[root@n2 gpfs-git]# mmchconfig worker1Threads=1022
mmchconfig: Unable to obtain the GPFS configuration file lock.
mmchconfig: GPFS was unable to obtain a lock from node n2.frozen.
mmchconfig: Command failed. Examine previous error messages to determine cause.

[root@n2 gpfs-git]# mmgetstate -a
get file failed: Not enough CCR quorum nodes available (err 809)
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158
mmgetstate: Command failed. Examine previous error messages to determine cause.

HMMMM.... notice mmgetstate needs a quorum even to "know" what nodes it should check!

Then re-starting Linux... So I have two of three quorum nodes active, but GPFS still down...

## From n2, login to node n3 that I just rebooted...
[root@n2 gpfs-git]# ssh n3
Last login: Thu Jul 28 09:50:53 2016 from n2.frozen

## See if any mm processes are running? ... NOPE!
[root@n3 ~]# ps auxw | grep mm
root 3834 0.0 0.0 112640 972 pts/0 S+ 10:12 0:00 grep --color=auto mm

## Check the state... notice n4 is powered off...
[root@n3 ~]# mmgetstate -a

 Node number  Node name  GPFS state
------------------------------------------
      1       n2         down
      3       n4         unknown
      4       n5         down
      6       n3         down

## Examine the cluster configuration
[root@n3 ~]# mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         madagascar.frozen
  GPFS cluster id:           7399668614468035547
  GPFS UID domain:           madagascar.frozen
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
  Repository type:           CCR

GPFS cluster configuration servers:
-----------------------------------
  Primary server:    n2.frozen (not in use)
  Secondary server:  n4.frozen (not in use)

 Node  Daemon node name  IP address   Admin node name  Designation
-------------------------------------------------------------------
   1   n2.frozen         172.20.0.21  n2.frozen        quorum-manager-perfmon
   3   n4.frozen         172.20.0.23  n4.frozen        quorum-manager-perfmon
   4   n5.frozen         172.20.0.24  n5.frozen        perfmon
   6   n3.frozen         172.20.0.22  n3.frozen        quorum-manager-perfmon

## notice that mmccrmonitor and mmsdrserv are running but not mmfsd
[root@n3 ~]# ps auxw | grep mm
root 3882 0.0 0.0 114376 1720 pts/0 S 10:13 0:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root 3954 0.0 0.0 491244 13040 ? Ssl 10:13 0:00 /usr/lpp/mmfs/bin/mmsdrserv 1191 10 10 /var/adm/ras/mmsdrserv.log 128 yes
root 4339 0.0 0.0 114376 796 pts/0 S 10:15 0:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root 4345 0.0 0.0 112640 972 pts/0 S+ 10:16 0:00 grep --color=auto mm

## Now I can mmchconfig ... while GPFS remains down.
[root@n3 ~]# mmchconfig worker1Threads=1022
mmchconfig: Command successfully completed
mmchconfig: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.
[root@n3 ~]# Thu Jul 28 10:18:16 PDT 2016: mmcommon pushSdr_async: mmsdrfs propagation started
Thu Jul 28 10:18:21 PDT 2016: mmcommon pushSdr_async: mmsdrfs propagation completed; mmdsh rc=0

[root@n3 ~]# mmgetstate -a

 Node number  Node name  GPFS state
------------------------------------------
      1       n2         down
      3       n4         unknown
      4       n5         down
      6       n3         down

## Quorum node n4 remains unreachable... But n2 and n3 are running Linux.
[root@n3 ~]# ping -c 1 n4
PING n4.frozen (172.20.0.23) 56(84) bytes of data.
From n3.frozen (172.20.0.22) icmp_seq=1 Destination Host Unreachable

--- n4.frozen ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

[root@n3 ~]# exit
logout
Connection to n3 closed.

[root@n2 gpfs-git]# ps auwx | grep mm
root 3264 0.0 0.0 114376 812 pts/1 S 10:21 0:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root 3271 0.0 0.0 112640 980 pts/1 S+ 10:21 0:00 grep --color=auto mm
root 31820 0.0 0.0 114376 1728 pts/1 S 09:42 0:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root 32058 0.0 0.0 493264 12000 ? Ssl 09:42 0:00 /usr/lpp/mmfs/bin/mmsdrserv 1191 10 10 /var/adm/ras/mmsdrserv.log 1
root 32263 0.0 0.0 1700732 17600 ? Sl 09:42 0:00 python /usr/lpp/mmfs/bin/mmsysmon.py
[root@n2 gpfs-git]#

Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments.
This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
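
Troubleshooting step 0 in the top message (ping and ssh from each quorum node to every other quorum node, Q*(Q-1) tests) can be scripted. This is a rough sketch, not a tested tool: the node names n2, n3, n4 are taken from the test cluster in this thread and should be replaced with your own, the function names are invented, and it assumes passwordless root ssh from the node where you run it plus a Linux ping that accepts -c and -W.

```shell
#!/bin/sh
# Hypothetical node list; override by exporting QUORUM_NODES.
QUORUM_NODES="${QUORUM_NODES:-n2 n3 n4}"

# Print every ordered pair "src dst" with src != dst.
# For Q nodes this yields Q*(Q-1) lines.
ordered_pairs() {
    for src in $1; do
        for dst in $1; do
            [ "$src" = "$dst" ] || echo "$src $dst"
        done
    done
}

# check_pair SRC DST
# Log in to SRC, then from there ping DST once and open an ssh session to DST.
# stdin is redirected so ssh does not swallow the pair list in run_checks.
check_pair() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" \
        "ping -c 1 -W 2 $2 >/dev/null && ssh -o BatchMode=yes -o ConnectTimeout=5 $2 true" \
        </dev/null
}

# Run all Q*(Q-1) checks and print one result line per pair.
run_checks() {
    ordered_pairs "$QUORUM_NODES" | while read -r src dst; do
        if check_pair "$src" "$dst"; then
            echo "ok   $src -> $dst"
        else
            echo "FAIL $src -> $dst"
        fi
    done
}
```

Nothing runs until you call run_checks from a node that can ssh to every quorum node. A FAIL line only says that either the ping or the nested ssh did not succeed; run the two commands by hand on that pair to see which.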
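
The `ps auxw | grep mm` checks in the transcript (mmccrmonitor and mmsdrserv running, mmfsd not) can be turned into a small report. This is a sketch under assumptions: gpfs_daemon_report is a made-up name, and it assumes each daemon shows up in ps with a path ending in /mmccrmonitor, /mmsdrserv or /mmfsd, as the first two do in the output above.

```shell
#!/bin/sh
# Read `ps auxw`-style text on stdin and report which GPFS daemons appear.
# A daemon counts as running when some line contains "/<name>" followed by
# a space or end of line, so /var/adm/ras/mmsdrserv.log does not match.
gpfs_daemon_report() {
    ps_text=$(cat)
    for name in mmccrmonitor mmsdrserv mmfsd; do
        if printf '%s\n' "$ps_text" | grep -Eq "/$name( |\$)"; then
            echo "$name: running"
        else
            echo "$name: not running"
        fi
    done
}
```

Use it as `ps auxw | gpfs_daemon_report` on each quorum node. Seeing mmccrmonitor and mmsdrserv running while mmfsd is not corresponds to the state described in the thread, where configuration commands work although GPFS itself is down.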
