They should get started as soon as you shut down via mmshutdown. Could you check a node where the processes are NOT started and simply run mmshutdown on that node, to see whether they get started?
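To capture the before/after evidence on such a node, something like this hypothetical helper could work (it only parses `ps` output; the daemon names are the ones discussed in this thread):

```shell
#!/bin/sh
# Count the GPFS CCR helper daemons (mmccrmonitor, mmsdrserv) in a
# "ps auxw" listing. The listing is passed as an argument so the same
# check can be run against captured output.
count_ccr_daemons() {
    printf '%s\n' "$1" \
        | grep -E 'mmccrmonitor|mmsdrserv' \
        | grep -cv 'grep'
}

# Suggested use on the node in question (as root):
#   before=$(count_ccr_daemons "$(ps auxw)")
#   /usr/lpp/mmfs/bin/mmshutdown
#   after=$(count_ccr_daemons "$(ps auxw)")
#   echo "CCR helper daemons: before=$before after=$after"
```

If mmshutdown really (re)starts them, `after` should be non-zero even though mmfsd stays down.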
On Thu, Jul 28, 2016 at 10:57 AM, Bryan Banister <[email protected]> wrote:

> I now see that these mmccrmonitor and mmsdrserv daemons are required for
> the CCR operations to work. This is just not clear in the error output.
> Even the GPFS 4.2 Problem Determination Guide doesn't have anything
> explaining the "Not enough CCR quorum nodes available" or "Unexpected
> error from ccr fget mmsdrfs" error messages. Thus there is no clear
> direction on how to fix this issue from the command output, the man
> pages, or the Admin Guides.
>
> [root@fpia-gpfs-jcsdr01 ~]# man -E ascii mmccr
> No manual entry for mmccr
>
> There isn't a help option for mmccr either, but at least it does print
> some usage info:
>
> [root@fpia-gpfs-jcsdr01 ~]# mmccr -h
> Unknown subcommand: '-h'
> Usage: mmccr subcommand common-options subcommand-options...
>
> Subcommands:
>
> Setup and Initialization:
> [snip]
>
> I'm still not sure how to start these mmccrmonitor and mmsdrserv daemons
> without starting GPFS... could you tell me how that would be possible?
>
> Thanks for sharing details about how this all works, Marc, I do
> appreciate your response!
> -Bryan
>
> *From:* [email protected]
> [mailto:[email protected]] *On Behalf Of *Marc A Kaplan
> *Sent:* Thursday, July 28, 2016 12:25 PM
> *To:* gpfsug main discussion list
> *Subject:* Re: [gpfsug-discuss] CCR troubles - CCR and mmXXconfig
> commands fine with mmshutdown
>
> Based on experiments on my test cluster, I can assure you that you can
> list and change GPFS configuration parameters with CCR enabled while
> GPFS is down.
>
> I understand you are having a problem with your cluster, but you are
> incorrectly disparaging the CCR.
>
> In fact you can mmshutdown -a AND kill all GPFS-related processes,
> including mmsdrserv and mmccrmonitor, and then issue commands like
> mmlscluster, mmlsconfig, mmchconfig.
>
> Those will work correctly and, by the way, re-start mmsdrserv and
> mmccrmonitor...
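Side note on the error messages quoted above: since the Problem Determination Guide doesn't cover them, a small triage helper could map them to causes. This is a hypothetical sketch; the message strings are taken verbatim from this thread, and the category names are made up:

```shell
#!/bin/sh
# Classify the stderr of a failing mm* command, based on the error
# messages quoted in this thread. Category names are illustrative.
classify_mm_error() {
    case "$1" in
        *"Not enough CCR quorum nodes available"*)
            # err 809: a majority of quorum nodes is unreachable
            echo ccr-quorum-lost ;;
        *"Unable to obtain the GPFS configuration file lock"*)
            # another node (or a stale holder) has the sdrfs lock
            echo config-lock-failed ;;
        *"Unexpected error from ccr fget mmsdrfs"*)
            echo ccr-fget-failed ;;
        *)
            echo other ;;
    esac
}
```

Feeding it the captured stderr of mmlsconfig/mmchconfig would at least distinguish "quorum lost" from "lock contention", which the raw output does not make obvious.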
> (Use a command like `ps auxw | grep mm` to find the relevant processes.)
>
> But that will not start the main GPFS file manager process, mmfsd. GPFS
> "proper" remains down...
>
> For the following commands, Linux was "up" on all nodes, but GPFS was
> shut down.
>
> [root@n2 gpfs-git]# mmgetstate -a
>
>  Node number  Node name  GPFS state
> ------------------------------------------
>        1      n2         down
>        3      n4         down
>        4      n5         down
>        6      n3         down
>
> However, if a majority of the quorum nodes cannot be reached, you WILL
> see a sequence of messages like this, after a noticeable "timeout".
> (For the following test I had three quorum nodes and did a Linux
> shutdown on two of them...)
>
> [root@n2 gpfs-git]# mmlsconfig
> get file failed: Not enough CCR quorum nodes available (err 809)
> gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158
> mmlsconfig: Command failed. Examine previous error messages to
> determine cause.
>
> [root@n2 gpfs-git]# mmchconfig worker1Threads=1022
> mmchconfig: Unable to obtain the GPFS configuration file lock.
> mmchconfig: GPFS was unable to obtain a lock from node n2.frozen.
> mmchconfig: Command failed. Examine previous error messages to
> determine cause.
>
> [root@n2 gpfs-git]# mmgetstate -a
> get file failed: Not enough CCR quorum nodes available (err 809)
> gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158
> mmgetstate: Command failed. Examine previous error messages to
> determine cause.
>
> HMMMM.... notice that mmgetstate needs a quorum even to "know" which
> nodes it should check!
>
> Then, after re-starting Linux... I have two of three quorum nodes
> active, but GPFS still down...
>
> ## From n2, login to node n3 that I just rebooted...
> [root@n2 gpfs-git]# ssh n3
> Last login: Thu Jul 28 09:50:53 2016 from n2.frozen
>
> ## See if any mm processes are running? ... NOPE!
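The arithmetic behind that test is worth spelling out: CCR serves requests only while a strict majority of the defined quorum nodes is reachable. Assuming simple majority, i.e. floor(n/2)+1 (which matches the three-node, two-down failure above), a one-line helper:

```shell
#!/bin/sh
# Minimum number of quorum nodes that must be reachable for CCR to
# serve configuration requests (assumed: strict majority).
ccr_majority() {
    echo $(( $1 / 2 + 1 ))
}
```

With the three quorum nodes in that test cluster, `ccr_majority 3` prints 2, so shutting down two of them (leaving one) is exactly what produced err 809.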
>
> [root@n3 ~]# ps auxw | grep mm
> root      3834  0.0  0.0 112640   972 pts/0    S+   10:12   0:00 grep --color=auto mm
>
> ## Check the state... notice n4 is powered off...
> [root@n3 ~]# mmgetstate -a
>
>  Node number  Node name  GPFS state
> ------------------------------------------
>        1      n2         down
>        3      n4         unknown
>        4      n5         down
>        6      n3         down
>
> ## Examine the cluster configuration
> [root@n3 ~]# mmlscluster
>
> GPFS cluster information
> ========================
>   GPFS cluster name:         madagascar.frozen
>   GPFS cluster id:           7399668614468035547
>   GPFS UID domain:           madagascar.frozen
>   Remote shell command:      /usr/bin/ssh
>   Remote file copy command:  /usr/bin/scp
>   Repository type:           CCR
>
> GPFS cluster configuration servers:
> -----------------------------------
>   Primary server:    n2.frozen (not in use)
>   Secondary server:  n4.frozen (not in use)
>
>  Node  Daemon node name  IP address   Admin node name  Designation
> -------------------------------------------------------------------
>    1   n2.frozen         172.20.0.21  n2.frozen        quorum-manager-perfmon
>    3   n4.frozen         172.20.0.23  n4.frozen        quorum-manager-perfmon
>    4   n5.frozen         172.20.0.24  n5.frozen        perfmon
>    6   n3.frozen         172.20.0.22  n3.frozen        quorum-manager-perfmon
>
> ## Notice that mmccrmonitor and mmsdrserv are running, but not mmfsd
> [root@n3 ~]# ps auxw | grep mm
> root      3882  0.0  0.0 114376  1720 pts/0    S    10:13   0:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
> root      3954  0.0  0.0 491244 13040 ?        Ssl  10:13   0:00 /usr/lpp/mmfs/bin/mmsdrserv 1191 10 10 /var/adm/ras/mmsdrserv.log 128 yes
> root      4339  0.0  0.0 114376   796 pts/0    S    10:15   0:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
> root      4345  0.0  0.0 112640   972 pts/0    S+   10:16   0:00 grep --color=auto mm
>
> ## Now I can mmchconfig ... while GPFS remains down.
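For anyone scripting around transcripts like the one above, here is a sketch that tallies node states from captured `mmgetstate -a` output. The only assumption is the three-column data-row layout shown in the transcript:

```shell
#!/bin/sh
# Tally GPFS node states from a captured "mmgetstate -a" listing.
# Data rows have the form: <node number> <node name> <state>.
state_counts() {
    printf '%s\n' "$1" \
        | awk 'NF == 3 && $1 ~ /^[0-9]+$/ { count[$3]++ }
               END { for (s in count) print s, count[s] }' \
        | sort
}
```

On the listing above this would report three nodes down and one unknown (the powered-off n4), which is a quick way to spot whether enough quorum nodes are even reachable.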
>
> [root@n3 ~]# mmchconfig worker1Threads=1022
> mmchconfig: Command successfully completed
> mmchconfig: Propagating the cluster configuration data to all
>   affected nodes. This is an asynchronous process.
> [root@n3 ~]# Thu Jul 28 10:18:16 PDT 2016: mmcommon pushSdr_async:
> mmsdrfs propagation started
> Thu Jul 28 10:18:21 PDT 2016: mmcommon pushSdr_async: mmsdrfs
> propagation completed; mmdsh rc=0
>
> [root@n3 ~]# mmgetstate -a
>
>  Node number  Node name  GPFS state
> ------------------------------------------
>        1      n2         down
>        3      n4         unknown
>        4      n5         down
>        6      n3         down
>
> ## Quorum node n4 remains unreachable... but n2 and n3 are running Linux.
> [root@n3 ~]# ping -c 1 n4
> PING n4.frozen (172.20.0.23) 56(84) bytes of data.
> From n3.frozen (172.20.0.22) icmp_seq=1 Destination Host Unreachable
>
> --- n4.frozen ping statistics ---
> 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
>
> [root@n3 ~]# exit
> logout
> Connection to n3 closed.
>
> [root@n2 gpfs-git]# ps auwx | grep mm
> root      3264  0.0  0.0 114376   812 pts/1    S    10:21   0:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
> root      3271  0.0  0.0 112640   980 pts/1    S+   10:21   0:00 grep --color=auto mm
> root     31820  0.0  0.0 114376  1728 pts/1    S    09:42   0:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
> root     32058  0.0  0.0 493264 12000 ?        Ssl  09:42   0:00 /usr/lpp/mmfs/bin/mmsdrserv 1191 10 10 /var/adm/ras/mmsdrserv.log 1
> root     32263  0.0  0.0 1700732 17600 ?       Sl   09:42   0:00 python /usr/lpp/mmfs/bin/mmsysmon.py
> [root@n2 gpfs-git]#
>
> ------------------------------
>
> Note: This email is for the confidential use of the named addressee(s)
> only and may contain proprietary, confidential or privileged information.
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
