Hi Marc,

So this issue is actually caused by our Systemd setup.  We have fully converted 
over to Systemd to manage the dependency chain needed for GPFS to start 
properly, and our scheduling system after that.  The issue is that when we 
shut down GPFS with Systemd, the mmsdrserv and mmccrmonitor processes are 
also killed/term'd, probably because they are started in the same cgroup as 
GPFS, and Systemd kills all processes in that cgroup when GPFS is stopped.

Not sure how to proceed with safeguarding these daemons from Systemd... and 
real Systemd support in GPFS is basically non-existent at this point.
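One possible approach (a sketch only, assuming GPFS is wrapped in a local custom gpfs.service unit; the unit name and Exec paths here are assumptions, not our actual site config) is to change KillMode so that stopping the unit does not kill every process in its cgroup:

```ini
# Hypothetical /etc/systemd/system/gpfs.service -- illustrative sketch only.
[Service]
Type=forking
ExecStart=/usr/lpp/mmfs/bin/mmstartup
ExecStop=/usr/lpp/mmfs/bin/mmshutdown
# The default KillMode=control-group kills every process in the unit's
# cgroup on stop, which would take mmsdrserv and mmccrmonitor down too.
# KillMode=process signals only the main process, leaving helpers alone.
KillMode=process
```

Whether leaving stray processes behind on stop is acceptable is a judgment call; KillMode=process is explicitly discouraged by the systemd documentation except for cases like this where helper daemons must outlive the unit.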

So my problem is actually a Systemd problem, not a CCR problem!
-Bryan

From: [email protected] 
[mailto:[email protected]] On Behalf Of Bryan Banister
Sent: Thursday, July 28, 2016 12:58 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] CCR troubles - CCR and mmXXconfig commands fine 
with mmshutdown

I now see that these mmccrmonitor and mmsdrserv daemons are required for the 
CCR operations to work.  This is just not clear in the error output.  Even the 
GPFS 4.2 Problem Determination Guide doesn't have anything explaining the "Not 
enough CCR quorum nodes available" or "Unexpected error from ccr fget mmsdrfs" 
error messages.  Thus there is no clear direction on how to fix this issue in 
the command output, the man pages, or the Admin Guides.

[root@fpia-gpfs-jcsdr01 ~]# man -E ascii mmccr
No manual entry for mmccr

There isn't a help for mmccr either, but at least it does print some usage info:

[root@fpia-gpfs-jcsdr01 ~]# mmccr -h
Unknown subcommand: '-h'Usage: mmccr subcommand common-options 
subcommand-options...

Subcommands:

Setup and Initialization:
[snip]

I'm still not sure how to start these mmccrmonitor and mmsdrserv daemons 
without starting GPFS... could you tell me how it would be possible?

Thanks for sharing details about how this all works Marc, I do appreciate your 
response!
-Bryan

From: [email protected] 
[mailto:[email protected]] On Behalf Of Marc A Kaplan
Sent: Thursday, July 28, 2016 12:25 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] CCR troubles - CCR and mmXXconfig commands fine 
with mmshutdown

Based on experiments on my test cluster, I can assure you that you can list and 
change GPFS configuration parameters with CCR enabled while GPFS is down.

I understand you are having a problem with your cluster, but you are 
incorrectly disparaging the CCR.

In fact you can mmshutdown -a AND kill all GPFS related processes, including 
mmsdrserv and mmccrmonitor, and then issue commands like:

mmlscluster, mmlsconfig, mmchconfig

Those will work correctly and, by the way, re-start mmsdrserv and mmccrmonitor...
(Use a command like `ps auxw | grep mm` to find the relevant processes.)

But that will not start the main GPFS file manager process mmfsd.  GPFS 
"proper" remains down...

For the following commands Linux was "up" on all nodes, but GPFS was shutdown.
[root@n2 gpfs-git]# mmgetstate -a

 Node number  Node name        GPFS state
------------------------------------------
       1      n2               down
       3      n4               down
       4      n5               down
       6      n3               down

However, if a majority of the quorum nodes cannot be reached, you WILL see a 
sequence of messages like this, after a noticeable "timeout":
(For the following test I had three quorum nodes and did a Linux shutdown on 
two of them...)

[root@n2 gpfs-git]# mmlsconfig
get file failed: Not enough CCR quorum nodes available (err 809)
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
mmlsconfig: Command failed. Examine previous error messages to determine cause.

[root@n2 gpfs-git]# mmchconfig worker1Threads=1022
mmchconfig: Unable to obtain the GPFS configuration file lock.
mmchconfig: GPFS was unable to obtain a lock from node n2.frozen.
mmchconfig: Command failed. Examine previous error messages to determine cause.

[root@n2 gpfs-git]# mmgetstate -a
get file failed: Not enough CCR quorum nodes available (err 809)
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
mmgetstate: Command failed. Examine previous error messages to determine cause.

HMMMM.... notice mmgetstate needs a quorum even to "know" what nodes it should 
check!
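The majority arithmetic behind these failures can be sketched in a couple of lines of shell (a generic illustration of the quorum rule, not a GPFS command):

```shell
# Illustrative only: the majority needed among Q quorum nodes is floor(Q/2) + 1.
quorum_needed() { echo $(( $1 / 2 + 1 )); }

quorum_needed 3   # prints 2: with two of three quorum nodes down, 1 < 2
quorum_needed 5   # prints 3
```

With three quorum nodes and two of them shut down, the one survivor falls below the required two, which is why every CCR read and write above errors out with err 809.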

Then re-starting Linux... So I have two of three quorum nodes active, but GPFS 
still down...

##  From n2, login to node n3 that I just rebooted...
[root@n2 gpfs-git]# ssh n3
Last login: Thu Jul 28 09:50:53 2016 from n2.frozen

## See if any mm processes are running? ... NOPE!

[root@n3 ~]# ps auxw | grep mm
ps auxw | grep mm
root      3834  0.0  0.0 112640   972 pts/0    S+   10:12   0:00 grep 
--color=auto mm

## Check the state...  notice n4 is powered off...
[root@n3 ~]# mmgetstate -a
mmgetstate -a

 Node number  Node name        GPFS state
------------------------------------------
       1      n2               down
       3      n4               unknown
       4      n5               down
       6      n3               down

## Examine the cluster configuration
[root@n3 ~]# mmlscluster
mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         madagascar.frozen
  GPFS cluster id:           7399668614468035547
  GPFS UID domain:           madagascar.frozen
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp
  Repository type:           CCR

GPFS cluster configuration servers:
-----------------------------------
  Primary server:    n2.frozen (not in use)
  Secondary server:  n4.frozen (not in use)

 Node  Daemon node name  IP address   Admin node name  Designation
-------------------------------------------------------------------
   1   n2.frozen         172.20.0.21  n2.frozen        quorum-manager-perfmon
   3   n4.frozen         172.20.0.23  n4.frozen        quorum-manager-perfmon
   4   n5.frozen         172.20.0.24  n5.frozen        perfmon
   6   n3.frozen         172.20.0.22  n3.frozen        quorum-manager-perfmon

## notice that mmccrmonitor and mmsdrserv are running but not mmfsd

[root@n3 ~]# ps auxw | grep mm
ps auxw | grep mm
root      3882  0.0  0.0 114376  1720 pts/0    S    10:13   0:00 
/usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root      3954  0.0  0.0 491244 13040 ?        Ssl  10:13   0:00 
/usr/lpp/mmfs/bin/mmsdrserv 1191 10 10 /var/adm/ras/mmsdrserv.log 128 yes
root      4339  0.0  0.0 114376   796 pts/0    S    10:15   0:00 
/usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root      4345  0.0  0.0 112640   972 pts/0    S+   10:16   0:00 grep 
--color=auto mm

## Now I can mmchconfig ... while GPFS remains down.

[root@n3 ~]# mmchconfig worker1Threads=1022
mmchconfig worker1Threads=1022
mmchconfig: Command successfully completed
mmchconfig: Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.
[root@n3 ~]# Thu Jul 28 10:18:16 PDT 2016: mmcommon pushSdr_async: mmsdrfs 
propagation started
Thu Jul 28 10:18:21 PDT 2016: mmcommon pushSdr_async: mmsdrfs propagation 
completed; mmdsh rc=0

[root@n3 ~]# mmgetstate -a
mmgetstate -a

 Node number  Node name        GPFS state
------------------------------------------
       1      n2               down
       3      n4               unknown
       4      n5               down
       6      n3               down

## Quorum node n4 remains unreachable...  But n2 and n3 are running Linux.
[root@n3 ~]# ping -c 1 n4
ping -c 1 n4
PING n4.frozen (172.20.0.23) 56(84) bytes of data.
From n3.frozen (172.20.0.22) icmp_seq=1 Destination Host Unreachable

--- n4.frozen ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

[root@n3 ~]# exit
exit
logout
Connection to n3 closed.
[root@n2 gpfs-git]# ps auwx | grep mm
root      3264  0.0  0.0 114376   812 pts/1    S    10:21   0:00 
/usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root      3271  0.0  0.0 112640   980 pts/1    S+   10:21   0:00 grep 
--color=auto mm
root     31820  0.0  0.0 114376  1728 pts/1    S    09:42   0:00 
/usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root     32058  0.0  0.0 493264 12000 ?        Ssl  09:42   0:00 
/usr/lpp/mmfs/bin/mmsdrserv 1191 10 10 /var/adm/ras/mmsdrserv.log 1
root     32263  0.0  0.0 1700732 17600 ?       Sl   09:42   0:00 python 
/usr/lpp/mmfs/bin/mmsysmon.py
[root@n2 gpfs-git]#

________________________________

Note: This email is for the confidential use of the named addressee(s) only and 
may contain proprietary, confidential or privileged information. If you are not 
the intended recipient, you are hereby notified that any review, dissemination 
or copying of this email is strictly prohibited, and to please notify the 
sender immediately and destroy this email and any attachments. Email 
transmission cannot be guaranteed to be secure or error-free. The Company, 
therefore, does not make any guarantees as to the completeness or accuracy of 
this email or any attachments. This email is for informational purposes only 
and does not constitute a recommendation, offer, request or solicitation of any 
kind to buy, sell, subscribe, redeem or perform any type of transaction of a 
financial product.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
