Hi there,
I've seen this on several 'stock'? 'core'? GPFS systems (we need a
better term now GSS is out): ping 'working', but alongside
ejections from the cluster.
The GPFS internode 'ping' is somewhat more circumspect than unix ping -
and rightly so.
In my experience this has _always_ been a network issue of one sort or
another. If the network is experiencing issues, nodes will be ejected.
Of course it could be an unresponsive mmfsd or a high load average, but
I've seen that only twice in 10 years over many versions of GPFS.
You need to follow the logs through from each machine in time order to
determine who could not see who and in what order.
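
As a rough illustration of following the logs in time order, here is a
minimal Python sketch that interleaves log lines from several nodes by
timestamp. It assumes the timestamp layout seen in the excerpts quoted
below, and that the node clocks are synchronised closely enough for the
ordering to mean something. The names parse_line and merge_logs are
hypothetical helpers, not part of GPFS.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "Tue Aug 19 11:03:04.980 2014: message" and also
# "Tue Aug 19 11:03:10 BST 2014: message" (no fraction, with time zone).
STAMP = re.compile(
    r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d)(\.\d+)?(?: [A-Z]{2,5})? (\d{4}): (.*)$"
)

def parse_line(line):
    """Return (timestamp, message), or None if the line has no timestamp."""
    m = STAMP.match(line.strip())
    if m is None:
        return None
    base, frac, year, msg = m.groups()
    ts = datetime.strptime(f"{base} {year}", "%a %b %d %H:%M:%S %Y")
    if frac:
        # Convert ".980" to microseconds without going through a float.
        ts += timedelta(microseconds=int(frac[1:].ljust(6, "0")[:6]))
    return ts, msg

def merge_logs(logs):
    """logs maps node name -> list of log lines; returns events in time order."""
    events = []
    for node, lines in logs.items():
        for line in lines:
            parsed = parse_line(line)
            if parsed is None:
                # Treat an unstamped line as a wrapped continuation.
                if events and events[-1][1] == node:
                    events[-1][2] += " " + line.strip()
                continue
            ts, msg = parsed
            events.append([ts, node, msg])
    events.sort(key=lambda e: e[0])  # stable: ties keep input order
    return [tuple(e) for e in events]

if __name__ == "__main__":
    sample = {
        "gss02b": ["Tue Aug 19 11:03:17.070 2014: Killing connection from",
                   "<EBI5-220 IP> because the group is not ready for it to",
                   "rejoin, err 46"],
        "ebi5-220": ["Tue Aug 19 11:03:04.980 2014: Timed out waiting for a",
                     "reply from node <GSS02B IP> gss02b"],
    }
    for ts, node, msg in merge_logs(sample):
        print(ts.isoformat(timespec="milliseconds"), node, msg)
```

Reading the merged output top to bottom shows which node first
complained about which other node, which is usually the place to start
looking.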
Your best way forward is to log a SEV2 case with IBM support, directly
or via your OEM, and to collect and supply a snap and traces as required
by support.
Without knowing your full setup, it's hard to help further.
Jez
On 20/08/14 08:57, Salvatore Di Nardo wrote:
Still having problems. Here are some more detailed examples:
*EXAMPLE 1:*
*EBI5-220**( CLIENT)**
*Tue Aug 19 11:03:04.980 2014: *Timed out waiting for a
reply from node <GSS02B IP> gss02b*
Tue Aug 19 11:03:04.981 2014: Request sent to <GSS02A IP>
(gss02a in GSS.ebi.ac.uk) to expel <GSS02B IP> (gss02b in
GSS.ebi.ac.uk) from cluster GSS.ebi.ac.uk
Tue Aug 19 11:03:04.982 2014: This node will be expelled
from cluster GSS.ebi.ac.uk due to expel msg from <EBI5-220
IP> (ebi5-220)
Tue Aug 19 11:03:09.319 2014: Cluster Manager connection
broke. Probing cluster GSS.ebi.ac.uk
Tue Aug 19 11:03:10.321 2014: Unable to contact any quorum
nodes during cluster probe.
Tue Aug 19 11:03:10.322 2014: Lost membership in cluster
GSS.ebi.ac.uk. Unmounting file systems.
Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount
invoked. File system: gpfs1 Reason: SGPanic
Tue Aug 19 11:03:12.066 2014: Connecting to <GSS02A IP>
gss02a <c1p687>
Tue Aug 19 11:03:12.070 2014: Connected to <GSS02A IP>
gss02a <c1p687>
Tue Aug 19 11:03:17.071 2014: Connecting to <GSS02B IP>
gss02b <c1p686>
Tue Aug 19 11:03:17.072 2014: Connecting to <GSS03B IP>
gss03b <c1p685>
Tue Aug 19 11:03:17.079 2014: Connecting to <GSS03A IP>
gss03a <c1p684>
Tue Aug 19 11:03:17.080 2014: Connecting to <GSS01B IP>
gss01b <c1p683>
Tue Aug 19 11:03:17.079 2014: Connecting to <GSS01A IP>
gss01a <c1p1>
Tue Aug 19 11:04:23.105 2014: Connected to <GSS02B IP>
gss02b <c1p686>
Tue Aug 19 11:04:23.107 2014: Connected to <GSS03B IP>
gss03b <c1p685>
Tue Aug 19 11:04:23.112 2014: Connected to <GSS03A IP>
gss03a <c1p684>
Tue Aug 19 11:04:23.115 2014: Connected to <GSS01B IP>
gss01b <c1p683>
Tue Aug 19 11:04:23.121 2014: Connected to <GSS01A IP>
gss01a <c1p1>
Tue Aug 19 11:12:28.992 2014: Node <GSS02A IP> (gss02a in
GSS.ebi.ac.uk) is now the Group Leader.
*GSS02B ( NSD SERVER)*
...
Tue Aug 19 11:03:17.070 2014: Killing connection from
*<EBI5-220 IP>* because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:03:25.016 2014: Killing connection from
<EBI5-102 IP> because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:03:28.080 2014: Killing connection from
*<EBI5-220 IP>* because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:03:36.019 2014: Killing connection from
<EBI5-102 IP> because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:03:39.083 2014: Killing connection from
*<EBI5-220 IP>* because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:03:47.023 2014: Killing connection from
<EBI5-102 IP> because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:03:50.088 2014: Killing connection from
*<EBI5-220 IP>* because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:03:52.218 2014: Killing connection from
<EBI5-043 IP> because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:03:58.030 2014: Killing connection from
<EBI5-102 IP> because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:04:01.092 2014: Killing connection from
*<EBI5-220 IP>* because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:04:03.220 2014: Killing connection from
<EBI5-043 IP> because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:04:09.034 2014: Killing connection from
<EBI5-102 IP> because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:04:12.096 2014: Killing connection from
*<EBI5-220 IP>* because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:04:14.224 2014: Killing connection from
<EBI5-043 IP> because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:04:20.037 2014: Killing connection from
<EBI5-102 IP> because the group is not ready for it to
rejoin, err 46
Tue Aug 19 11:04:23.103 2014: Accepted and connected to
*<EBI5-220 IP>* ebi5-220 <c0n618>
...
*GSS02a ( NSD SERVER)*
Tue Aug 19 11:03:04.980 2014: Expel <GSS02B IP> (gss02b)
request from <EBI5-220 IP> (ebi5-220 in
ebi-cluster.ebi.ac.uk). Expelling: <EBI5-220 IP> (ebi5-220
in ebi-cluster.ebi.ac.uk)
Tue Aug 19 11:03:12.069 2014: Accepted and connected to
<EBI5-220 IP> ebi5-220 <c0n618>
===============================================
*EXAMPLE 2*:
*EBI5-038*
Tue Aug 19 11:32:34.227 2014: *Disk lease period expired
in cluster GSS.ebi.ac.uk. Attempting to reacquire lease.*
Tue Aug 19 11:33:34.258 2014: *Lease is overdue. Probing
cluster GSS.ebi.ac.uk*
Tue Aug 19 11:35:24.265 2014: Close connection to <GSS02A
IP> gss02a <c1n2> (Connection reset by peer). Attempting
reconnect.
Tue Aug 19 11:35:24.865 2014: Close connection to
<EBI5-014 IP> ebi5-014 <c1n457> (Connection reset by
peer). Attempting reconnect.
...
LOT MORE RESETS BY PEER
...
Tue Aug 19 11:35:25.096 2014: Close connection to
<EBI5-167 IP> ebi5-167 <c1n155> (Connection reset by
peer). Attempting reconnect.
Tue Aug 19 11:35:25.267 2014: Connecting to <GSS02A IP>
gss02a <c1n2>
Tue Aug 19 11:35:25.268 2014: Close connection to <GSS02A
IP> gss02a <c1n2> (Connection failed because destination
is still processing previous node failure)
Tue Aug 19 11:35:26.267 2014: Retry connection to <GSS02A
IP> gss02a <c1n2>
Tue Aug 19 11:35:26.268 2014: Close connection to <GSS02A
IP> gss02a <c1n2> (Connection failed because destination
is still processing previous node failure)
Tue Aug 19 11:36:24.276 2014: Unable to contact any quorum
nodes during cluster probe.
Tue Aug 19 11:36:24.277 2014: *Lost membership in cluster
GSS.ebi.ac.uk. Unmounting file systems.*
*GSS02a*
Tue Aug 19 11:35:24.263 2014: Node <EBI5-038 IP> (ebi5-038
in ebi-cluster.ebi.ac.uk) *is being expelled because of an
expired lease.* Pings sent: 60. Replies received: 60.
In example 1 it seems that an NSD server was not replying to the client,
but the servers seem to be working fine. How can I trace the problem
better (in order to solve it)?
In example 2 it seems to me that for some reason the manager is not
renewing the lease in time. When this happens, it is not a single
client: loads of them fail to get the lease renewed. Why is this
happening? How can I trace it to the source of the problem?
Thanks in advance for any tips.
Regards,
Salvatore
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss