We’ve recently been seeing quite a few node expels with messages of the form:

2019-01-17_11:19:30.882+0000: [W] The TCP connection to IP address 10.20.0.58 
proto-pg-pf01.bear.cluster <c0n236> (socket 153) state is unexpected: state=1 
ca_state=4 snd_cwnd=1 snd_ssthresh=5 unacked=5 probes=0 backoff=7 retransmits=7 
rto=26496000 rcv_ssthresh=102828 rtt=6729 rttvar=12066 sacked=0 retrans=1 
reordering=3 lost=5
2019-01-17_11:19:30.882+0000: [I] tscCheckTcpConn: Sending debug data 
collection request to node 10.20.0.58 proto-pg-pf01.bear.cluster
2019-01-17_11:19:30.882+0000: Sending request to collect TCP debug data to 
proto-pg-pf01.bear.cluster localNode
2019-01-17_11:19:30.882+0000: [I] Calling user exit script 
gpfsSendRequestToNodes: event sendRequestToNodes, Async command 
/usr/lpp/mmfs/bin/mmcommon.
2019-01-17_11:24:52.611+0000: [E] Timed out in 300 seconds waiting for a 
commMsgCheckMessages reply from node 10.20.0.58 proto-pg-pf01.bear.cluster. 
Sending expel message.
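
To make the counters in the warning above easier to read, here is a minimal sketch that pulls out the key=value pairs and decodes a few of them. It assumes (this is not confirmed by the log or by GPFS documentation) that the fields mirror Linux's struct tcp_info, so that state and ca_state use the kernel's enum values and rto and rtt are in microseconds. The names parse_tcp_warning and describe are made up for illustration.

```python
import re

# Assumption: values follow the Linux kernel's TCP state enum.
TCP_STATES = {
    1: "ESTABLISHED", 2: "SYN_SENT", 3: "SYN_RECV", 4: "FIN_WAIT1",
    5: "FIN_WAIT2", 6: "TIME_WAIT", 7: "CLOSE", 8: "CLOSE_WAIT",
    9: "LAST_ACK", 10: "LISTEN", 11: "CLOSING",
}

# Assumption: values follow the Linux kernel's congestion-avoidance states.
CA_STATES = {0: "Open", 1: "Disorder", 2: "CWR", 3: "Recovery", 4: "Loss"}

FIELD_RE = re.compile(r"(\w+)=(\d+)")


def parse_tcp_warning(message):
    """Return every name=number pair in the message as a dict of ints."""
    return {name: int(value) for name, value in FIELD_RE.findall(message)}


def describe(fields):
    """Decode a few fields, assuming tcp_info semantics (microsecond timers)."""
    return {
        "state": TCP_STATES.get(fields["state"], "unknown"),
        "ca_state": CA_STATES.get(fields["ca_state"], "unknown"),
        "rto_seconds": fields["rto"] / 1_000_000,
        "rtt_ms": fields["rtt"] / 1000,
        "retransmits": fields["retransmits"],
    }


# The warning from the log above, joined back onto one line.
message = (
    "The TCP connection to IP address 10.20.0.58 proto-pg-pf01.bear.cluster "
    "<c0n236> (socket 153) state is unexpected: state=1 ca_state=4 snd_cwnd=1 "
    "snd_ssthresh=5 unacked=5 probes=0 backoff=7 retransmits=7 rto=26496000 "
    "rcv_ssthresh=102828 rtt=6729 rttvar=12066 sacked=0 retrans=1 "
    "reordering=3 lost=5"
)

fields = parse_tcp_warning(message)
summary = describe(fields)
for name, value in summary.items():
    print(f"{name}: {value}")
```

If those assumptions hold, the socket is still established but in the loss state, with seven retransmits and a retransmission timeout of roughly 26 seconds, which would point at packets being dropped somewhere rather than at the connection having closed.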

On the client node, we see messages of the form:

2019-01-17_11:19:31.101+0000: [N] sdrServ: Received Tcp data collection request 
from 10.10.0.33
2019-01-17_11:19:31.102+0000: [N] GPFS will attempt to collect Tcp debug data 
on this node.
2019-01-17_11:24:52.838+0000: [N] sdrServ: Received expel data collection 
request from 10.10.0.33
2019-01-17_11:24:52.838+0000: [N] GPFS will attempt to collect debug data on 
this node.
2019-01-17_11:25:02.741+0000: [N] This node will be expelled from cluster 
rds.gpfs.servers due to expel msg from 10.10.12.41 (b
ber-les-nsd01-data.bb2.cluster in rds.gpfs.server
2019-01-17_11:25:03.160+0000: [N] sdrServ: Received expel data collection 
request from 10.20.0.56

The expels always appear to involve one specific type of hardware with the same 
Ethernet controller, though the affected nodes are split across three data 
centres and we aren’t seeing congestion on the links between them.

The node listed above isn’t actually doing anything either, as its software is 
still being installed (i.e. it’s doing no GPFS or other IO beyond a couple of 
home directories).

Any suggestions on what “(socket 153) state is unexpected” means?

Thanks

Simon


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
