This seems like a very good catch, and the different symptoms can well explain the behaviour you saw.
Whatever the reason the iptables restart was issued, it appears to have caused a large number of expels, which sounds logical, since the lease renewal requests were likely being blocked by the firewall at those points in time. The expels can be seen in the cluster manager's mmfs.log.
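If you want to see how much headroom there is before a blocked lease renewal turns into an expel, a quick check (a sketch; mmlsconfig accepts attribute names, and these two parameters exist in current releases, but verify against your level's documentation):

    # Show the lease-related timeouts in effect; a firewall blocking GPFS traffic
    # for longer than the failure detection time will trigger expels
    mmlsconfig failureDetectionTime leaseRecoveryWait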
Now, if a node gets expelled, its share of the block and inode allocation maps needs to be recovered by the filesystem managers of the filesystems that were mounted on that node.
In your case, because the -n (NumNodes) parameter for the filesystem in question is high (2k) and the number of filesets in that filesystem is also high (>1200), the inode allocation map grows quite big, as it needs to hold a region for every fileset per potentially mounting node (so that a region can be handed to each node mounting the FS, allowing that node to do allocations on its own).
This inode allocation map file needs to be scanned by the inode alloc manager thread whenever a node leaves the cluster or the fs manager is moved, in order to recover that node's share and to recalculate the number of used inodes. During this reinitialization phase some operations are blocked (mmlsfileset, for example), which can lead to an almost hanging cluster.
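For reference, both factors can be checked quickly on a live filesystem (a sketch; replace gpfs01 with your device name, and note the number of header lines in mmlsfileset output may vary):

    # Estimated number of nodes the filesystem was created for (-n)
    mmlsfs gpfs01 -n

    # Count the filesets in the filesystem (skipping the two header lines)
    mmlsfileset gpfs01 | tail -n +3 | wc -l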
There are two APARs open on this. One is already fixed (IJ11105); watch for the changelog entry "Fixed slow inode expansion on large cluster". The second is still in the testing phase but coming soon.
(Its changelog entry should read something like: "Speed up inode alloc manager initialization".)
Preventing the expels will naturally prevent those phases ;)
Mit freundlichen Grüßen / Kind regards
Achim Rehor
Software Technical Support Specialist AIX / EMEA HPC Support
IBM Certified Advanced Technical Expert - Power Systems with AIX
TSCC Software Service, Dept. 7922, Global Technology Services
Phone: +49-7034-274-7862
E-Mail: [email protected]
IBM Deutschland, Am Weiher 24, 65451 Kelsterbach, Germany
From: Simon Thompson <[email protected]>
To: gpfsug main discussion list <[email protected]>, "Tomer Perry" <[email protected]>
Cc: Yong Ze Chen <[email protected]>
Date: 22/01/2019 15:35
Subject: Re: [gpfsug-discuss] Node expels
Sent by: [email protected]
OK, we think we might have a reason for this.
We run iptables on some of our management function nodes, and we found that in some cases our config management tool could cause a ‘systemctl restart iptables’ (the rule ordering generation was non-deterministic, meaning it could shuffle rules … we fixed that and made it reload rather than restart). The restart takes only a fraction of a second, but it appears that this is sufficient to get GPFS into a bad state. What I didn’t mention before was that we could get it into a state where the only way to recover was to shut down the storage cluster and restart it.
I’m not sure why the normal expel and recovery process doesn’t appear to work in this case, and we’re not 100% certain that it’s the iptables restart (we just have a smoking gun at present; I have a ticket open with that question).
Maybe it’s also a combination with having a default DROP policy on iptables - we have since switched to a default ACCEPT policy and added a DROP rule at the end of the ruleset, which gives the same filtering result.
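For anyone wanting to try the same, a minimal sketch of that style of ruleset (the subnet below is illustrative; TCP port 1191 is the default GPFS daemon port):

    # Keep established connections alive so a rule reload does not cut GPFS sockets
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    # Allow new GPFS daemon connections from the cluster network (example subnet)
    iptables -A INPUT -p tcp -s 10.20.0.0/16 --dport 1191 -j ACCEPT
    # Explicit DROP at the end of the chain instead of a default DROP policy
    iptables -A INPUT -j DROP

Combined with ‘systemctl reload iptables’ (where the service supports reload) instead of restart, the rules get re-applied without a window in which everything is dropped.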
Simon
From: <[email protected]> on behalf of "[email protected]" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, 17 January 2019 at 14:31
To: Tomer Perry <[email protected]>, "[email protected]" <[email protected]>
Cc: Yong Ze Chen <[email protected]>
Subject: Re: [gpfsug-discuss] Node expels
>They always appear to be to a specific type of hardware with the same Ethernet controller,
That makes me think you might be seeing packet loss that could require ring buffer tuning (the defaults and limits differ between Ethernet adapters).
The expel section in the slides on this page has been expanded to include a 'debugging expels' section (slides 19-20, which also reference ring buffer tuning):
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/DEBUG%20Expels/comment/7e4f9433-7ca3-430f-b40b-94777c507381
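As a rough sketch of the kind of check and change those slides describe (eth0 is a placeholder; the supported maximums depend on the adapter, and the change does not persist across reboots unless added to the interface configuration):

    # Show current and maximum RX/TX ring buffer sizes
    ethtool -g eth0
    # Raise the ring buffers towards the hardware maximums (example values)
    ethtool -G eth0 rx 4096 tx 4096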
Regards,
John Lewars
Spectrum Scale Performance, IBM Poughkeepsie
From: Tomer Perry/Israel/IBM
To: gpfsug main discussion list <[email protected]>
Cc: John Lewars/Poughkeepsie/IBM@IBMUS, Yong Ze Chen/China/IBM@IBMCN
Date: 01/17/2019 08:28 AM
Subject: Re: [gpfsug-discuss] Node expels
Hi,
I was asked to elaborate a bit (thus also adding John and Yong Ze Chen).
As written on the slide:
One of the best ways to determine whether a network layer problem is the root cause of an expel is to look at the low-level socket details dumped in the ‘extra’ log data (mmfsadm dump all) saved as part of automatic data collection on Linux GPFS nodes.
So, the idea is that in an expel situation, we dump the socket state from the OS (you can see the same using 'ss -i', for example).
In your example, it shows that ca_state is 4 and there are retransmits and a high rto; all of these point to a network problem.
You can find more details here: http://www.yonch.com/tech/linux-tcp-congestion-control-internals
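As a concrete sketch of that check (the peer address is taken from the log excerpt below; the exact fields shown by ss vary with kernel version):

    # Show TCP internals (cwnd, rto, retransmit counters, ...) for the suspect peer
    ss -tin dst 10.20.0.58

A healthy connection normally shows no retransmit counters and an rto around a few hundred milliseconds; a ca_state of 4 corresponds to TCP_CA_Loss in the kernel's congestion control state machine, which matches the retransmits and the heavily backed-off rto in the mmfs.log message.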
Regards,
Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: [email protected]
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel: +1 720 3422758
Israel Tel: +972 3 9188625
Mobile: +972 52 2554625
From: "Tomer Perry" <[email protected]>
To: gpfsug main discussion list <[email protected]>
Date: 17/01/2019 13:46
Subject: Re: [gpfsug-discuss] Node expels
Sent by: [email protected]
Simon,
Take a look at http://files.gpfsug.org/presentations/2018/USA/Scale_Network_Flow-0.8.pdf, slide 13.
Regards,
Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: [email protected]
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel: +1 720 3422758
Israel Tel: +972 3 9188625
Mobile: +972 52 2554625
From: Simon Thompson <[email protected]>
To: "[email protected]" <[email protected]>
Date: 17/01/2019 13:35
Subject: [gpfsug-discuss] Node expels
Sent by: [email protected]
We’ve recently been seeing quite a few node expels with messages of the form:
2019-01-17_11:19:30.882+0000: [W] The TCP connection to IP address 10.20.0.58 proto-pg-pf01.bear.cluster <c0n236> (socket 153) state is unexpected: state=1 ca_state=4 snd_cwnd=1 snd_ssthresh=5 unacked=5 probes=0 backoff=7 retransmits=7 rto=26496000 rcv_ssthresh=102828 rtt=6729 rttvar=12066 sacked=0 retrans=1 reordering=3 lost=5
2019-01-17_11:19:30.882+0000: [I] tscCheckTcpConn: Sending debug data collection request to node 10.20.0.58 proto-pg-pf01.bear.cluster
2019-01-17_11:19:30.882+0000: Sending request to collect TCP debug data to proto-pg-pf01.bear.cluster localNode
2019-01-17_11:19:30.882+0000: [I] Calling user exit script gpfsSendRequestToNodes: event sendRequestToNodes, Async command /usr/lpp/mmfs/bin/mmcommon.
2019-01-17_11:24:52.611+0000: [E] Timed out in 300 seconds waiting for a commMsgCheckMessages reply from node 10.20.0.58 proto-pg-pf01.bear.cluster. Sending expel message.
On the client node, we see messages of the form:
2019-01-17_11:19:31.101+0000: [N] sdrServ: Received Tcp data collection request from 10.10.0.33
2019-01-17_11:19:31.102+0000: [N] GPFS will attempt to collect Tcp debug data on this node.
2019-01-17_11:24:52.838+0000: [N] sdrServ: Received expel data collection request from 10.10.0.33
2019-01-17_11:24:52.838+0000: [N] GPFS will attempt to collect debug data on this node.
2019-01-17_11:25:02.741+0000: [N] This node will be expelled from cluster rds.gpfs.servers due to expel msg from 10.10.12.41 (bber-les-nsd01-data.bb2.cluster in rds.gpfs.server
2019-01-17_11:25:03.160+0000: [N] sdrServ: Received expel data collection request from 10.20.0.56
They always appear to be to a specific type of hardware with the same Ethernet controller, though the nodes are split across three data centres and we aren’t seeing link congestion on the links between them.
On the node I listed above, it’s not actually doing anything, as the software on it is still being installed (i.e. it’s not doing GPFS or any other I/O, other than hosting a couple of home directories).
Any suggestions on what “(socket 153) state is unexpected” means?
Thanks
Simon
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
