Assuming CentOS 7.5 parallels RHEL 7.5, you would need Spectrum Scale 4.2.3-9, because that is the release (along with 5.0.1 PTF1) that supports RHEL 7.5.
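Since several 4.2.3 PTF levels come up in this thread (4.2.3-6 installed, 4.2.3-9 needed for RHEL 7.5), a quick version-gate check can save confusion. A minimal sketch, not an official IBM tool, assuming GNU sort for version-aware ordering; the version strings are the ones discussed here:

```shell
# Sketch: check whether an installed Spectrum Scale level meets a minimum
# release, using sort -V for version-aware comparison (GNU coreutils).
installed="4.2.3-6"
required="4.2.3-9"
# The smaller of the two versions sorts first.
lowest=$(printf '%s\n%s\n' "$installed" "$required" | sort -V | head -n1)
if [ "$installed" = "$required" ] || [ "$lowest" = "$required" ]; then
  result="OK: $installed >= $required"
else
  result="UPGRADE NEEDED: $installed < $required"
fi
echo "$result"
```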
Fred
__________________________________________________
Fred Stock | IBM Pittsburgh Lab | 720-430-8821
[email protected]

From: Iban Cabrillo <[email protected]>
To: gpfsug-discuss <[email protected]>
Date: 06/15/2018 11:16 AM
Subject: Re: [gpfsug-discuss] Thousands of CLOSE_WAIT connections
Sent by: [email protected]

Hi Anderson,

Comments are inline.

From: "Anderson Ferreira Nobre" <[email protected]>
To: "gpfsug-discuss" <[email protected]>
Cc: "gpfsug-discuss" <[email protected]>
Sent: Friday, 15 June, 2018 16:49:14
Subject: Re: [gpfsug-discuss] Thousands of CLOSE_WAIT connections

Hi Iban,

I think more information is needed to be able to help you. Here it is:

- Red Hat version: Which is it, 7.2, 7.3 or 7.4?
  CentOS Linux release 7.5.1804 (Core)

- Red Hat kernel version: The GPFS FAQ lists the recommended kernel levels.

- Platform: Is it x86_64?
  Yes, it is.

- Is there a reason for you to stay on 4.2.3-6? Could you update to 4.2.3-9 or 5.0.1?
  No, that was the default version we got from our customer; we could upgrade to 4.2.3-9 in time...

- How is the name resolution? Can you ping from one node to another, and in reverse?
  Yes, resolution works fine in both directions (there is no firewall or ICMP filter), using the private Ethernet network (not IB).

- TCP/IP tuning: Which TCP/IP parameters are you using?
I have used the following for 7.4:

[root@XXXX sysctl.d]# cat 99-ibmscale.conf
net.core.somaxconn = 10000
net.core.netdev_max_backlog = 250000
net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_rfc1337 = 1
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_fin_timeout = 10
net.core.rmem_default = 4194304
net.core.rmem_max = 4194304
net.core.wmem_default = 4194304
net.core.wmem_max = 4194304
net.core.optmem_max = 4194304
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
vm.min_free_kbytes = 512000
kernel.panic_on_oops = 0
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
vm.swappiness = 0
vm.dirty_ratio = 10

This is mine:

net.ipv4.conf.default.accept_source_route = 0
net.core.somaxconn = 8192
net.ipv4.tcp_fin_timeout = 30
kernel.sysrq = 1
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 13491064832
kernel.shmall = 4294967296
net.ipv4.neigh.default.gc_stale_time = 120
net.ipv4.tcp_synack_retries = 10
net.ipv4.tcp_sack = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.core.netdev_max_backlog = 250000
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.neigh.default.gc_thresh1 = 30000
net.ipv4.neigh.default.gc_thresh2 = 32000
net.ipv4.neigh.default.gc_thresh3 = 32768
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.neigh.enp3s0.mcast_solicit = 9
net.ipv4.neigh.enp3s0.ucast_solicit = 9
net.ipv6.neigh.enp3s0.ucast_solicit = 9
net.ipv6.neigh.enp3s0.mcast_solicit = 9
net.ipv4.neigh.ib0.mcast_solicit = 18
vm.oom_dump_tasks = 1
vm.min_free_kbytes = 524288

Since we disabled IPv6, we had to rebuild the kernel image with the following command:

[root@XXXX ~]# dracut -f -v

I did that on the WNs but not on the GPFS servers...

- GPFS tuning parameters: Can you list them?

- Spectrum Scale status: Can you send the following outputs:
  mmgetstate -a -L
  mmlscluster

[root@gpfs01 ~]# mmlscluster

GPFS cluster information
========================
GPFS cluster name:        gpfsgui.ifca.es
GPFS cluster id:          8574383285738337182
GPFS UID domain:          gpfsgui.ifca.es
Remote shell command:     /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type:          CCR

Node  Daemon node name       IP address    Admin node name        Designation
--------------------------------------------------------------------------------
  1   gpfs01.ifca.es         10.10.0.111   gpfs01.ifca.es         quorum-manager-perfmon
  2   gpfs02.ifca.es         10.10.0.112   gpfs02.ifca.es         quorum-manager-perfmon
  3   gpfsgui.ifca.es        10.10.0.60    gpfsgui.ifca.es        quorum-perfmon
  9   cloudprv-02-9.ifca.es  10.10.140.26  cloudprv-02-9.ifca.es
 10   cloudprv-02-8.ifca.es  10.10.140.25  cloudprv-02-8.ifca.es
 13   node1.ifca.es          10.10.151.3   node3.ifca.es
......
 44   node24.ifca.es         10.10.151.24  node24.ifca.es
.....

  mmhealth cluster show (it was shut down by hand)

[root@gpfs01 ~]# mmhealth cluster show --verbose
Error: The monitoring service is down and does not respond, please restart it.

  mmhealth cluster show --verbose
  mmhealth node eventlog

2018-06-12 23:31:31.487471 CET quorum_down ERROR The node is not able to form a quorum with the other available nodes.
2018-06-12 23:31:52.856082 CET ccr_local_server_ok INFO The local GPFS CCR server is reachable PC_LOCAL_SERVER
2018-06-12 23:33:06.397125 CET fs_remount_mount INFO The filesystem gpfs was mounted internal
2018-06-12 23:33:06.400622 CET fs_remount_mount INFO The filesystem gpfs was mounted remount
2018-06-12 23:33:06.787556 CET mounted_fs_check INFO The filesystem gpfs is mounted
2018-06-12 23:33:22.670023 CET quorum_up INFO Quorum achieved
2018-06-13 14:01:51.376885 CET service_removed INFO On the node gpfs01.ifca.es the threshold monitor was removed
2018-06-13 14:01:51.385115 CET service_removed INFO On the node gpfs01.ifca.es the perfmon monitor was removed
2018-06-13 18:41:55.846893 CET quorum_down ERROR The node is not able to form a quorum with the other available nodes.
2018-06-13 18:42:39.217545 CET fs_remount_mount INFO The filesystem gpfs was mounted internal
2018-06-13 18:42:39.221455 CET fs_remount_mount INFO The filesystem gpfs was mounted remount
2018-06-13 18:42:39.653778 CET mounted_fs_check INFO The filesystem gpfs is mounted
2018-06-13 18:42:55.956125 CET quorum_up INFO Quorum achieved
2018-06-13 18:43:17.448980 CET service_running INFO The service perfmon is running on node gpfs01.ifca.es
2018-06-13 18:51:14.157351 CET service_running INFO The service threshold is running on node gpfs01.ifca.es
2018-06-14 08:04:06.341564 CET ib_rdma_nic_unrecognized ERROR IB RDMA NIC mlx5_0/1 was not recognized
2018-06-14 08:04:30.216689 CET quorum_down ERROR The node is not able to form a quorum with the other available nodes.
2018-06-14 08:05:10.836900 CET fs_remount_mount INFO The filesystem gpfs was mounted internal
2018-06-14 08:05:27.135275 CET quorum_up INFO Quorum achieved
2018-06-14 08:05:40.446601 CET fs_remount_mount INFO The filesystem gpfs was mounted remount
2018-06-14 08:05:40.881064 CET mounted_fs_check INFO The filesystem gpfs is mounted
2018-06-14 08:08:56.455851 CET ib_rdma_nic_recognized INFO IB RDMA NIC mlx5_0/1 was recognized
2018-06-14 12:29:58.772033 CET ccr_quorum_nodes_warn WARNING At least one quorum node is not reachable Item=PC_QUORUM_NODES,ErrMsg='Ping CCR quorum nodes failed',Failed='10.10.0.112'
2018-06-14 15:41:57.860925 CET ccr_quorum_nodes_ok INFO All quorum nodes are reachable PC_QUORUM_NODES
2018-06-15 13:04:41.403505 CET pmcollector_down ERROR pmcollector service should be started and is stopped
2018-06-15 15:23:00.121760 CET quorum_down ERROR The node is not able to form a quorum with the other available nodes.
2018-06-15 15:23:43.616075 CET fs_remount_mount INFO The filesystem gpfs was mounted internal
2018-06-15 15:23:43.619593 CET fs_remount_mount INFO The filesystem gpfs was mounted remount
2018-06-15 15:23:44.053493 CET mounted_fs_check INFO The filesystem gpfs is mounted
2018-06-15 15:24:00.219003 CET quorum_up INFO Quorum achieved

[root@gpfs02 ~]# mmhealth node eventlog
Error: The monitoring service is down and does not respond, please restart it.

  mmlsnode -L -N waiters

Non-default parameters:

[root@gpfs01 ~]# mmdiag --config | grep !
! ccrEnabled 1
! cipherList AUTHONLY
! clusterId 8574383285738337182
! clusterName gpfsgui.ifca.es
! dmapiFileHandleSize 32
! idleSocketTimeout 0
! ignorePrefetchLUNCount 1
! maxblocksize 16777216
! maxFilesToCache 4000
! maxInodeDeallocPrefetch 64
! maxMBpS 6000
! maxStatCache 512
! minReleaseLevel 1700
! myNodeConfigNumber 1
! pagepool 17179869184
! socketMaxListenConnections 512
! socketRcvBufferSize 131072
! socketSndBufferSize 65536
! verbsPorts mlx5_0/1
! verbsRdma enable
!
worker1Threads 256

Regards, I

Abraços / Regards / Saludos,

Anderson Nobre
AIX & Power Consultant
Master Certified IT Specialist
IBM Systems Hardware Client Technical Team – IBM Systems Lab Services
Phone: 55-19-2132-4317
E-mail: [email protected]

----- Original message -----
From: Iban Cabrillo <[email protected]>
Sent by: [email protected]
To: [email protected]
Cc:
Subject: [gpfsug-discuss] Thousands of CLOSE_WAIT connections
Date: Fri, Jun 15, 2018 9:12 AM

Dear all,

We recently reinstalled, moving from GPFS 3.5 to Spectrum Scale 4.2.3-6 on Red Hat 7. We are running two NSD servers and a GUI; there is no firewall on the GPFS network, and SELinux is disabled. I have tried changing the manager and cluster manager roles between the servers, with the same result: server 01 always keeps accumulating CLOSE_WAIT connections:

Node  Daemon node name  IP address   Admin node name  Designation
--------------------------------------------------------------------------------
  1   gpfs01.ifca.es    10.10.0.111  gpfs01.ifca.es   quorum-manager-perfmon
  2   gpfs02.ifca.es    10.10.0.112  gpfs02.ifca.es   quorum-manager-perfmon
  3   gpfsgui.ifca.es   10.10.0.60   gpfsgui.ifca.es  quorum-perfmon
.......

Installation and configuration work fine, but now we see that one of the servers does not close its mmfsd connections, and their number grows forever, while the other NSD server always stays in the same range:

[root@gpfs01 ~]# netstat -putana | grep 1191 | wc -l
19701
[root@gpfs01 ~]# netstat -putana | grep 1191 | grep CLOSE_WAIT | wc -l
19528
....
[root@gpfs02 ~]# netstat -putana | grep 1191 | wc -l
215
[root@gpfs02 ~]# netstat -putana | grep 1191 | grep CLOSE_WAIT | wc -l

This is causing gpfs01 to stop answering cluster commands. The NSDs are balanced between the servers (same size):

[root@gpfs02 ~]# mmlsnsd

File system  Disk name  NSD servers
---------------------------------------------------------------------------
gpfs         nsd1       gpfs01,gpfs02
gpfs         nsd2       gpfs01,gpfs02
gpfs         nsd3       gpfs02,gpfs01
gpfs         nsd4       gpfs02,gpfs01
.....
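The netstat pipelines above count one state at a time with separate grep | wc -l passes; the same breakdown can be tallied in a single pass with awk. A sketch, with a here-document standing in for live `netstat -putana` output (the sample rows are illustrative, not real output from the affected node):

```shell
# Tally TCP connection states for the GPFS daemon port (1191) in one pass.
# The here-document below is illustrative sample data, not real output.
cat > /tmp/netstat.sample <<'EOF'
tcp 0 0 10.10.0.111:1191 10.10.151.3:51000 ESTABLISHED 19620/mmfsd
tcp 1 0 10.10.0.111:1191 10.10.151.4:51001 CLOSE_WAIT 19620/mmfsd
tcp 1 0 10.10.0.111:1191 10.10.151.5:51002 CLOSE_WAIT 19620/mmfsd
EOF
# Field 4 is the local address, field 6 the state; count connections per state.
awk '$4 ~ /:1191$/ {states[$6]++} END {for (s in states) print s, states[s]}' \
    /tmp/netstat.sample | sort
```

On a live node, replace the sample file with `netstat -putana` (or `ss -tanp`) output; a steadily growing CLOSE_WAIT count with a flat ESTABLISHED count is the pattern described above.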
The processes seem similar on both servers; only mmccr is running on server 1 and not on server 2.

gpfs01
#######
root  9169     1  0 feb07 ?        22:27:54 python /usr/lpp/mmfs/bin/mmsysmon.py
root 11533  6154  0 13:41 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmsdrquery sdrq_fs_info all
root 11713     1  0 13:41 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root 12367 11533  0 13:43 ?        00:00:00 /usr/lpp/mmfs/bin/mmccr vget mmRunningCommand
root 12641  6162  0 13:44 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmsdrquery sdrq_nsd_info sdrq_nsd_name:sdrq_fs_name:sdrq_storage_pool
root 12668 12641  0 13:44 ?        00:00:00 /usr/lpp/mmfs/bin/mmccr fget -c 835 mmsdrfs /var/mmfs/gen/mmsdrfs.12641
root 12950 11713  0 13:44 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root 12959  9169 13 13:44 ?        00:00:00 /usr/lpp/mmfs/bin/mmccr check -Y -e
root 12968  3150  0 13:45 pts/3    00:00:00 grep --color=auto mm
root 19620 26468 38 jun14 ?        11:28:36 /usr/lpp/mmfs/bin/mmfsd
root 19701     2  0 jun14 ?        00:00:00 [mmkproc]
root 19702     2  0 jun14 ?        00:00:00 [mmkproc]
root 19703     2  0 jun14 ?        00:00:00 [mmkproc]
root 26468     1  0 jun05 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/runmmfs

[root@gpfs02 ~]# ps -feA | grep mm
root  5074     1  0 feb07 ?        01:00:34 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root  5128 31456 28 jun14 ?        06:18:07 /usr/lpp/mmfs/bin/mmfsd
root  5255     2  0 jun14 ?        00:00:00 [mmkproc]
root  5256     2  0 jun14 ?        00:00:00 [mmkproc]
root  5257     2  0 jun14 ?        00:00:00 [mmkproc]
root 15196  5074  0 13:47 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
root 15265 13117  0 13:47 pts/0    00:00:00 grep --color=auto mm
root 31456     1  0 jun05 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/runmmfs

Any ideas would be appreciated.
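The two sysctl sets quoted earlier in the thread differ in several values (e.g. tcp_fin_timeout 10 vs 30, rmem_max 4194304 vs 16777216). A sketch for spotting such differences key by key, with small inline excerpts standing in for the full configs:

```shell
# Compare two sysctl-style files key by key and print keys whose values
# differ. The inline files are small excerpts standing in for the full
# configs quoted in the thread.
cat > /tmp/site-a.conf <<'EOF'
net.ipv4.tcp_fin_timeout = 10
net.core.rmem_max = 4194304
net.core.somaxconn = 10000
EOF
cat > /tmp/site-b.conf <<'EOF'
net.ipv4.tcp_fin_timeout = 30
net.core.rmem_max = 16777216
net.core.somaxconn = 8192
EOF
# Normalize "key = value" to "key=value" and sort so join(1) can merge on key.
norm() { sed 's/[[:space:]]*=[[:space:]]*/=/' "$1" | sort; }
norm /tmp/site-a.conf > /tmp/a.norm
norm /tmp/site-b.conf > /tmp/b.norm
# join on field 1 with '=' as delimiter -> "key=valA=valB"; awk flags mismatches.
join -t= -j1 /tmp/a.norm /tmp/b.norm | awk -F= '$2 != $3 {print $1 ": " $2 " vs " $3}'
```

Running it against the real files (e.g. /etc/sysctl.d/99-ibmscale.conf from each site) shows at a glance which tunables actually diverge between the two setups.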
Regards, I

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
