Hi,
In the end, the "no route to host" error message was the correct one, and
should have been taken at face value.
Some iptables rules had accidentally been set up on some of the private
network interfaces, so a GPFS node that was already up was not reachable
from the GPFS nodes coming up after it, and they would all be expelled.
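For anyone hitting the same symptom, one quick way to spot such rules is to scan the iptables ruleset for DROP/REJECT entries that mention the private interface. This is only a sketch: "ib0" is an assumed interface name (substitute your own), and the helper function is hypothetical, not a GPFS tool.

```shell
# flag_blocking_rules: given `iptables -S` output on stdin, print any
# DROP/REJECT rules that match the given interface ("ib0" below is an
# assumption -- use whatever carries your GPFS daemon traffic).
flag_blocking_rules() {
  local iface=$1
  # first grep keeps rules bound to the interface, second keeps only
  # the ones that drop or reject traffic
  grep -E "(-i|-o) $iface" | grep -E -- '-j (DROP|REJECT)'
}

# On a real node (needs root), something like:
#   iptables -S | flag_blocking_rules ib0
```

Any output at all from that pipeline would explain "no route to host"-style failures between otherwise healthy nodes.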
Regards,
Alex
On 12/15/2015 12:34 PM, Alex Chekholko wrote:
Hi all,
I had a RHEL6.3 / MLNX OFED 1.5.3 / GPFS 3.5.0.10 cluster, which was
working fine.
We tried to upgrade some things (our mistake!): specifically, we updated
the Mellanox firmware and the OS, and switched to the built-in CentOS OFED.
So now I have a CentOS 6.7 / GPFS 3.5.0.29 cluster where the GPFS client
nodes refuse to stay connected. Here is a typical log:
[root@cn1 ~]# cat /var/adm/ras/mmfs.log.latest
Tue Dec 15 12:21:38 PST 2015: runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
Unloading modules from /lib/modules/2.6.32-573.8.1.el6.x86_64/extra
Loading modules from /lib/modules/2.6.32-573.8.1.el6.x86_64/extra
Module Size Used by
mmfs26 1836054 0
mmfslinux 330095 1 mmfs26
tracedev 43757 2 mmfs26,mmfslinux
Tue Dec 15 12:21:39.230 2015: mmfsd initializing. {Version: 3.5.0.29
Built: Nov 6 2015 15:28:46} ...
Tue Dec 15 12:21:40.847 2015: VERBS RDMA starting.
Tue Dec 15 12:21:40.849 2015: VERBS RDMA library libibverbs.so.1
(version >= 1.1) loaded and initialized.
Tue Dec 15 12:21:40.850 2015: VERBS RDMA verbsRdmasPerNode reduced from
128 to 98 to match (nsdMaxWorkerThreads 96 + (nspdThreadsPerQueue 2 *
nspdQueues 1)).
Tue Dec 15 12:21:41.122 2015: VERBS RDMA device mlx4_0 port 1 fabnum 0
opened, lid 10, 4x FDR INFINIBAND.
Tue Dec 15 12:21:41.123 2015: VERBS RDMA started.
Tue Dec 15 12:21:41.626 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:21:41.627 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:21:41.628 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:21:41.629 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:21:41.630 2015: Node 10.210.16.41 (hs-gs-02) is now the
Group Leader.
Tue Dec 15 12:21:41.641 2015: mmfsd ready
Tue Dec 15 12:21:41 PST 2015: mmcommon mmfsup invoked. Parameters:
10.210.17.1 10.210.16.41 all
Tue Dec 15 12:21:41 PST 2015: mounting /dev/hsgs
Tue Dec 15 12:21:41.918 2015: Command: mount hsgs
Tue Dec 15 12:21:42.131 2015: Connecting to 10.210.16.42 hs-gs-03 <c0n2>
Tue Dec 15 12:21:42.132 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:21:42.133 2015: Connected to 10.210.16.42 hs-gs-03 <c0n2>
Tue Dec 15 12:21:42.134 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:21:42.148 2015: VERBS RDMA connecting to 10.210.16.41
(hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:21:42.149 2015: VERBS RDMA connected to 10.210.16.41
(hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 0
Tue Dec 15 12:21:42.153 2015: VERBS RDMA connecting to 10.210.16.40
(hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:21:42.154 2015: VERBS RDMA connected to 10.210.16.40
(hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 1
Tue Dec 15 12:21:42.171 2015: Connecting to 10.210.16.11 hs-ln01.local
<c0n5>
Tue Dec 15 12:21:42.173 2015: Close connection to 10.210.16.11
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:21:42.174 2015: Retry connection to 10.210.16.11
hs-ln01.local <c0n5>
Tue Dec 15 12:21:42.173 2015: Close connection to 10.210.16.11
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:22:55.322 2015: Request sent to 10.210.16.41 (hs-gs-02) to
expel 10.210.16.11 (hs-ln01.local) from cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:22:55.323 2015: This node will be expelled from cluster
HS-GS-Cluster.hs-gs-01 due to expel msg from 10.210.17.1 (cn1.local)
Tue Dec 15 12:22:55.324 2015: This node is being expelled from the cluster.
Tue Dec 15 12:22:55.323 2015: Lost membership in cluster
HS-GS-Cluster.hs-gs-01. Unmounting file systems.
Tue Dec 15 12:22:55.325 2015: VERBS RDMA closed connection to
10.210.16.41 (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:22:55.327 2015: Cluster Manager connection broke. Probing
cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:22:55.328 2015: VERBS RDMA closed connection to
10.210.16.40 (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:22:56.419 2015: Command: err 2: mount hsgs
Tue Dec 15 12:22:56.420 2015: Specified entity, such as a disk or file
system, does not exist.
mount: No such file or directory
Tue Dec 15 12:22:56 PST 2015: finished mounting /dev/hsgs
Tue Dec 15 12:22:56.587 2015: Quorum loss. Probing cluster
HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:22:57.087 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:22:57.088 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:22:57.089 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:22:57.090 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:23:02.090 2015: Connecting to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:23:02.092 2015: Connected to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:23:49.604 2015: Node 10.210.16.41 (hs-gs-02) is now the
Group Leader.
Tue Dec 15 12:23:49.614 2015: mmfsd ready
Tue Dec 15 12:23:49 PST 2015: mmcommon mmfsup invoked. Parameters:
10.210.17.1 10.210.16.41 all
Tue Dec 15 12:23:49 PST 2015: mounting /dev/hsgs
Tue Dec 15 12:23:49.866 2015: Command: mount hsgs
Tue Dec 15 12:23:49.949 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:23:49.950 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:23:49.957 2015: VERBS RDMA connecting to 10.210.16.41
(hs-gs-02) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:23:49.958 2015: VERBS RDMA connected to 10.210.16.41
(hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 1
Tue Dec 15 12:23:49.962 2015: VERBS RDMA connecting to 10.210.16.40
(hs-gs-01) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:23:49.963 2015: VERBS RDMA connected to 10.210.16.40
(hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 0
Tue Dec 15 12:23:49.980 2015: Close connection to 10.210.16.11
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:23:49.981 2015: Retry connection to 10.210.16.11
hs-ln01.local <c0n5>
Tue Dec 15 12:23:49.980 2015: Close connection to 10.210.16.11
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:25:05.321 2015: Request sent to 10.210.16.41 (hs-gs-02) to
expel 10.210.16.11 (hs-ln01.local) from cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:25:05.322 2015: This node will be expelled from cluster
HS-GS-Cluster.hs-gs-01 due to expel msg from 10.210.17.1 (cn1.local)
Tue Dec 15 12:25:05.323 2015: This node is being expelled from the cluster.
Tue Dec 15 12:25:05.324 2015: Lost membership in cluster
HS-GS-Cluster.hs-gs-01. Unmounting file systems.
Tue Dec 15 12:25:05.325 2015: VERBS RDMA closed connection to
10.210.16.41 (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:25:05.326 2015: VERBS RDMA closed connection to
10.210.16.40 (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:25:05.327 2015: Cluster Manager connection broke. Probing
cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:25:06.413 2015: Command: err 2: mount hsgs
Tue Dec 15 12:25:06.414 2015: Specified entity, such as a disk or file
system, does not exist.
mount: No such file or directory
Tue Dec 15 12:25:06 PST 2015: finished mounting /dev/hsgs
Tue Dec 15 12:25:06.569 2015: Quorum loss. Probing cluster
HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:25:07.069 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:25:07.070 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:25:07.071 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:25:07.072 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:25:12.072 2015: Connecting to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:25:12.073 2015: Connected to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:25:59.585 2015: Node 10.210.16.41 (hs-gs-02) is now the
Group Leader.
Tue Dec 15 12:25:59.596 2015: mmfsd ready
Tue Dec 15 12:25:59 PST 2015: mmcommon mmfsup invoked. Parameters:
10.210.17.1 10.210.16.41 all
Tue Dec 15 12:25:59 PST 2015: mounting /dev/hsgs
Tue Dec 15 12:25:59.856 2015: Command: mount hsgs
Tue Dec 15 12:25:59.934 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:25:59.935 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:25:59.941 2015: VERBS RDMA connecting to 10.210.16.41
(hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:25:59.942 2015: VERBS RDMA connected to 10.210.16.41
(hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 0
Tue Dec 15 12:25:59.945 2015: VERBS RDMA connecting to 10.210.16.40
(hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:25:59.947 2015: VERBS RDMA connected to 10.210.16.40
(hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 1
Tue Dec 15 12:25:59.963 2015: Close connection to 10.210.16.11
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:25:59.964 2015: Retry connection to 10.210.16.11
hs-ln01.local <c0n5>
Tue Dec 15 12:25:59.965 2015: Close connection to 10.210.16.11
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:27:15.457 2015: Request sent to 10.210.16.41 (hs-gs-02) to
expel 10.210.16.11 (hs-ln01.local) from cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:27:15.458 2015: This node will be expelled from cluster
HS-GS-Cluster.hs-gs-01 due to expel msg from 10.210.17.1 (cn1.local)
Tue Dec 15 12:27:15.459 2015: This node is being expelled from the cluster.
Tue Dec 15 12:27:15.460 2015: Lost membership in cluster
HS-GS-Cluster.hs-gs-01. Unmounting file systems.
Tue Dec 15 12:27:15.461 2015: VERBS RDMA closed connection to
10.210.16.41 (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:27:15.462 2015: Cluster Manager connection broke. Probing
cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:27:15.463 2015: VERBS RDMA closed connection to
10.210.16.40 (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:27:16.578 2015: Command: err 2: mount hsgs
Tue Dec 15 12:27:16.579 2015: Specified entity, such as a disk or file
system, does not exist.
mount: No such file or directory
Tue Dec 15 12:27:16 PST 2015: finished mounting /dev/hsgs
Tue Dec 15 12:27:16.938 2015: Quorum loss. Probing cluster
HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:27:17.439 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:27:17.440 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:27:17.441 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:27:17.442 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:27:22.442 2015: Connecting to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:27:22.443 2015: Connected to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:28:09.955 2015: Node 10.210.16.41 (hs-gs-02) is now the
Group Leader.
Tue Dec 15 12:28:09.965 2015: mmfsd ready
Tue Dec 15 12:28:10 PST 2015: mmcommon mmfsup invoked. Parameters:
10.210.17.1 10.210.16.41 all
Tue Dec 15 12:28:10 PST 2015: mounting /dev/hsgs
Tue Dec 15 12:28:10.222 2015: Command: mount hsgs
Tue Dec 15 12:28:10.314 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:28:10.315 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:28:10.322 2015: VERBS RDMA connecting to 10.210.16.41
(hs-gs-02) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:28:10.323 2015: VERBS RDMA connected to 10.210.16.41
(hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 1
Tue Dec 15 12:28:10.326 2015: VERBS RDMA connecting to 10.210.16.40
(hs-gs-01) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:28:10.328 2015: VERBS RDMA connected to 10.210.16.40
(hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 0
Tue Dec 15 12:28:10.344 2015: Close connection to 10.210.16.11
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:28:10.345 2015: Retry connection to 10.210.16.11
hs-ln01.local <c0n5>
Tue Dec 15 12:28:10.346 2015: Close connection to 10.210.16.11
hs-ln01.local <c0n5> (No route to host)
All the IB / RDMA stuff looks OK to me, but as soon as the GPFS clients
connect, they try to expel each other. The four NSD servers seem fine,
though. Trying Mellanox OFED 3.x yields the same results, so I don't
think it's an IB issue.
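One way to rule plain TCP reachability in or out is to probe the GPFS daemon port between nodes over the daemon network. This is a sketch: 1191 is the default mmfsd TCP port, the node names below are taken from the log above, and the helper function is hypothetical.

```shell
# check_port: try a TCP connect to host:port (default 1191, the GPFS
# daemon port) using bash's /dev/tcp, with a 3-second timeout.
check_port() {
  local host=$1 port=${2:-1191}
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port NOT reachable"
  fi
}

# Probe each cluster node (names from the log; adjust for your cluster).
for n in hs-gs-01 hs-gs-02 hs-ln01.local; do
  check_port "$n"
done
```

A "NOT reachable" result between nodes that can otherwise ping each other points at firewalling rather than InfiniBand.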
[root@cn1 ~]# uname -r
2.6.32-573.8.1.el6.x86_64
[root@cn1 ~]# rpm -qa|grep gpfs
gpfs.gpl-3.5.0-29.noarch
gpfs.docs-3.5.0-29.noarch
gpfs.msg.en_US-3.5.0-29.noarch
gpfs.base-3.5.0-29.x86_64
Does anyone have any suggestions?
Regards,
--
Alex Chekholko [email protected] 347-401-4860
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss