Hi Jan Erik,
My understanding is that the IB hardware router requires RDMA CM to
work. By default GPFS doesn't use the RDMA Connection Manager, but it can
be enabled (e.g. mmchconfig verbsRdmaCm=enable). I think this requires a
restart on clients/servers (in both clusters) to take effect. Maybe someone
else on the list can comment in more detail; I've been told folks have
successfully deployed IB routers with GPFS.
-Aaron
On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote:
Dear all,
We are currently trying to remote mount a file system in a routed InfiniBand
test setup and face problems with dropped RDMA connections. The setup is the
following:
- Spectrum Scale Cluster 1 is set up on four servers which are connected to the
same InfiniBand network. Additionally, they are connected to a fast Ethernet
network providing IP communication in the network 192.168.11.0/24.
- Spectrum Scale Cluster 2 is set up on four additional servers which are
connected to a second InfiniBand network. These servers have IPs on their IB
interfaces in the network 192.168.12.0/24.
- IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a dedicated
machine.
- We have a dedicated IB hardware router connected to both IB subnets.
We tested that the routing, both IP and IB, is working between the two clusters
without problems, and that RDMA is working fine for internal communication
inside both cluster 1 and cluster 2.
When trying to remote mount a file system from cluster 1 in cluster 2, RDMA
communication is not working as expected. Instead, we see error messages on the
remote host (cluster 2):
2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to 192.168.11.4
(iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to 192.168.11.4
(iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2
2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to 192.168.11.1
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733
index 3
2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to 192.168.11.1
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to 192.168.11.3
(iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733
index 1
2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to 192.168.11.1
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to 192.168.11.3
(iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 1
2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to 192.168.11.3
(iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 1
2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to 192.168.11.2
(iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733
index 0
2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to 192.168.11.2
(iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 0
2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to 192.168.11.2
(iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 0
2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to 192.168.11.4
(iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733
index 2
2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to 192.168.11.4
(iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to 192.168.11.4
(iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2
2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to 192.168.11.1
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733
index 3
2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to 192.168.11.1
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to 192.168.11.1
(iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
and in the cluster with the file system (cluster 1):
2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error
IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to 192.168.12.5
(iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to
RDMA read error IBV_WC_RETRY_EXC_ERR index 3
2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected to
192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
fabnum 0 sl 0 index 3
2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error
IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to 192.168.12.5
(iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to
RDMA read error IBV_WC_RETRY_EXC_ERR index 3
2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected to
192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
fabnum 0 sl 0 index 3
2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error
IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to 192.168.12.5
(iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to
RDMA read error IBV_WC_RETRY_EXC_ERR index 3
2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected to
192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
fabnum 0 sl 0 index 3
2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error
IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to 192.168.12.5
(iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to
RDMA read error IBV_WC_RETRY_EXC_ERR index 3
2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected to
192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
fabnum 0 sl 0 index 3
2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error
IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to 192.168.12.5
(iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to
RDMA read error IBV_WC_RETRY_EXC_ERR index 3
2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected to
192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
fabnum 0 sl 0 index 3
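To make the pattern in the excerpts above easier to see, here is a minimal
Python sketch that rejoins the wrapped log records and measures how long each
RDMA connection stayed up before it was closed. The function names
(join_records, connection_lifetimes) are made up for illustration and are not
part of any GPFS tooling; the message patterns are taken only from the lines
quoted in this mail, so other log variants may need different expressions.

```python
import re
from datetime import datetime

# Every record starts with a timestamp such as 2018-02-23_13:48:47.037+0100
STAMP = re.compile(r"^\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}: ")

# Only connect and close events are needed; "connecting to" and
# "rdma read error" records are deliberately not matched.
EVENT = re.compile(
    r"^(?P<ts>\S+): \[[IE]\] VERBS RDMA "
    r"(?P<kind>connected|accepted and connected|closed connection) to "
    r"(?P<peer>\d+\.\d+\.\d+\.\d+)"
)


def join_records(lines):
    """Rejoin records that were wrapped across several lines in the mail."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if STAMP.match(line) or not records:
            records.append(line)          # a new record begins here
        else:
            records[-1] += " " + line     # continuation of the previous record
    return records


def connection_lifetimes(lines):
    """Return {peer_ip: [seconds, ...]} from each connect to the next close."""
    opened = {}
    lifetimes = {}
    for record in join_records(lines):
        m = EVENT.match(record)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d_%H:%M:%S.%f%z")
        peer = m["peer"]
        if m["kind"] == "closed connection":
            if peer in opened:  # ignore closes whose connect is not in the log
                delta = ts - opened.pop(peer)
                lifetimes.setdefault(peer, []).append(delta.total_seconds())
        else:
            opened[peer] = ts
    return lifetimes


# Three records copied from the cluster 1 excerpt above
sample = """\
2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected to
192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
fabnum 0 sl 0 index 3
2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error
IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to 192.168.12.5
(iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to
RDMA read error IBV_WC_RETRY_EXC_ERR index 3
""".splitlines()

print(connection_lifetimes(sample))
```

For the three sample records, the connection to 192.168.12.5 is accepted at
13:47:47.161 and closed at 13:48:04.317, i.e. it stays up for about 17 seconds
before the RDMA read fails with IBV_WC_RETRY_EXC_ERR.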
Any advice on how to configure the setup in a way that would allow the remote
mount via routed IB would be much appreciated.
Thank you and best regards
Jan Erik
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776