Since I am testing out remote mounting with EDR IB routers, I'll add to the discussion.

In my lab environment I was seeing the same RDMA connections being established and then disconnected shortly after. The remote filesystem would eventually mount on the clients, but it took quite a while (~2 minutes). Even after mounting, accessing files or running any metadata operations took a long time, though they did eventually complete. After enabling verbsRdmaCm, everything mounted fine and in a timely manner, and Spectrum Scale was using the librdmacm.so library. I would first double-check that both clusters can talk to each other on their IPoIB addresses, then make sure verbsRdmaCm is enabled on both clusters.
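Roughly, the steps I went through were something like the following. This is only a sketch of what worked for me, not a definitive recipe: the IPoIB addresses are the ones quoted later in this thread and are placeholders for your own, and the restart scope (-a vs. -N with a node list) is something you should adjust for your environment.

    # 1. Check that the clusters can reach each other over IPoIB
    #    (addresses are placeholders taken from this thread - use your own)
    ping -c 3 192.168.12.5      # from a cluster 1 node to a cluster 2 IPoIB address
    ping -c 3 192.168.11.1      # from a cluster 2 node to a cluster 1 IPoIB address

    # 2. Enable the RDMA Connection Manager in BOTH clusters
    mmchconfig verbsRdmaCm=enable

    # 3. The setting only takes effect after the daemons restart, so restart
    #    GPFS on the affected nodes (cluster-wide here; adjust to suit)
    mmshutdown -a
    mmstartup -a

    # 4. Confirm the configured value
    mmlsconfig verbsRdmaCm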
---------------------------------------------------------------------------------------------------------------
Zach Mance  zma...@ucar.edu  (303) 497-1883
HPC Data Infrastructure Group / CISL / NCAR
---------------------------------------------------------------------------------------------------------------

On Thu, Mar 1, 2018 at 1:41 AM, John Hearns <john.hea...@asml.com> wrote:
> In reply to Stuart,
> our setup is entirely Infiniband. We boot and install over IB, and rely
> heavily on IP over Infiniband.
>
> As for users being 'confused' due to multiple IPs, I would appreciate some
> more depth on that one.
> Sure, all batch systems are sensitive to hostnames (as I know to my cost!)
> but once you get that straightened out why should users care?
> I am not being aggressive, just keen to find out more.
>
> -----Original Message-----
> From: gpfsug-discuss-boun...@spectrumscale.org [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Stuart Barkley
> Sent: Wednesday, February 28, 2018 6:50 PM
> To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
> Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB
>
> The problem with CM is that it seems to require configuring IP over
> Infiniband.
>
> I'm rather strongly opposed to IP over IB. We did run IPoIB years ago,
> but pulled it out of our environment as adding unneeded complexity. It
> requires provisioning IP addresses across the Infiniband infrastructure and
> possibly adding routers to other portions of the IP infrastructure. It was
> also confusing some users due to multiple IPs on the compute infrastructure.
>
> We have recently been in discussions with a vendor about their support for
> GPFS over IB and they kept directing us to using CM (which still didn't
> work). CM wasn't necessary once we found out about the actual problem (we
> needed the undocumented verbsRdmaUseGidIndexZero configuration option among
> other things due to their use of SR-IOV based virtual IB interfaces).
>
> We don't use routed Infiniband and it might be that CM and IPoIB are
> required for IB routing, but I doubt it. It sounds like the OP is keeping
> IB and IP infrastructure separate.
>
> Stuart Barkley
>
> On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote:
>
> > Date: Mon, 26 Feb 2018 14:16:34
> > From: Aaron Knister <aaron.s.knis...@nasa.gov>
> > Reply-To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
> > To: gpfsug-discuss@spectrumscale.org
> > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB
> >
> > Hi Jan Erik,
> >
> > It was my understanding that the IB hardware router required RDMA CM to work.
> > By default GPFS doesn't use the RDMA Connection Manager but it can be
> > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on
> > clients/servers (in both clusters) to take effect.
> > Maybe someone else on the list can comment in more detail -- I've been
> > told folks have successfully deployed IB routers with GPFS.
> >
> > -Aaron
> >
> > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote:
> > >
> > > Dear all,
> > >
> > > we are currently trying to remote mount a file system in a routed
> > > Infiniband test setup and face problems with dropped RDMA
> > > connections. The setup is the following:
> > >
> > > - Spectrum Scale Cluster 1 is set up on four servers which are
> > >   connected to the same Infiniband network. Additionally they are
> > >   connected to a fast Ethernet network providing IP communication in
> > >   the network 192.168.11.0/24.
> > >
> > > - Spectrum Scale Cluster 2 is set up on four additional servers which
> > >   are connected to a second Infiniband network. These servers have IPs
> > >   on their IB interfaces in the network 192.168.12.0/24.
> > >
> > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a
> > >   dedicated machine.
> > >
> > > - We have a dedicated IB hardware router connected to both IB subnets.
> > >
> > > We tested that the routing, both IP and IB, is working between the
> > > two clusters without problems and that RDMA is working fine both for
> > > internal communication inside cluster 1 and cluster 2.
> > >
> > > When trying to remote mount a file system from cluster 1 in cluster 2,
> > > RDMA communication is not working as expected. Instead we see error
> > > messages on the remote host (cluster 2):
> > >
> > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
> > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2
> > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3
> > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
> > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 1
> > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
> > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 1
> > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 1
> > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 0
> > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 0
> > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 0
> > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 2
> > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
> > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2
> > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3
> > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
> > >
> > > and in the cluster with the file system (cluster 1):
> > >
> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
> > >
> > > Any advice on how to configure the setup in a way that would allow
> > > the remote mount via routed IB would be very appreciated.
> > >
> > > Thank you and best regards
> > > Jan Erik
> >
> > --
> > Aaron Knister
> > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center
> > (301) 286-2776
>
> --
> I've never been lost; I was once bewildered for three days, but never lost!
>   -- Daniel Boone
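Once the daemons are back up with verbsRdmaCm enabled, a quick way to confirm the setting is live and that the RDMA connections are staying up is something along these lines. Again only a sketch: /var/adm/ras/mmfs.log.latest is the usual default log location and may differ on your installation.

    # Confirm the running daemon picked up the setting (run on nodes in both clusters)
    mmdiag --config | grep -i verbsRdmaCm

    # Watch the VERBS RDMA connect/close messages like the ones quoted above;
    # once things are healthy the "error 733" / IBV_WC_RETRY_EXC_ERR closures should stop
    grep "VERBS RDMA" /var/adm/ras/mmfs.log.latest | tail -n 20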
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss