Hi Jan,

I am NOT using the pre-populated cache that Mellanox refers to in its documentation. After chatting with support, I don't believe that's necessary anymore (although I didn't get a straight answer out of them).

For the subnet prefix, make sure to use one from the range 0xfec0000000000000-0xfec000000000001f.
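If you're running opensm as the subnet manager, the prefix is set with the subnet_prefix option. A rough sketch only; the file path and the two values below are just illustrative picks from that range, not anything site-specific:

    # /etc/opensm/opensm.conf on the subnet manager for fabric 1
    subnet_prefix 0xfec0000000000000

    # /etc/opensm/opensm.conf on the subnet manager for fabric 2
    subnet_prefix 0xfec0000000000001

Each fabric needs its own distinct prefix, and opensm has to be restarted after changing it so the new prefix is actually advertised.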
---------------------------------------------------------------------------------------------------------------
Zach Mance  [email protected]  (303) 497-1883
HPC Data Infrastructure Group / CISL / NCAR
---------------------------------------------------------------------------------------------------------------

On Tue, Mar 13, 2018 at 9:24 AM, Jan Erik Sundermann <[email protected]> wrote:

> Hello Zachary
>
> We are currently changing our setup to have IP over IB on all machines to
> be able to enable verbsRdmaCm.
>
> According to Mellanox (https://community.mellanox.com/docs/DOC-2384),
> ibacm requires pre-populated caches to be distributed to all end hosts with
> the mapping of IP to the routable GIDs (of both IB subnets). Was this also
> required in your successful deployment?
>
> Best
> Jan Erik
>
> On 03/12/2018 11:10 PM, Zachary Mance wrote:
>
>> Since I am testing out remote mounting with EDR IB routers, I'll add to
>> the discussion.
>>
>> In my lab environment I was seeing the same RDMA connections being
>> established and then disconnected shortly after. The remote filesystem
>> would eventually mount on the clients, but it took quite a while (~2 mins).
>> Even after mounting, accessing files or any metadata operations would take
>> a while to execute, but eventually completed.
>>
>> After enabling verbsRdmaCm, everything mounted just fine and in a timely
>> manner. Spectrum Scale was using the librdmacm.so library.
>>
>> I would first double-check that both clusters are able to talk to each
>> other on their IPoIB addresses, then make sure you enable verbsRdmaCm on
>> both clusters.
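(To make that last step concrete, it is one config option plus a daemon restart. A rough sketch only; the IPoIB address below is borrowed from Jan Erik's logs purely as a placeholder:

    # from a node in one cluster, confirm IPoIB reachability to the other cluster
    ping -c 3 192.168.12.5

    # enable RDMA CM; mmchconfig applies cluster-wide by default, so repeat it in the second cluster
    mmchconfig verbsRdmaCm=enable

    # restart GPFS so the change takes effect, then verify
    mmshutdown -a
    mmstartup -a
    mmlsconfig verbsRdmaCm

As Aaron notes further down the thread, the restart is needed on clients and servers in both clusters.)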
>>
>> ---------------------------------------------------------------------------------------------------------------
>> Zach Mance  [email protected]  (303) 497-1883
>> HPC Data Infrastructure Group / CISL / NCAR
>> ---------------------------------------------------------------------------------------------------------------
>>
>> On Thu, Mar 1, 2018 at 1:41 AM, John Hearns <[email protected]> wrote:
>>
>> In reply to Stuart,
>> our setup is entirely InfiniBand. We boot and install over IB, and
>> rely heavily on IP over InfiniBand.
>>
>> As for users being 'confused' due to multiple IPs, I would
>> appreciate some more depth on that one.
>> Sure, all batch systems are sensitive to hostnames (as I know to my
>> cost!), but once you get that straightened out, why should users care?
>> I am not being aggressive, just keen to find out more.
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Stuart Barkley
>> Sent: Wednesday, February 28, 2018 6:50 PM
>> To: gpfsug main discussion list <[email protected]>
>> Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB
>>
>> The problem with CM is that it seems to require configuring IP over
>> InfiniBand.
>>
>> I'm rather strongly opposed to IP over IB. We did run IPoIB years
>> ago, but pulled it out of our environment as adding unneeded
>> complexity. It requires provisioning IP addresses across the
>> InfiniBand infrastructure and possibly adding routers to other
>> portions of the IP infrastructure. It was also confusing some users
>> due to multiple IPs on the compute infrastructure.
>>
>> We have recently been in discussions with a vendor about their
>> support for GPFS over IB and they kept directing us to use CM
>> (which still didn't work). CM wasn't necessary once we found out
>> about the actual problem (we needed the undocumented
>> verbsRdmaUseGidIndexZero configuration option, among other things, due
>> to their use of SR-IOV based virtual IB interfaces).
>>
>> We don't use routed InfiniBand and it might be that CM and IPoIB are
>> required for IB routing, but I doubt it. It sounds like the OP is
>> keeping the IB and IP infrastructure separate.
>>
>> Stuart Barkley
>>
>> On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote:
>>
>> > Date: Mon, 26 Feb 2018 14:16:34
>> > From: Aaron Knister <[email protected]>
>> > Reply-To: gpfsug main discussion list <[email protected]>
>> > To: [email protected]
>> > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB
>> >
>> > Hi Jan Erik,
>> >
>> > It was my understanding that the IB hardware router required RDMA CM to work.
>> > By default GPFS doesn't use the RDMA Connection Manager, but it can be
>> > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on
>> > clients/servers (in both clusters) to take effect. Maybe someone else
>> > on the list can comment in more detail -- I've been told folks have
>> > successfully deployed IB routers with GPFS.
>> >
>> > -Aaron
>> >
>> > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote:
>> > >
>> > > Dear all
>> > >
>> > > We are currently trying to remote mount a file system in a routed
>> > > InfiniBand test setup and face problems with dropped RDMA
>> > > connections. The setup is the following:
>> > >
>> > > - Spectrum Scale Cluster 1 is set up on four servers which are
>> > > connected to the same InfiniBand network. Additionally, they are
>> > > connected to a fast Ethernet network providing IP communication in the
>> > > network 192.168.11.0/24.
>> > >
>> > > - Spectrum Scale Cluster 2 is set up on four additional servers which
>> > > are connected to a second InfiniBand network. These servers have IPs
>> > > on their IB interfaces in the network 192.168.12.0/24.
>> > >
>> > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a
>> > > dedicated machine.
>> > >
>> > > - We have a dedicated IB hardware router connected to both IB subnets.
>> > >
>> > > We tested that the routing, both IP and IB, is working between the
>> > > two clusters without problems, and that RDMA is working fine for
>> > > internal communication inside both cluster 1 and cluster 2.
>> > >
>> > > When trying to remote mount a file system from cluster 1 in cluster
>> > > 2, RDMA communication is not working as expected.
>> > > Instead, we see error messages on the remote host (cluster 2):
>> > >
>> > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
>> > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2
>> > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3
>> > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
>> > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 1
>> > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
>> > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 1
>> > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 1
>> > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 0
>> > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 0
>> > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 0
>> > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 2
>> > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
>> > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 2
>> > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 error 733 index 3
>> > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
>> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
>> > >
>> > > and in the cluster with the file system (cluster 1):
>> > >
>> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
>> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
>> > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
>> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
>> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
>> > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
>> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
>> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
>> > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
>> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
>> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
>> > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
>> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
>> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
>> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected to 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 sl 0 index 3
>> > >
>> > > Any advice on how to configure the setup in a way that would allow
>> > > the remote mount via routed IB would be very appreciated.
>> > >
>> > > Thank you and best regards
>> > > Jan Erik
>> >
>> > --
>> > Aaron Knister
>> > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center
>> > (301) 286-2776
>>
>> --
>> I've never been lost; I was once bewildered for three days, but never lost!
>>     -- Daniel Boone
> --
> Karlsruhe Institute of Technology (KIT)
> Steinbuch Centre for Computing (SCC)
>
> Jan Erik Sundermann
>
> Hermann-von-Helmholtz-Platz 1, Building 449, Room 226
> D-76344 Eggenstein-Leopoldshafen
>
> Tel: +49 721 608 26191
> Email: [email protected]
> www.scc.kit.edu
>
> KIT – The Research University in the Helmholtz Association
>
> Since 2010, KIT has been certified as a family-friendly university.
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
