Poking at this further, it doesn't look like it's an ECMP issue. Are there any known reports of issues when running Lustre over IPoIB on an OPA fabric? It seems a stretch, but it's the only difference in the network at this point.
Can anyone suggest somewhere to look for more debug info? /var/log/messages and dmesg don't reveal much.

On Mon, Feb 4, 2019 at 9:19 AM Michael Di Domenico <[email protected]> wrote:
>
> Has anyone heard of lustre having trouble mounting when ECMP is used
> on the compute nodes' default gateway?
>
> I'm trying to mount an existing lustre filesystem on a new cluster,
> where the connections ride over OPA IPoIB, which is then converted to
> 10ge via four routers. I'm using ECMP to distribute the packets over
> the four routers.
>
> I can mount lustre on other ethernet clients, but not the ones behind
> my ECMP gateways. Changing the compute node gateway from ECMP to a
> single device doesn't change anything. I'm not easily able to revert
> the network side from ECMP to a single route, so i haven't tried that.
>
> The output i get from mount is, "failed: Input/output error retries left: 0"
>
> syslog on the client and the MGS seem to show that the connection is
> being broken between the MGS and client during the mount with a "timed
> out for slow reply" message.
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
