Greetings List!

I have a very small HPC cluster running CentOS 7.2. The Lustre servers are running the Lustre kernel, kernel-3.10.0-327.3.1.el7_lustre.x86_64. The clients are running kernel-3.10.0-327.3.1.el7.x86_64.

I have two compute node clients successfully mounting the Lustre file system from the servers. The next two compute clients will not mount Lustre. I have the lustre-client-2.8.0-3.10.0_327.3.1.el7.x86_64 and lustre-client-modules-2.8.0-3.10.0_327.3.1.el7.x86_64 RPMs installed on all compute clients, including the next two. My InfiniBand network is up and successfully pings the other systems. I can cleanly "modprobe lustre" using /etc/modprobe.d/lustre.conf containing one line:

    options lnet networks="o2ib0(ib0)"

This configuration is the same on both Lustre client and server systems, all of which use ib0. On the next two compute clients I can successfully "lctl ping mds-ib@o2ib0" and successfully ping the OSS similarly.

I try to mount the Lustre file system on the next two compute clients via the command:

    mount -t lustre A.B.C.D@o2ib0:/myLustre /myLustre

where the A.B.C.D address exists and works as described above, and the Lustre FS is "myLustre", which mounts successfully on the two earlier compute clients. This mount fails on both of my next two compute clients with the STDERR:

    mount.lustre: mount A.B.C.D@o2ib0:/myLustre /myLustre failed: Input/output error

The compute client /var/log/messages file shows:

    [date] [hostname] kernel: Lustre: 51814:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1480097968/real 1480097992] req@ffff8800aa14000 x1551992831868952/t0(0) o250->MGCA.B.C.D@o2ib@A.B.C.D@o2ib:26:25 lens 520/544 e 0 to 1 dl 1480997973 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

The above appears twice in a row, followed by:

    [date] [hostname] kernel: LustreError: 15c-8: MGCA.B.C.D@o2ib: The configuration from log 'myLustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
    [date] [hostname] kernel: Lustre: Unmounted myLustre-client
    [date] [hostname] kernel: LustreError: 53873:0:(obd_mount.c:1426:lustre_fill_super()) unable to mount (-5)

As all four compute nodes are built from a single kickstart file, I do not understand why two compute clients can mount the /myLustre file system and two cannot. The IB fabric on the in-kernel opensm-3.3.10-1.el7.x86_64 looks clean, with no entries in /var/log/opensm-unhealthy-ports-dump. If I go all the way back to the last opensm start, I do see a single line in /var/log/opensm.log on the opensm server for the next compute client stating:

    subn_validate_neighbor: ERR 7518: neighbor does not point back at us (guid: [GUID of my next compute client])

Is this last opensm error completely stopping my Lustre mount when all other IP pings are completely successful?

TIA,
megan
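
Since the four nodes are supposed to be identical, the checks described above (kernel, Lustre packages, the lnet options line, local NIDs, and the lctl ping) could be collected into one report per node and then compared with diff. The following is only a rough sketch of that idea: the function names run_if_present and collect_client_facts are hypothetical, and the MGS NID is passed in as an argument rather than assumed. It does not attempt the mount itself.

```shell
#!/bin/sh
# Hypothetical helper: print the facts described in this message so that the
# output from a working client and a failing client can be compared.

run_if_present() {
    # Print a section header, then run the command if it exists on this host.
    label=$1
    shift
    echo "== $label =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "($1 not found)"
    fi
}

collect_client_facts() {
    # $1 (optional): NID of the MGS, e.g. the A.B.C.D@o2ib0 used for the mount.
    mgs_nid=${1:-}
    run_if_present "kernel" uname -r
    run_if_present "lustre packages" rpm -qa 'lustre*'
    run_if_present "lnet options" cat /etc/modprobe.d/lustre.conf
    run_if_present "local NIDs" lctl list_nids
    if [ -n "$mgs_nid" ]; then
        run_if_present "lctl ping $mgs_nid" lctl ping "$mgs_nid"
    fi
}
```

On each node one would then run something like "collect_client_facts A.B.C.D@o2ib0 > /tmp/facts.$(hostname).txt" and diff the resulting files from a working and a failing client. Note that rpm -qa does not guarantee ordering, so sorting that section before comparing may be needed.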