You mentioned that the servers are on the o2ib0 network, but the error messages 
indicate that the client is trying to reach the MDT on the tcp network 
(10.7.29.130@tcp). The file system configuration still holds the old NIDs and 
needs to be updated to use the o2ib0 NIDs.
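
To make the mismatch concrete, here is a small sketch that pulls the target NID out of one of the quoted log lines and compares its network with the one in the client's fstab entry. The two strings are copied from the original post; nid_net is a hypothetical helper written for this illustration, not a Lustre tool, and the reading that the tcp NID comes from the configuration the client receives at mount time is my assumption based on the symptoms.

```shell
#!/bin/sh
# Sketch only: compare the LNet network named in the client's timeout message
# with the network of the MGS NID the client mounts from.
# Both strings below are copied from the original post.
log_target='o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10'
fstab_mgs='10.7.129.130@o2ib0:/tlustre'

# nid_net (hypothetical helper): print the network part of a NID,
# e.g. "tcp" for 10.7.29.130@tcp.
nid_net() {
    printf '%s\n' "${1##*@}"
}

# Extract address@network from the log target, dropping the ":12/10" suffix.
logged_nid=$(printf '%s\n' "$log_target" |
    sed -E 's/.*@([0-9.]+@[a-z0-9]+):.*/\1/')

# The MGS NID is everything before the first ":" in the fstab device field.
mgs_nid=${fstab_mgs%%:*}

echo "MDT NID the client is trying: $logged_nid ($(nid_net "$logged_nid"))"
echo "MGS NID used at mount time:   $mgs_nid ($(nid_net "$mgs_nid"))"

if [ "$(nid_net "$logged_nid")" != "$(nid_net "$mgs_nid")" ]; then
    echo "Mismatch: the MDT is being addressed on $(nid_net "$logged_nid")," \
         "not $(nid_net "$mgs_nid")."
fi
```

On the live systems, `lctl list_nids` on each server and `lctl ping 10.7.129.130@o2ib0` from the client should confirm which NIDs are actually up and whether the route through tlnet works. Regenerating the configuration logs is normally done with `tunefs.lustre --writeconf` on the MDT and OSTs while everything is unmounted; please check the procedure in the manual for 2.4 before running it.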

Doug

> On Jul 11, 2016, at 7:34 AM, Jessica Otey <jo...@nrao.edu> wrote:
> 
> All,
> I am, as before, working on a small test lustre setup (RHEL 6.8, lustre v. 
> 2.4.3) to prepare for upgrading a 1.8.9 lustre production system to 2.4.3 
> (first the servers and lnet routers, then at a subsequent time, the clients). 
> Lustre servers have IB connections, but the clients are 1G ethernet only.
> 
> For the life of me, I cannot get the client to mount via the router on this 
> test system. (Client will mount fine when router is taken out of the 
> equation.) This is the error I am seeing in the syslog from the mount attempt:
> 
> Jul 11 10:15:37 tlclient kernel: Lustre: 
> 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed 
> out for slow reply: [sent 1468246532/real 1468246532]  req@ffff88032a3f9400 
> x1539566484848752/t0(0) 
> o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 
> e 0 to 1 dl 1468246537 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Jul 11 10:16:07 tlclient kernel: Lustre: 
> 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed 
> out for slow reply: [sent 1468246557/real 1468246557]  req@ffff880629819000 
> x1539566484848764/t0(0) 
> o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 
> e 0 to 1 dl 1468246567 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Jul 11 10:16:37 tlclient kernel: Lustre: 
> 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed 
> out for slow reply: [sent 1468246582/real 1468246582]  req@ffff88062a371000 
> x1539566484848772/t0(0) 
> o38->tlustre-MDT0000-mdc-ffff88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 
> e 0 to 1 dl 1468246597 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Jul 11 10:16:44 tlclient kernel: LustreError: 
> 2511:0:(lov_obd.c:937:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, 
> lovrc=1
> Jul 11 10:16:44 tlclient kernel: Lustre: Unmounted tlustre-client
> Jul 11 10:16:44 tlclient kernel: LustreError: 
> 4881:0:(obd_mount.c:1289:lustre_fill_super()) Unable to mount (-4)
> 
> More than one pair of eyes has looked at the configs and confirmed they look 
> okay. But frankly we've got to be missing something since this should (like 
> lustre on a good day) 'just work'.
> 
> If anyone has seen this issue before and could give some advice, it'd be 
> appreciated. One major question I have is whether the problem is a 
> configuration issue or a procedure issue--perhaps the order in which I am 
> doing things is causing the failure? The order I'm following currently is:
> 
> 1) unmount/remove modules on all boxes
> 2) bring up the lnet modules on the router, and bring up the network
> 3) On the mds: add the modules, bring up the network, mount the mdt
> 4) On the oss: add the modules, bring up the network, mount the ost
> 5) On the client: add the modules, bring up the network, attempt to mount 
> client (fails)
> 
> Configs follow below.
> 
> Thanks in advance,
> Jessica
> 
> tlnet (the router)
> [root@tlnet ~]# cat /etc/modprobe.d/lustre.conf
> # tlnet configuration
> alias ib0 ib_ipoib
> alias net-pf-27 ib_sdp
> options lnet networks="o2ib0(ib0),tcp0(em1)" forwarding="enabled"
> 
> [root@tlnet ~]# ifconfig #lo omitted
> em1       Link encap:Ethernet  HWaddr 78:2B:CB:25:A7:E2
>          inet addr:10.7.29.134  Bcast:10.7.29.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
>          RX packets:453441 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:264313 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:436188202 (415.9 MiB)  TX bytes:22274957 (21.2 MiB)
> ib0       Link encap:InfiniBand  HWaddr 
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>          inet addr:10.7.129.134  Bcast:10.7.129.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:2044 Metric:1
>          RX packets:650 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:34 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:256
>          RX bytes:75376 (73.6 KiB)  TX bytes:2904 (2.8 KiB)
> 
> tlclient (the client)
> [root@tlclient ~]# cat /etc/modprobe.d/lustre.conf
> options lnet networks="tcp0(em1)" routes="o2ib0 10.7.29.134@tcp0" 
> live_router_check_interval=60 dead_router_check_interval=60
> 
> [root@tlclient ~]# ifconfig #lo omitted
> em1       Link encap:Ethernet  HWaddr 00:26:B9:35:B1:1A
>          inet addr:10.7.29.132  Bcast:10.7.29.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
>          RX packets:2817 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:2233 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:354856 (346.5 KiB)  TX bytes:328782 (321.0 KiB)
> 
> [root@tlclient ~]# cat /etc/fstab | grep lustre
> 10.7.129.130@o2ib0:/tlustre    /testlustre    lustre 
> defaults,noauto,user_xattr,flock  0 0
> 
> tlmds/tloss (mdt and oss)
> [root@tloss ~]# cat /etc/modprobe.d/lustre.conf
> alias ib0 ib_ipoib
> alias net-pf-27 ib_sdp
> options lnet networks="o2ib0(ib0)" routes="tcp0 10.7.129.134@o2ib0" 
> live_router_check_interval="60" dead_router_check_interval="60"
> 
> tloss ifconfig
> [root@tloss ~]# ifconfig #lo omitted
> em1       Link encap:Ethernet  HWaddr 78:2B:CB:4A:7A:F8
>          inet addr:10.7.29.131  Bcast:10.7.29.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
>          RX packets:7939328 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:4920595 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:7016088640 (6.5 GiB)  TX bytes:447490407 (426.7 MiB)
> ib0       Link encap:InfiniBand  HWaddr 
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>          inet addr:10.7.129.131  Bcast:10.7.129.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:2044 Metric:1
>          RX packets:484688 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:62465 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:256
>          RX bytes:845062706 (805.9 MiB)  TX bytes:919378780 (876.7 MiB)
> 
> tlmds ifconfig
> [root@tlmds ~]# ifconfig #lo omitted
> em1       Link encap:Ethernet  HWaddr 78:2B:CB:28:1D:00
>          inet addr:10.7.29.130  Bcast:10.7.29.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
>          RX packets:7849519 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:4847566 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:7049031324 (6.5 GiB)  TX bytes:484594569 (462.1 MiB)
> 
> ib0       Link encap:InfiniBand  HWaddr 
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>          inet addr:10.7.129.130  Bcast:10.7.129.255 Mask:255.255.255.0
>          UP BROADCAST RUNNING MULTICAST  MTU:2044 Metric:1
>          RX packets:532171 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:64114 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:256
>          RX bytes:946230130 (902.3 MiB)  TX bytes:821297144 (783.2 MiB)
> 
> -- 
> Jessica Otey
> System Administrator II
> North American ALMA Science Center (NAASC)
> National Radio Astronomy Observatory (NRAO)
> Charlottesville, Virginia (USA)
> 
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
