Hi,

It looks like your client can't reach the server node at 10.52.23.5@o2ib. Start by checking that InfiniBand is working on that server node: do a regular ping from the client node to the server node. Then run an lctl ping to see whether the LNet network is working:

lctl ping 10.52.23.5@o2ib
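
If it helps, the two checks above can be wrapped in a small script. This is only a rough sketch, not something I have run against your setup: the function names (nid_to_ip, timed_out_nids, check_nid) are made up for illustration, and it assumes the NID has the form <IPv4>@<network> and that the dmesg message reads "Timed out tx for <NID>:" as in your output.

```shell
#!/bin/bash
# Sketch: check plain IP reachability, then LNet reachability, for one NID.
# Assumption: NIDs look like <IPv4>@<network>, e.g. 10.52.23.5@o2ib.
# All function names here are hypothetical helpers, not Lustre tools.

# Strip the "@network" suffix to get the IP address part of a NID.
nid_to_ip() {
    printf '%s\n' "${1%%@*}"
}

# Read kernel log text on stdin and print the unique NIDs named in
# "Timed out tx for <NID>:" LNet messages.
timed_out_nids() {
    sed -n 's/.*Timed out tx for \([^: ]*\):.*/\1/p' | sort -u
}

# Run both checks against one NID; stop at the first one that fails.
check_nid() {
    local nid=$1
    local ip
    ip=$(nid_to_ip "$nid")

    echo "== ping $ip"
    ping -c 3 "$ip" || return 1

    echo "== lctl ping $nid"
    lctl ping "$nid" || return 2
}

# Example (as root on the client):
#   for nid in $(dmesg | timed_out_nids); do check_nid "$nid"; done
```

If the regular ping fails, the problem is below LNet (IPoIB or the fabric itself); if only the lctl ping fails, look at the LNet configuration on both ends.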

Check /var/log/messages on all the Lustre server nodes and see if any errors are reported there. A couple of days ago I had a similar issue and was seeing page allocation failures in /var/log/messages on my OSS nodes.

Hope this helps.
-Raj

> On Sep 8, 2018, at 8:33 AM, fırat yılmaz <[email protected]> wrote:
>
> Hi There,
>
> OS=Centos 7.4
> Lustre Version: Intel® Manager for Lustre* software 4.0.3.0
> Interconnect: Mellanox OFED, ConnectX-5
>
> On one of my Lustre clients I get an Input/output error from the df command.
> I am unable to see the Lustre mount point in df, but the mtab file shows that
> Lustre is mounted.
>
> df -h output:
>
> df: ‘/home’: Input/output error
> df: ‘/vol1’: Input/output error
> df: ‘/cm/shared’: Input/output error
> Filesystem Size Used Avail Use% Mounted on
>
> cat /etc/mtab |grep lustre
>
> 10.51.22.11@o2ib:10.51.22.10@o2ib:/lustre/home /home lustre rw,flock,lazystatfs 0 0
> 10.51.22.11@o2ib:10.51.22.10@o2ib:/lustre /vol1 lustre rw,flock,lazystatfs 0 0
> 10.51.22.11@o2ib:10.51.22.10@o2ib:/lustre/cmshared /cm/shared lustre rw,flock,lazystatfs 0 0
>
> When I cd to the mount point I can reach the Lustre filesystem, and I can
> create and delete files and folders. But when I cd to a large file and run
> the ls -lah command, the response from the Lustre client freezes.
>
> dmesg output:
> [84276.460557] Lustre: 5617:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1536408434/real 1536408489] req@ffff882f31697800 x1610952588839712/t0(0) o8->[email protected]@o2ib:28/4 lens 520/544 e 0 to 1 dl 1536408714 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
> [84276.460565] Lustre: 5617:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 910 previous similar messages
> [84386.986467] LustreError: 122750:0:(llite_lib.c:1772:ll_statfs_internal()) obd_statfs fails: rc = -5
> [84386.986471] LustreError: 122750:0:(llite_lib.c:1772:ll_statfs_internal()) Skipped 29 previous similar messages
> [84704.429967] LNet: 5429:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.52.23.5@o2ib: 4379575 seconds
> [84704.429970] LNet: 5429:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 863 previous similar messages
> [84881.004949] Lustre: 5617:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1536409034/real 1536409095] req@ffff882f2a6e5700 x1610952588854608/t0(0) o8->[email protected]@o2ib:28/4 lens 520/544 e 0 to 1 dl 1536409314 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
> [84881.004957] Lustre: 5617:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 863 previous similar messages
> [85065.953686] LustreError: 123635:0:(llite_lib.c:1772:ll_statfs_internal()) obd_statfs fails: rc = -5
> [85065.953689] LustreError: 123635:0:(llite_lib.c:1772:ll_statfs_internal()) Skipped 26 previous similar messages
>
> fstab mount options:
> lustre flock,_netdev,x-systemd.requires=lnet.service 0 0
>
> ib_* benchmark tests are as usual.
>
> Where should I check?
>
> Best Regards.
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
