Hi, we are having a problem with NFS over RDMA on our FDR10 InfiniBand network. I previously wrote to the NFS mailing list about this, so you may find our discussion there. I have taken some load off the server by converting the backups, which had been using NFS, to SSH, but we are still having critical problems with NFS clients losing their connection to the server, causing the clients to hang until they are rebooted. I wanted to check in here before filing a bug with CentOS.

Our setup is a cluster with one head node (the NFS server) and 9 compute nodes (NFS clients). All the machines are running CentOS 6.9 (kernel 2.6.32-696.30.1.el6.x86_64) and using the "inbox" CentOS RDMA implementation/drivers (not Mellanox OFED). (We also have other NFS clients, but they use 1GbE for their NFS connection and, while they will still hang with messages like "NFS server not responding, retrying" or "timed out", they eventually recover and don't need a reboot.)
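For reference, these are the commands I run to check the HCA and link state on the server and nodes (I haven't pasted the output here, but everything reports Active/LinkUp at FDR10 rates):

```
# Show HCA state, port state, and link rate (inbox drivers, mlx4_0)
ibstat

# More detail on the device and its ports
ibv_devinfo -d mlx4_0
```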

On the server (which is named pac) I will see messages like this:
Jul 30 18:19:38 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jul 30 18:19:38 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:03:05 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:09:06 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:16:09 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:23:31 pac kernel: svcrdma: Error -107 posting RDMA_READ
Jul 31 15:53:55 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 16:09:19 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jul 31 16:09:19 pac kernel: svcrdma: failed to send reply chunks, rc=-5

Previously I had also seen messages like "Jul 11 21:09:56 pac kernel: nfsd: peername failed (err 107)!", but I have not seen that during this latest hang.

And on the clients (named n001-n009) I will see messages like this:
Jul 30 18:17:26 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810674024c0 (stale): WR flushed
Jul 30 18:17:26 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff88106638a640 (stale): WR flushed
Jul 30 18:19:26 n001 kernel: nfs: server 10.10.11.100 not responding, still trying
Jul 30 18:19:36 n001 kernel: nfs: server 10.10.10.100 not responding, timed out
Jul 30 18:19:38 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 30 18:19:38 n001 kernel: nfs: server 10.10.11.100 OK
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810671f02c0 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677bda40 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677bd940 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810671f0240 (stale): WR flushed
Jul 31 14:43:35 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881065133140 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810666e3f00 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881063ea0dc0 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677bdb40 (stale): WR flushed
Jul 31 15:03:05 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881060e59d40 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810677efac0 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff88106638a640 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810671f03c0 (stale): WR flushed
Jul 31 15:09:06 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:16:09 n001 kernel: rpcrdma: connection to 10.10.11.100:20049 closed (-103)
Jul 31 15:53:32 n001 kernel: nfs: server 10.10.10.100 not responding, timed out
Jul 31 16:08:56 n001 kernel: nfs: server 10.10.10.100 not responding, timed out

Jul 30 18:17:26 n002 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff881064461500 (stale): WR flushed
Jul 30 18:17:26 n002 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff8810604b2600 (stale): WR flushed
Jul 30 18:19:26 n002 kernel: nfs: server 10.10.11.100 not responding, still trying
Jul 30 18:19:38 n002 kernel: rpcrdma: connection to 10.10.11.100:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jul 30 18:19:38 n002 kernel: nfs: server 10.10.11.100 OK
Jul 31 14:43:35 n002 kernel: rpcrdma: connection to 10.10.11.100:20049 closed (-103)
Jul 31 16:08:56 n002 kernel: nfs: server 10.10.10.100 not responding, timed out

Similar messages show up on the other clients (n003-n009). After these messages appear, the clients' load climbs continuously (visible through Ganglia), presumably because processes are blocked waiting for the NFS mounts to reappear. The clients are no longer reachable over SSH, and root cannot log in on the console via the IPMI web applet either (it just hangs after entering the password; a prompt may eventually appear, but the system load is too high to do anything), so they have to be rebooted through the IPMI interface.

Here is /etc/fstab on the server,
UUID=f15df051-ffb8-408c-8ad2-1987b6f082a2       /       ext3    defaults        0 1
UUID=c854ee27-32cf-445d-8308-4e6f1a87d364       /boot   ext3    defaults        0 2
UUID=b92a100f-2521-408b-9b15-93671c6ae056       swap    swap    defaults        0 0
UUID=a8a7b737-25ed-43a7-ae4b-391c71aa8c08       /data   xfs     defaults        0 2
UUID=d5692ec2-d5dc-4bb8-98d4-a4fb2ff54748       /projects xfs   defaults        0 2
/dev/drbd0                                      /newwing xfs    noauto  0 0
UUID=a305f309-d997-43ec-8e4f-78e26b07652f       /working xfs    defaults        0 2
tmpfs   /dev/shm        tmpfs   defaults        0 0
devpts  /dev/pts        devpts  gid=5,mode=620  0 0
sysfs   /sys            sysfs   defaults        0 0
proc    /proc           proc    defaults        0 0

I read that adding "inode64,nobarrier" to the XFS mount options may help? That is something I can try the next time the server can be rebooted.
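If it helps to see it concretely, this is the change I'm considering for the /working entry in /etc/fstab (not yet applied; same idea for the other XFS filesystems):

```
# current entry
UUID=a305f309-d997-43ec-8e4f-78e26b07652f       /working xfs   defaults        0 2
# proposed entry with the suggested options
UUID=a305f309-d997-43ec-8e4f-78e26b07652f       /working xfs   inode64,nobarrier       0 2
```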

Here are the current mounts on the server,
/dev/sda3 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda1 on /boot type ext3 (rw)
/dev/sdc1 on /data type xfs (rw)
/dev/sdb1 on /projects type xfs (rw)
/dev/sde1 on /working type xfs (rw,nobarrier)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/drbd0 on /newwing type xfs (rw)

Here is /etc/exports on the server,
/data    10.10.10.0/24(rw,no_root_squash,async)
/data    10.10.11.0/24(rw,no_root_squash,async)
/data    150.x.x.192/27(rw,no_root_squash,async)
/data    150.x.x.64/26(rw,no_root_squash,async)
/home    10.10.10.0/24(rw,no_root_squash,async)
/home    10.10.11.0/24(rw,no_root_squash,async)
/opt    10.10.10.0/24(rw,no_root_squash,async)
/opt    10.10.11.0/24(rw,no_root_squash,async)
/projects    10.10.10.0/24(rw,no_root_squash,async)
/projects    10.10.11.0/24(rw,no_root_squash,async)
/projects    150.x.x.192/27(rw,no_root_squash,async)
/projects    150.x.x.64/26(rw,no_root_squash,async)
/tools    10.10.10.0/24(rw,no_root_squash,async)
/tools    10.10.11.0/24(rw,no_root_squash,async)
/usr/share/gridengine     10.10.10.10/24(rw,no_root_squash,async)
/usr/share/gridengine     10.10.11.10/24(rw,no_root_squash,async)
/usr/local    10.10.10.10/24(rw,no_root_squash,async)
/usr/local    10.10.11.10/24(rw,no_root_squash,async)
/working    10.10.10.0/24(rw,no_root_squash,async)
/working    10.10.11.0/24(rw,no_root_squash,async)
/working    150.x.x.192/27(rw,no_root_squash,async)
/working    150.x.x.64/26(rw,no_root_squash,async)
/newwing    10.10.10.0/24(rw,no_root_squash,async)
/newwing    10.10.11.0/24(rw,no_root_squash,async)
/newwing    150.x.x.192/27(rw,no_root_squash,async)
/newwing    150.x.x.64/26(rw,no_root_squash,async)

The 10.10.10.0/24 network is 1GbE and 10.10.11.0/24 is the InfiniBand; the other networks are also 1GbE. Our cluster nodes normally mount all of these over InfiniBand with RDMA. The computation jobs mostly use /working, which sees the most reading/writing, but /newwing, /projects, and /data are also used.

Here is the /etc/fstab from the nodes,
#NFS/RDMA
#10.10.11.100:/opt                      /opt                    nfs     rdma,port=20049 0 0
#10.10.11.100:/data                     /data                   nfs     rdma,port=20049 0 0
#10.10.11.100:/tools                    /tools                  nfs     rdma,port=20049 0 0
#10.10.11.100:/home                     /home                   nfs     rdma,port=20049 0 0
#10.10.11.100:/usr/local                /usr/local              nfs     rdma,port=20049 0 0
#10.10.11.100:/usr/share/gridengine     /usr/share/gridengine   nfs     rdma,port=20049 0 0
#10.10.11.100:/projects                 /projects               nfs     rdma,port=20049 0 0
#10.10.11.100:/working                  /working                nfs     rdma,port=20049 0 0
#10.10.11.100:/newwing                  /newwing                nfs     rdma,port=20049 0 0

#NFS/IPoIB
10.10.11.100:/opt                       /opt                    nfs     tcp     0 0
10.10.11.100:/data                      /data                   nfs     tcp     0 0
10.10.11.100:/tools                     /tools                  nfs     tcp     0 0
10.10.11.100:/home                      /home                   nfs     tcp     0 0
10.10.11.100:/usr/local                 /usr/local              nfs     tcp     0 0
10.10.11.100:/usr/share/gridengine      /usr/share/gridengine   nfs     tcp     0 0
10.10.11.100:/projects                  /projects               nfs     tcp     0 0
10.10.11.100:/working                   /working                nfs     tcp     0 0
10.10.11.100:/newwing                   /newwing                nfs     tcp     0 0

#NFS/TCP
#10.10.10.100:/opt                      /opt                    nfs     defaults        0 0
#10.10.10.100:/data                     /data                   nfs     defaults        0 0
#10.10.10.100:/tools                    /tools                  nfs     defaults        0 0
#10.10.10.100:/home                     /home                   nfs     defaults        0 0
#10.10.10.100:/usr/local                /usr/local              nfs     defaults        0 0
#10.10.10.100:/usr/share/gridengine     /usr/share/gridengine   nfs     defaults        0 0
#10.10.10.100:/projects                 /projects               nfs     defaults        0 0
#10.10.10.100:/working                  /working                nfs     defaults        0 0
#10.10.10.100:/newwing                  /newwing                nfs     defaults        0 0

With this file I can switch between the different interfaces/protocols for the NFS mounts; currently we are trying IPoIB. We haven't started a cluster job yet, so I'm not sure how it will perform. With NFS/TCP over 1GbE the server and nodes would still hang from time to time, but at least nothing crashed; of course it was slow, being limited by 1GbE.
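In case it matters, the procedure I use to switch a node from one mount set to another is roughly this (as root, with no jobs running on the node):

```
# unmount all NFS mounts on the node
umount -a -t nfs

# edit /etc/fstab to comment out the old section and
# uncomment the desired one (RDMA, IPoIB, or TCP), then
# remount everything listed in fstab
mount -a -t nfs
```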

We didn't have this problem until recently, when I upgraded the cluster to add the two additional nodes (n008 and n009) and we also added more storage to the server (/newwing and /working). The new nodes are on the AMD EPYC platform, whereas the server and nodes n001-n007 are on the Intel Xeon platform; I'm not sure whether that could cause such a crash. The new nodes were cloned from n001, with only the kernel command line and network parameters changed.

The jobs are submitted to the cluster via Sun Grid Engine, and in total about 61 jobs may start at once and open connections to the NFS server... it sounds like a system overload, although the load on the server normally remains low, under 10%; even as it hangs, the load may only increase to 80%. The server is a few years old but still has 2x 6-core Intel Xeon E5-2620 v2 @ 2.10GHz and 128GB of RAM.

I would appreciate your assistance troubleshooting this critical problem and, if needed, gathering the required information to submit a bug to the tracker!

Thanks,
--
Chandler
Arizona Genomics Institute
www.genome.arizona.edu
_______________________________________________
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos