Hi,

Yes, I do see load on the client side, but since the client has a 40 Gb/s NIC and the load comes from a 10 Gb/s WAN link, I wouldn't expect it to saturate the network. I can correlate the messages with WAN traffic above 6 Gb/s, which is far from the limit of the NIC. The client has a latest-generation Xeon processor, so I wouldn't expect that to be the bottleneck either.

David

On Mon, Dec 23, 2019 at 5:09 PM Degremont, Aurelien <[email protected]> wrote:

> Hi
>
> These messages mean the client thinks it has lost communication with
> the server and reconnects. The server only sees the reconnection and never
> thought the client was gone.
>
> It could be related to lots of things. The server could be receiving RPCs
> from this client but not processing them fast enough. Are there other errors
> on your server? Is there any high load?
>
> Same on your clients? Is there any high load that could prevent your
> client from communicating with your server properly?
>
> Do you correlate that with some specific load running on your clients?
>
> Aurélien
>
> *From: *lustre-discuss <[email protected]> on behalf of
> David Cohen <[email protected]>
> *Date: *Sunday, December 22, 2019 at 17:08
> *To: *"[email protected]" <[email protected]>
> *Subject: *[lustre-discuss] frequent Connection lost, Connection restored
> to mdt
>
> Hi,
>
> We are running 2.10.5 on the servers and 2.10.8 on the clients.
>
> Every few minutes, we see:
>
> On client side:
>
> Dec 22 15:26:34 gftp kernel: Lustre: 439834:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1577021187/real 1577021187] req@ffff88160be9c6c0 x1653620348981536/t0(0) o36->[email protected]@tcp:12/10 lens 608/4768 e 0 to 1 dl 1577021194 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
> Dec 22 15:26:34 gftp kernel: Lustre: 439834:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
> Dec 22 15:26:34 gftp kernel: Lustre: lustre-MDT0000-mdc-ffff8817d9776c00: Connection to lustre-MDT0000 (at 10.0.0.1@tcp) was lost; in progress operations using this service will wait for recovery to complete
> Dec 22 15:26:34 gftp kernel: Lustre: Skipped 3 previous similar messages
> Dec 22 15:26:34 gftp kernel: Lustre: lustre-MDT0000-mdc-ffff8817d9776c00: Connection restored to 10.0.0.1@tcp (at 192.114.101.153@tcp)
> Dec 22 15:26:34 gftp kernel: Lustre: Skipped 3 previous similar messages
>
> On server side:
>
> Dec 22 15:26:34 oss03 kernel: Lustre: lustre-MDT0000: Client 38d6eef1-e146-be41-bab9-409b272d0d4f (at 10.0.0.10@tcp) reconnecting
> Dec 22 15:26:34 oss03 kernel: Lustre: lustre-MDT0000: Connection restored to ec2cdfce-353f-583a-c970-fde3f5d5189c (at 10.0.0.10@tcp)

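The client-side "Request sent has timed out" message quoted above carries the send time and the deadline of the expired RPC, so the time the client allowed before giving up can be read straight from the log. The following is a minimal sketch and not something posted in the thread: the function name `parse_expired_request` and the regular expression are hypothetical, it assumes the wrapped log entry has been rejoined onto a single line, and it assumes the `dl` field is the request deadline in epoch seconds.

```python
import re

# The expired-request line from the thread, rejoined onto one line.
SAMPLE = (
    "Dec 22 15:26:34 gftp kernel: Lustre: "
    "439834:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has "
    "timed out for slow reply: [sent 1577021187/real 1577021187] "
    "req@ffff88160be9c6c0 x1653620348981536/t0(0) "
    "o36->[email protected]@tcp:12/10 lens 608/4768 "
    "e 0 to 1 dl 1577021194 ref 2 fl Rpc:X/0/ffffffff rc 0/-1"
)

# Hypothetical pattern: pulls out the send time, the opcode after "o",
# and the value after "dl" (assumed here to be the deadline).
EXPIRE_RE = re.compile(
    r"\[sent (?P<sent>\d+)/real (?P<real>\d+)\].*?"
    r"\bo(?P<opcode>\d+)->.*?"
    r"\bdl (?P<deadline>\d+)"
)


def parse_expired_request(line):
    """Return send time, deadline, opcode and allowed seconds, or None."""
    match = EXPIRE_RE.search(line)
    if match is None:
        return None
    sent = int(match.group("sent"))
    deadline = int(match.group("deadline"))
    return {
        "opcode": int(match.group("opcode")),
        "sent": sent,
        "deadline": deadline,
        # Seconds between sending the request and its deadline.
        "allowed_seconds": deadline - sent,
    }


if __name__ == "__main__":
    print(parse_expired_request(SAMPLE))
```

For the line in this thread the difference between `dl` and `sent` is 7 seconds. Running the same parser over the whole client syslog and comparing the timestamps against WAN throughput is one way to check the correlation with load described in the reply.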
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
