Re: 8.0-RC1 NFS client timeout issue

Daniel Braniss Sat, 31 Oct 2009 02:18:53 -0700

> 
> First off, I know that cross posting is evil, but I wanted to try
> and make sure developers saw it.
> 
> On Tue, 27 Oct 2009, Olaf Seibert wrote:
> 
> > I see an annoying behaviour with NFS over TCP. It happens both with nfs
> > and newnfs. This is with FreeBSD/amd64 8.0-RC1 as client. The server is
> > some Linux or perhaps Solaris, I'm not entirely sure.
> >
> > After trying to find something in packet traces, I think I have found
> > something.
> >
> > The scenario seems to be as follows. Sorry for the width of the lines.
> >
> >
> > No.     Time        Source                Destination           Protocol 
> > Info
> >   2296 2992.216855 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 
> > LOOKUP Call (Reply In 2297), DH:0x3819da36/w
> >   2297 2992.217107 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 
> > LOOKUP Reply (Call In 2296) Error:NFS3ERR_NOENT
> >   2298 2992.217141 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 
> > LOOKUP Call (Reply In 2299), DH:0x170cb16a/bin
> >   2299 2992.217334 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 
> > LOOKUP Reply (Call In 2298), FH:0x61b8eb12
> >   2300 2992.217361 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 
> > ACCESS Call (Reply In 2301), FH:0x61b8eb12
> >   2301 2992.217582 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 
> > ACCESS Reply (Call In 2300)
> >   2302 2992.217605 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 
> > LOOKUP Call (Reply In 2303), DH:0x61b8eb12/w
> >   2303 2992.217860 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 
> > LOOKUP Reply (Call In 2302) Error:NFS3ERR_NOENT
> >   2304 2992.318770 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 
> > > nfs [ACK] Seq=238293 Ack=230289 Win=8192 Len=0 TSV=86492342 TSER=12393434
> >   2306 3011.537520 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 
> > GETATTR Reply (Call In 2305)  Directory mode:2755 uid:4100 gid:4100
> >   2307 3011.637744 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 
> > > nfs [ACK] Seq=238429 Ack=230405 Win=8192 Len=0 TSV=86511662 TSER=12395366
> >   2308 3371.534980 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs 
> > > 934 [FIN, ACK] Seq=230405 Ack=238429 Win=49232 Len=0 TSV=12431366 
> > TSER=86511662
> >
> >     The server decides, for whatever reason, to terminate the
> >     connection and sends a FIN.
> >
> >   2309 3371.535018 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 
> > > nfs [ACK] Seq=238429 Ack=230406 Win=8192 Len=0 TSV=86871578 TSER=12431366
> >
> >     Client acknowledges this,
> >
> >   2310 3375.379693 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 
> > ACCESS Call, FH:0x008002a2
> >
> >     but tries to sneak in another call anyway.      [A]
> >
> Probably not the best behaviour, but I think it is technically allowed by 
> TCP. (My TCP is very rusty, but I think the socket should be in
> TCPS_CLOSE_WAIT at this point and the BSD code will have called
> socantrcvmore(), but not socantsndmore().)
> 
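
The half-close reasoning above (the receive side is shut after the peer's FIN, the send side is not) can be checked from userland. This is only a rough sketch under stated assumptions: it uses an AF_UNIX socketpair rather than a real TCP connection to an NFS server, and the function name send_after_peer_fin() is made up for illustration. It shows the socket-layer behaviour only, not what the server in the trace does.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the side that has already seen
 * EOF from its peer can still send data, 0 if it cannot, and -1 if
 * the socket pair could not be set up.
 */
int
send_after_peer_fin(void)
{
	int sv[2];
	char buf[1];
	int result = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);

	/* "Server" side closes its sending direction (the FIN in [2308]). */
	if (shutdown(sv[0], SHUT_WR) == -1) {
		result = -1;
		goto out;
	}

	/* "Client" side sees end-of-file: read() returns 0. */
	if (read(sv[1], buf, sizeof(buf)) != 0)
		goto out;

	/* The client can still send, which is what happens at [A]. */
	if (write(sv[1], "x", 1) != 1)
		goto out;

	/* The server's receive direction is still open, so data arrives. */
	if (read(sv[0], buf, sizeof(buf)) == 1 && buf[0] == 'x')
		result = 1;
out:
	close(sv[0]);
	close(sv[1]);
	return (result);
}
```

If send_after_peer_fin() returns 1, the write after EOF was accepted, so the socket layer alone does not stop a new call from going out on a connection the peer has started to close.
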
> >   2311 3375.474788 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs 
> > > 934 [ACK] Seq=230406 Ack=238569 Win=49232 Len=0 TSV=12431760 TSER=86875423
> >
> >     Server ACKs but doesn't send anything else... [B]
> >
> >     Time passes...
> >
> This is where it seems interesting. It looks to me like the socket upcall
> for receiving the FIN would have happened before this point, setting the
> ct_error.re_status to RPC_CANTRECV, but the code in clnt_vc_call() doesn't
> check for this. (It does check for it happening during and after the
> sosend(), but not before it, from what I can see.)
> 
> >
> > [B] would be a bug of the server in my opinion. If it ACKs a call, it
> > should send a reply. And if it can't, it shouldn't.
> >
> I'll leave this one for the TCP wizards. I'm not sure what the
> correct behaviour is when data is received on a connection. (I think
> it is waiting for a FIN from the client side at this point.)
> 
> If you could try the following patch and see if it helps, that would be
> appreciated, rick
> ps: I'll try to reproduce the situation here, but I'm not sure if I can.
> --- rpc/clnt_vc.c.sav 2009-10-28 15:44:20.000000000 -0400
> +++ rpc/clnt_vc.c     2009-10-28 15:49:57.000000000 -0400
> @@ -413,6 +413,19 @@
> 
>       cr->cr_xid = xid;
>       mtx_lock(&ct->ct_lock);
> +     /*
> +      * Check to see if the other end has already started to close down
> +      * the connection. If it happens after this point, it will be
> +      * detected below, when cr->cr_error is checked.
> +      */
> +     if (ct->ct_error.re_status == RPC_CANTRECV) {
> +             if (errp != &ct->ct_error) {
> +                     errp->re_errno = ct->ct_error.re_errno;
> +                     errp->re_status = RPC_CANTRECV;
> +             }
> +             stat = RPC_CANTRECV;
> +             goto out;
> +     }
>       TAILQ_INSERT_TAIL(&ct->ct_pending, cr, cr_link);
>       mtx_unlock(&ct->ct_lock);
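
The idea in the patch (check under the lock for a recorded receive error before queueing a new request) can be modelled in userland roughly as below. This is a sketch, not the kernel code: struct conn, struct request, ST_OK/ST_CANTRECV and the conn_*() functions are hypothetical stand-ins for ct_data, ct_request, the RPC status codes and clnt_vc_call(), and a pthread mutex stands in for ct_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-ins for RPC_SUCCESS and RPC_CANTRECV. */
enum status { ST_OK = 0, ST_CANTRECV = 1 };

struct request {
	unsigned int xid;
	TAILQ_ENTRY(request) link;
};

struct conn {
	pthread_mutex_t lock;		/* plays the role of ct_lock */
	enum status error;		/* plays the role of ct_error.re_status */
	TAILQ_HEAD(, request) pending;	/* plays the role of ct_pending */
};

void
conn_init(struct conn *c)
{
	pthread_mutex_init(&c->lock, NULL);
	c->error = ST_OK;
	TAILQ_INIT(&c->pending);
}

/* What the receive upcall would do when the peer's FIN is seen. */
void
conn_peer_closed(struct conn *c)
{
	pthread_mutex_lock(&c->lock);
	c->error = ST_CANTRECV;
	pthread_mutex_unlock(&c->lock);
}

/* Refuse to queue a request if the peer has already started to close. */
enum status
conn_enqueue(struct conn *c, struct request *r)
{
	pthread_mutex_lock(&c->lock);
	if (c->error == ST_CANTRECV) {
		pthread_mutex_unlock(&c->lock);
		return (ST_CANTRECV);
	}
	TAILQ_INSERT_TAIL(&c->pending, r, link);
	pthread_mutex_unlock(&c->lock);
	return (ST_OK);
}

/* Count queued requests, for inspection only. */
size_t
conn_pending_count(struct conn *c)
{
	struct request *r;
	size_t n = 0;

	pthread_mutex_lock(&c->lock);
	TAILQ_FOREACH(r, &c->pending, link)
		n++;
	pthread_mutex_unlock(&c->lock);
	return (n);
}
```

Because the check and the queue insert happen under the same lock, a close recorded before the check is always seen, and a close recorded afterwards has to be caught by the later per-request error check, as the comment in the patch says.
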


I did a make buildworld -j8 with a NetApp and a FreeBSD machine as
NFS servers, and all seems OK.

danny

