Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
J. Bruce Fields wrote:
On Wed, Sep 10, 2008 at 09:40:15AM +0100, Ian Campbell wrote:
On Mon, 2008-09-01 at 12:20 -0500, Tom Tucker wrote:
Ian Campbell wrote:
On Sun, 2008-08-31 at 14:51 -0500, Tom Tucker wrote: Looks like you ran this on the client. Sorry, Ian, I should have been more specific. You need to modify the rpc_debug file on the server.
I tried this on the server but it's pretty crippling (the server is quite weedy, a 300MHz K3 or something). I'll leave it logging overnight since things should be pretty quiet (and that's often when the problem occurs), but if there's a less aggressive setting than 256 that would still be useful I could leave it on permanently until the problem happens.
Thanks Ian. Unfortunately, that's as fine-grained as it gets. 256 (0x100) is the bit for transport logging.
Still no repro with the debugging enabled, I'm afraid. I'm wondering if the performance hit makes it unlikely to trigger since everything slows to a crawl. I haven't been able to leave it on 24/7 since it makes all the NFS clients nearly unusable, but I've had it on overnight and when there's no one around etc., which is when the hang generally seems to occur.
OK, thanks for the attempt. We should figure out how to tone down those dprintk's. I'll try to take a closer look at this, but I'm travelling starting tomorrow, and realistically may not find time to look at this for a couple of weeks. If someone else gets to it before me, great. --b.
I forgot to re-enable it one evening and, perhaps inevitably, it crashed overnight. I turned the debugging on and captured the following. It's not much, since syslog had given up for some reason, probably when the disk filled up with logs. 192.168.1.5 is a non-problematic client, so I don't think the log will be useful since the problematic client (192.168.1.4) isn't mentioned. Ian.
Ian, sadly yes. There's a thread stuck holding the BUSY bit, or a thread failed to clear the bit properly (maybe an error path).
Data continues to arrive, but the transport never gets put back on the queue because its BUSY bit is set. In other words, this is a different error from the one we've been chasing. If I sent you a patch, could you rebuild the kernel and give it a whirl? Also, can you give me a kernel.org-relative commit-id or tag for the kernel that you're using? Thanks, Tom
svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
[4009900.094587] svc: TCP record, 140 bytes
[4009900.098604] svc: socket ee5fd600 recvfrom(e996b08c, 3956) = 140
[4009900.102587] svc: TCP complete record (140 bytes)
[4009900.106592] svc: transport ee5fd600 put into queue
[4009900.110613] svc: got len=140
[4009900.114645] svc: transport ee5fd600 busy, not enqueued
[4009900.118752] svc: socket ee5fd600 sendto([dfbb 232... ], 232) = 232 (addr 192.168.1.5, port=910)
[4009900.122636] svc: socket c1888960 TCP data ready (svsk ee5fd600)
[4009900.126588] svc: transport ee5fd600 busy, not enqueued
[4009900.130601] svc: transport ee5fd600 busy, not enqueued
[4009900.134593] svc: server eefc4000 waiting for data (to = 90)
[4009900.162711] svc: transport ee5fd600 dequeued, inuse=1
[4009900.166598] svc: server eef4a000, pool 0, transport ee5fd600, inuse=2
[4009900.170594] svc: tcp_recv ee5fd600 data 1 conn 0 close 0
[4009900.174611] svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
[4009900.178592] svc: TCP record, 136 bytes
[4009900.182606] svc: socket ee5fd600 recvfrom(e975a088, 3960) = 136
[4009900.186592] svc: TCP complete record (136 bytes)
[4009900.190599] svc: transport ee5fd600 put into queue
[4009900.194605] svc: got len=136
[4009900.198632] svc: transport ee5fd600 busy, not enqueued
[4009900.202717] svc: socket ee5fd600 sendto([d6464000 124... ], 124) = 124 (addr 192.168.1.5, port=910)
[4009900.206664] svc: socket c1888960 TCP data ready (svsk ee5fd600)
[4009900.210594] svc: transport ee5fd600 busy, not enqueued
[4009900.214611] svc: transport ee5fd600 busy, not enqueued
[4009900.218599] svc: server eef4a000 waiting for data (to = 90)
[4009900.254640] svc: transport ee5fd600 dequeued, inuse=1
[4009900.258603] svc: server ee978000, pool 0, transport ee5fd600, inuse=2
[4009900.262601] svc: tcp_recv ee5fd600 data 1 conn 0 close 0
[4009900.266625] svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
[4009900.270598] svc: TCP record, 128 bytes
[4009900.274618] svc: socket ee5fd600 recvfrom(dd54d080, 3968) = 128
[4009900.278598] svc: TCP complete record (128 bytes)
[4009900.282605] svc: transport ee5fd600 put into queue
[4009900.286611] svc: got len=128
[4009900.290642] svc: transport ee5fd600 busy, not enqueued
[4009900.294788] svc: socket ee5fd600 sendto([efb5c000 116... ], 116) = 116 (addr 192.168.1.5, port=910)
[4009900.298707] svc: socket c1888960 TCP data ready (svsk ee5fd600)
[4009900.302600] svc: transport ee5fd600 busy, not
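The "busy, not enqueued" pattern in the log can be modeled in a few lines. Below is a toy Python model of the XPT_BUSY handshake Tom describes: a data-ready callback only enqueues the transport if no thread owns it, and the owning thread is responsible for requeueing pending work when it releases the bit. The names and structure are illustrative, not the kernel's actual svc_xprt code; the point is that if a thread never clears BUSY (e.g. an error path), later data-ready events are dropped forever, which matches the stuck behaviour in the trace.

```python
import queue

class Transport:
    """Toy stand-in for an svc transport with a BUSY bit (kernel's XPT_BUSY)."""
    def __init__(self):
        self.busy = False    # a thread currently owns this transport
        self.pending = 0     # records received but not yet processed

ready = queue.Queue()        # transports with work waiting for a server thread

def data_ready(xprt):
    """Socket callback: enqueue the transport unless a thread owns it."""
    xprt.pending += 1
    if xprt.busy:
        return False         # "svc: transport busy, not enqueued"
    xprt.busy = True
    ready.put(xprt)
    return True

def serve_one(xprt):
    """Worker: process one record, then release BUSY and requeue if needed."""
    xprt.pending -= 1
    # Release path: clear BUSY, then re-check for data that arrived while busy.
    xprt.busy = False
    if xprt.pending > 0:
        xprt.busy = True
        ready.put(xprt)      # without this requeue, pending data is stranded

# Demo: second record arrives while BUSY and relies on the release path.
xprt = Transport()
assert data_ready(xprt) is True    # first record: transport enqueued
assert data_ready(xprt) is False   # arrives while busy: not enqueued
serve_one(ready.get())             # release path requeues the pending record
assert not ready.empty()           # second record will still be served
```

If `serve_one` were skipped (a thread stuck, or an error path that never clears the bit), every subsequent `data_ready` would return False and the transport would never be served again, exactly the failure mode Tom suspects.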
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Mon, 2008-09-01 at 12:20 -0500, Tom Tucker wrote:
Ian Campbell wrote:
On Sun, 2008-08-31 at 14:51 -0500, Tom Tucker wrote: Looks like you ran this on the client. Sorry, Ian, I should have been more specific. You need to modify the rpc_debug file on the server.
I tried this on the server but it's pretty crippling (the server is quite weedy, a 300MHz K3 or something). I'll leave it logging overnight since things should be pretty quiet (and that's often when the problem occurs), but if there's a less aggressive setting than 256 that would still be useful I could leave it on permanently until the problem happens.
Thanks Ian. Unfortunately, that's as fine-grained as it gets. 256 (0x100) is the bit for transport logging.
Still no repro with the debugging enabled, I'm afraid. I'm wondering if the performance hit makes it unlikely to trigger since everything slows to a crawl. I haven't been able to leave it on 24/7 since it makes all the NFS clients nearly unusable, but I've had it on overnight and when there's no one around etc., which is when the hang generally seems to occur.
I forgot to re-enable it one evening and, perhaps inevitably, it crashed overnight. I turned the debugging on and captured the following. It's not much, since syslog had given up for some reason, probably when the disk filled up with logs. 192.168.1.5 is a non-problematic client, so I don't think the log will be useful since the problematic client (192.168.1.4) isn't mentioned. Ian.
svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
[4009900.094587] svc: TCP record, 140 bytes
[4009900.098604] svc: socket ee5fd600 recvfrom(e996b08c, 3956) = 140
[4009900.102587] svc: TCP complete record (140 bytes)
[4009900.106592] svc: transport ee5fd600 put into queue
[4009900.110613] svc: got len=140
[4009900.114645] svc: transport ee5fd600 busy, not enqueued
[4009900.118752] svc: socket ee5fd600 sendto([dfbb 232... ], 232) = 232 (addr 192.168.1.5, port=910)
[4009900.122636] svc: socket c1888960 TCP data ready (svsk ee5fd600)
[4009900.126588] svc: transport ee5fd600 busy, not enqueued
[4009900.130601] svc: transport ee5fd600 busy, not enqueued
[4009900.134593] svc: server eefc4000 waiting for data (to = 90)
[4009900.162711] svc: transport ee5fd600 dequeued, inuse=1
[4009900.166598] svc: server eef4a000, pool 0, transport ee5fd600, inuse=2
[4009900.170594] svc: tcp_recv ee5fd600 data 1 conn 0 close 0
[4009900.174611] svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
[4009900.178592] svc: TCP record, 136 bytes
[4009900.182606] svc: socket ee5fd600 recvfrom(e975a088, 3960) = 136
[4009900.186592] svc: TCP complete record (136 bytes)
[4009900.190599] svc: transport ee5fd600 put into queue
[4009900.194605] svc: got len=136
[4009900.198632] svc: transport ee5fd600 busy, not enqueued
[4009900.202717] svc: socket ee5fd600 sendto([d6464000 124... ], 124) = 124 (addr 192.168.1.5, port=910)
[4009900.206664] svc: socket c1888960 TCP data ready (svsk ee5fd600)
[4009900.210594] svc: transport ee5fd600 busy, not enqueued
[4009900.214611] svc: transport ee5fd600 busy, not enqueued
[4009900.218599] svc: server eef4a000 waiting for data (to = 90)
[4009900.254640] svc: transport ee5fd600 dequeued, inuse=1
[4009900.258603] svc: server ee978000, pool 0, transport ee5fd600, inuse=2
[4009900.262601] svc: tcp_recv ee5fd600 data 1 conn 0 close 0
[4009900.266625] svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
[4009900.270598] svc: TCP record, 128 bytes
[4009900.274618] svc: socket ee5fd600 recvfrom(dd54d080, 3968) = 128
[4009900.278598] svc: TCP complete record (128 bytes)
[4009900.282605] svc: transport ee5fd600 put into queue
[4009900.286611] svc: got len=128
[4009900.290642] svc: transport ee5fd600 busy, not enqueued
[4009900.294788] svc: socket ee5fd600 sendto([efb5c000 116... ], 116) = 116 (addr 192.168.1.5, port=910)
[4009900.298707] svc: socket c1888960 TCP data ready (svsk ee5fd600)
[4009900.302600] svc: transport ee5fd600 busy, not enqueued
[4009900.306622] svc: transport ee5fd600 busy, not enqueued
[4009900.310604] svc: server ee978000 waiting for data (to = 90)
[4009900.346682] svc: transport ee5fd600 dequeued, inuse=1
[4009900.350609] svc: server ee8fe000, pool 0, transport ee5fd600, inuse=2
[4009900.354606] svc: tcp_recv ee5fd600 data 1 conn 0 close 0
[4009900.358626] svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
[4009900.362603] svc: TCP record, 148 bytes
[4009900.366622] svc: socket ee5fd600 recvfrom(eec92094, 3948) = 148
[4009900.370604] svc: TCP complete record (148 bytes)
[4009900.374612] svc: transport ee5fd600 put into queue
[4009900.378610] svc: got len=148
[4009900.382652] svc: transport ee5fd600 busy, not enqueued
[4009900.386829] svc: socket ee5fd600 sendto([e98bf000 232... ], 232) = 232 (addr 192.168.1.5, port=910)
[4009900.390688] svc: socket c1888960 TCP data ready (svsk ee5fd600)
[4009900.394606] svc: transport ee5fd600 busy, not enqueued
[4009900.398621] svc: transport ee5fd600 busy, not
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Sat, Aug 30, 2008 at 04:47:41PM +0100, Ian Campbell wrote:
On Wed, 2008-08-27 at 09:43 -0500, Tom Tucker wrote: Sure. I've actually tried to reproduce it here unsuccessfully. As a starter, I would suggest turning on transport debugging: # echo 256 > /proc/sys/sunrpc/rpc_debug [...]
If Ian is willing to create the log (or already has one), I'm certainly willing to look at it.
It produced only the following (is that what was expected?):
[146866.448112] -pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--
[146866.448112] 30576 0001 00a0 0 f77a1600 13 f790334015000 xprt_pending fa0ba88e fa0c9df4
[146866.448112] 30577 0004 0080-11 f77a1600 13 f79030000 xprt_sending fa0ba88e fa0c9df4
It's normal to get something like that when you turn it on, yes (unless someone else spots anything odd about that...), but what's really needed is to turn this on and then reproduce the problem--it's the debugging output that goes to the logs during the problem that'll be interesting. --b.
- This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100url=/ ___ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Sun, Aug 31, 2008 at 08:44:02PM +0100, Ian Campbell wrote:
On Sun, 2008-08-31 at 15:30 -0400, J. Bruce Fields wrote:
On Sat, Aug 30, 2008 at 04:47:41PM +0100, Ian Campbell wrote:
On Wed, 2008-08-27 at 09:43 -0500, Tom Tucker wrote: Sure. I've actually tried to reproduce it here unsuccessfully. As a starter, I would suggest turning on transport debugging: # echo 256 > /proc/sys/sunrpc/rpc_debug [...]
If Ian is willing to create the log (or already has one), I'm certainly willing to look at it.
It produced only the following (is that what was expected?):
[146866.448112] -pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--
[146866.448112] 30576 0001 00a0 0 f77a1600 13 f790334015000 xprt_pending fa0ba88e fa0c9df4
[146866.448112] 30577 0004 0080-11 f77a1600 13 f79030000 xprt_sending fa0ba88e fa0c9df4
It's normal to get something like that when you turn it on, yes (unless someone else spots anything odd about that...), but what's really needed is to turn this on and then reproduce the problem--it's the debugging output that goes to the logs during the problem that'll be interesting.
That's what I did. The first time I did the echo I just got the header line, then I waited for the repro and, since there had been no further logging, I ran the echo again and got the three lines above. Sounds like you expected there to be more and ongoing logging?
Yes. It's the server side we're interested in, by the way--are you collecting this debugging on the client? --b.
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
J. Bruce Fields wrote:
On Mon, Aug 25, 2008 at 11:04:08AM -0500, Tom Tucker wrote:
Trond Myklebust wrote:
On Sun, 2008-08-24 at 23:09 +0100, Ian Campbell wrote: (added some quoting from previous mail to save replying twice)
On Sun, 2008-08-24 at 15:19 -0400, Trond Myklebust wrote:
On Sun, 2008-08-24 at 15:17 -0400, Trond Myklebust wrote: From the tcpdump, it looks as if the NFS server is failing to close the socket when the client closes its side. You therefore end up getting stuck in the FIN_WAIT2 state (as netstat clearly shows above). Is the server keeping the client in this state for a very long period?
Well, it had been around an hour and a half on this occasion. Next time it happens I can wait longer, but I'm pretty sure I've come back from time away and it's been wedged for at least a day. How long would you expect it to remain in this state for?
The server should ideally start to close the socket as soon as it receives the FIN from the client. I'll have a look at the code.
I don't think it should matter how long the connection stays in FIN WAIT; the client should reconnect anyway. Since the client seems to be the variable, I would think it might be an issue with the client reconnect logic? That said, 2.6.25 is when the server side transport switch logic went in.
Any chance you could help Trond figure out why the server might be doing this? If not, I'll get to it, but not as soon as I should.
Sure. I've actually tried to reproduce it here unsuccessfully. As a starter, I would suggest turning on transport debugging: # echo 256 > /proc/sys/sunrpc/rpc_debug
Here are my thoughts on how it is supposed to work. Trond, if this doesn't match your understanding, please let me know. For the case where the client closes the connection first because its timeout is shorter (5 min vs. 6), the sequence of events should be:
client: close() sends FIN, goes to FIN-WAIT-1
server: receives FIN, replies with ACK, and goes to CLOSE-WAIT
client: receives ACK, goes to FIN-WAIT-2
server: calls the tcp_state_change callback on the socket; svc_tcp_state_change sets XPT_CLOSE on the transport and enqueues the transport for servicing by an svc thread
server: thread gets dequeued, calls svc_recv; svc_recv sees XPT_CLOSE and calls svc_xprt_destroy, which closes the socket; TCP sends FIN/ACK after the close
client: receives FIN/ACK, goes to TIME-WAIT
client: since the state is TIME-WAIT and reuse-port is set, we can re-connect
There are a couple of places we could be getting stuck:
- The state-change callback never gets called.
- The state-change callback gets called but svsk is null and it ignores the event.
- svc_tcp_state_change enqueues the transport, but due to a race bug the transport doesn't actually get queued, and since there is no activity it never closes.
- Something else.
The message log with RPCDBG_SVCXPRT (256) enabled as shown above would help figure out where it's happening. If Ian is willing to create the log (or already has one), I'm certainly willing to look at it. Tom
--b.
-- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
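The server-side half of the handoff Tom outlines has a simple userspace analogue: when the peer sends its FIN, the server's next read returns end-of-file, and the server must respond by closing its own side, otherwise the client is left parked in FIN_WAIT2. A minimal loopback sketch (plain Python sockets, standing in for the kernel's tcp_state_change/svc_tcp_state_change path):

```python
import socket

# Server end: a loopback listener stands in for nfsd's listening socket.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.shutdown(socket.SHUT_WR)   # client sends FIN, enters FIN-WAIT-1/2
data = conn.recv(4096)         # server: EOF (b"") marks the peer's FIN
assert data == b""             # this is the server's cue to close

conn.close()                   # without this, the client sits in FIN_WAIT2
cli.close()
srv.close()
```

In the kernel the "cue" is delivered asynchronously via the tcp_state_change callback rather than a blocking recv, which is exactly why a lost enqueue at that point (one of Tom's suspect spots) would leave the socket open indefinitely.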
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Mon, Aug 25, 2008 at 11:04:08AM -0500, Tom Tucker wrote:
Trond Myklebust wrote:
On Sun, 2008-08-24 at 23:09 +0100, Ian Campbell wrote: (added some quoting from previous mail to save replying twice)
On Sun, 2008-08-24 at 15:19 -0400, Trond Myklebust wrote:
On Sun, 2008-08-24 at 15:17 -0400, Trond Myklebust wrote: From the tcpdump, it looks as if the NFS server is failing to close the socket when the client closes its side. You therefore end up getting stuck in the FIN_WAIT2 state (as netstat clearly shows above). Is the server keeping the client in this state for a very long period?
Well, it had been around an hour and a half on this occasion. Next time it happens I can wait longer, but I'm pretty sure I've come back from time away and it's been wedged for at least a day. How long would you expect it to remain in this state for?
The server should ideally start to close the socket as soon as it receives the FIN from the client. I'll have a look at the code.
I don't think it should matter how long the connection stays in FIN WAIT; the client should reconnect anyway. Since the client seems to be the variable, I would think it might be an issue with the client reconnect logic? That said, 2.6.25 is when the server side transport switch logic went in.
Any chance you could help Trond figure out why the server might be doing this? If not, I'll get to it, but not as soon as I should. --b.
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
Trond Myklebust wrote:
On Sun, 2008-08-24 at 23:09 +0100, Ian Campbell wrote: (added some quoting from previous mail to save replying twice)
On Sun, 2008-08-24 at 15:19 -0400, Trond Myklebust wrote:
On Sun, 2008-08-24 at 15:17 -0400, Trond Myklebust wrote: From the tcpdump, it looks as if the NFS server is failing to close the socket when the client closes its side. You therefore end up getting stuck in the FIN_WAIT2 state (as netstat clearly shows above). Is the server keeping the client in this state for a very long period?
Well, it had been around an hour and a half on this occasion. Next time it happens I can wait longer, but I'm pretty sure I've come back from time away and it's been wedged for at least a day. How long would you expect it to remain in this state for?
The server should ideally start to close the socket as soon as it receives the FIN from the client. I'll have a look at the code.
I don't think it should matter how long the connection stays in FIN WAIT; the client should reconnect anyway. Since the client seems to be the variable, I would think it might be an issue with the client reconnect logic? That said, 2.6.25 is when the server side transport switch logic went in. Tom
Trond
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Mon, 2008-08-25 at 11:04 -0500, Tom Tucker wrote: I don't think it should matter how long the connection stays in FIN WAIT; the client should reconnect anyway.
Which would break the replay cache, since we can't reconnect using the same port.
Since the client seems to be the variable, I would think it might be an issue with the client reconnect logic?
No. Trond
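Trond's objection rests on the server's duplicate request (replay) cache identifying a client by its source address and port: a reconnect from a fresh ephemeral port would miss cached replies to retransmitted requests, which is why the RPC client pins its source port rather than letting connect() pick one. A minimal userspace sketch of that pinning (loopback addresses and the helper flow are illustrative, not the kernel's xprtsock code):

```python
import socket

# Any loopback listener stands in for the NFS server.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

# Client: bind an explicit source port *before* connecting, the way the
# kernel RPC client reuses its port so the server's replay cache still
# recognizes the (address, port) pair after a reconnect.
cli = socket.socket()
cli.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
cli.bind(("127.0.0.1", 0))     # 0 = let the OS pick a port, then remember it
src_port = cli.getsockname()[1]
cli.connect(srv.getsockname())

conn, peer = srv.accept()
assert peer[1] == src_port      # server sees the pinned source port

conn.close(); cli.close(); srv.close()
```

The catch Trond alludes to: while the old connection lingers on the server (e.g. FIN_WAIT2/TIME-WAIT of the same 4-tuple), binding that same port for a new connection can fail, so a server that never closes its side effectively blocks a replay-cache-safe reconnect.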
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Tue, 2008-08-26 at 06:23 +1000, Grant Coady wrote:
On Fri, 22 Aug 2008 14:56:53 -0700, Trond Myklebust [EMAIL PROTECTED] wrote:
On Fri, 2008-08-22 at 22:37 +0100, Ian Campbell wrote: I can ssh to the server fine. The same server also serves my NFS home directory to the box I'm writing this from, and I've not seen any trouble with this box at all; it's a 2.6.18-xen box.
OK... Are you able to reproduce the problem reliably? If so, can you provide me with a binary tcpdump or wireshark dump? If using tcpdump, then please use something like tcpdump -w /tmp/dump.out -s 9 host myserver.foo.bar and port 2049
-- typo?
No. The intention was to record _all_ the info in the packet for analysis, not just random header info. Trond
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
Try '-s 0'; from tcpdump(8): Setting snaplen to 0 means use the required length to catch whole packets. -Dan
-Original Message-
From: Grant Coady [mailto:[EMAIL PROTECTED]
Sent: Monday, August 25, 2008 5:29 PM
To: Trond Myklebust
Cc: Grant Coady; Ian Campbell; John Ronciak; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; Jeff Kirsher; Jesse Brandeburg; Bruce Allan; PJ Waskiewicz; John Ronciak; e1000-devel@lists.sourceforge.net
Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.
On Mon, 25 Aug 2008 18:11:12 -0400, Trond Myklebust [EMAIL PROTECTED] wrote:
On Tue, 2008-08-26 at 06:23 +1000, Grant Coady wrote:
On Fri, 22 Aug 2008 14:56:53 -0700, Trond Myklebust [EMAIL PROTECTED] wrote:
On Fri, 2008-08-22 at 22:37 +0100, Ian Campbell wrote: I can ssh to the server fine. The same server also serves my NFS home directory to the box I'm writing this from, and I've not seen any trouble with this box at all; it's a 2.6.18-xen box.
OK... Are you able to reproduce the problem reliably? If so, can you provide me with a binary tcpdump or wireshark dump? If using tcpdump, then please use something like tcpdump -w /tmp/dump.out -s 9 host myserver.foo.bar and port 2049
-- typo?
No. The intention was to record _all_ the info in the packet for analysis, not just random header info.
Hi Trond, My tcpdump seems to have a 16-bit snaplen counter:
~# tcpdump -w /tmp/dump.out -s 65535 host deltree and port 2049
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^C
0 packets captured
4 packets received by filter
0 packets dropped by kernel
~# tcpdump -w /tmp/dump.out -s 65536 host deltree and port 2049
tcpdump: invalid snaplen 65536
~# tcpdump --version
tcpdump version 3.9.8
libpcap version 0.9.8
So I'm now using:
~# tcpdump -w /tmp/dump.out -s 65535 -C 10 -W 100 host deltree and port 2049
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
to get a 1GB round-robin trace buffer; I can stop the trace when the problem is noticed, as it is so long between delay/stall happenings. Then I'll try to trigger the thing. Is this the correct style of trace you are expecting?
~$ /usr/sbin/tcpdump -r /tmp/dump.out00
reading from file /tmp/dump.out00, link-type EN10MB (Ethernet)
10:13:49.719781 IP pooh64.mire.mine.nu.2156510591 > deltree.mire.mine.nu.nfs: 116 access fh 0,1/218104576 001f
10:13:49.720215 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.2156510591: reply ok 124 access c 001f
10:13:49.720225 IP pooh64.mire.mine.nu.984 > deltree.mire.mine.nu.nfsd: . ack 1649405551 win 5840
10:13:49.720288 IP pooh64.mire.mine.nu.2173287807 > deltree.mire.mine.nu.nfs: 136 readdirplus fh 0,1/218104576 512 bytes @ 0
10:13:49.742450 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.2173287807: reply ok 1460 readdirplus
Is there some test suite I can use? Compiling kernels over NFS worked fine yesterday, apart from the fastest box's make complaining about clock skew. The kernel booted though, so that was okay. Guess it's back to the interactive editing over NFS to see if the thing manifests the delay/stalls again. I'm on the .27-rc4-git4 kernel as soon as it compiles for the client; the NFS server is 2.6.24.7 at the moment. Grant.
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Mon, 25 Aug 2008 18:11:12 -0400, Trond Myklebust [EMAIL PROTECTED] wrote:
On Tue, 2008-08-26 at 06:23 +1000, Grant Coady wrote:
On Fri, 22 Aug 2008 14:56:53 -0700, Trond Myklebust [EMAIL PROTECTED] wrote:
On Fri, 2008-08-22 at 22:37 +0100, Ian Campbell wrote: I can ssh to the server fine. The same server also serves my NFS home directory to the box I'm writing this from, and I've not seen any trouble with this box at all; it's a 2.6.18-xen box.
OK... Are you able to reproduce the problem reliably? If so, can you provide me with a binary tcpdump or wireshark dump? If using tcpdump, then please use something like tcpdump -w /tmp/dump.out -s 9 host myserver.foo.bar and port 2049
-- typo?
No. The intention was to record _all_ the info in the packet for analysis, not just random header info.
I booted 2.6.27-rc4-git4a and started the tcpdump trace as described in the last email, and got an immediate stall, which cleared after 10-20 seconds, on running a small script exported from the NFS server as user:
~$ cat /home/common/bin/boot-report
#! /bin/bash
#
version=$(uname -r)
dmesg > /home/common/dmesg-$version
# 2006-12-15 no more filter .config
#grep = /boot/config-$version > /home/common/config-$version
cp /boot/config-$version /home/common
The script had not created the file on the server when the command prompt returned. During the stall I also opened another client terminal and ran 'df'; it stalled too, and both client terminals came back to a prompt after the 10-20s delay. Then I hit Ctrl-C on the tcpdump terminal to prepare this report.
The trace is only 27 lines; deltree is the server, pooh64 the client:
[EMAIL PROTECTED]:~$ /usr/sbin/tcpdump -r /tmp/dump.out00
reading from file /tmp/dump.out00, link-type EN10MB (Ethernet)
10:42:14.061410 IP pooh64.mire.mine.nu.0 > deltree.mire.mine.nu.nfs: 0 null
10:42:14.062691 IP deltree.mire.mine.nu.nfsd > pooh64.mire.mine.nu.839: S 2746115414:2746115414(0) ack 3659136038 win 5840 <mss 1460,nop,nop,sackOK>
10:42:14.062708 IP pooh64.mire.mine.nu.839 > deltree.mire.mine.nu.nfsd: . ack 1 win 5840
10:42:14.062750 IP pooh64.mire.mine.nu.1392480046 > deltree.mire.mine.nu.nfs: 112 getattr fh 0,1/218104576
10:42:14.062765 IP pooh64.mire.mine.nu.1409257262 > deltree.mire.mine.nu.nfs: 112 fsstat fh 0,1/218104576
10:42:14.062944 IP deltree.mire.mine.nu.nfsd > pooh64.mire.mine.nu.839: . ack 113 win 5840
10:42:14.062951 IP deltree.mire.mine.nu.nfsd > pooh64.mire.mine.nu.839: . ack 225 win 5840
10:42:14.063873 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.1392480046: reply ok 116 getattr DIR 40755 ids 1000/10 sz 448
10:42:14.063881 IP pooh64.mire.mine.nu.839 > deltree.mire.mine.nu.nfsd: . ack 117 win 5840
10:42:14.063890 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.1409257262: reply ok 88 fsstat tbytes 8422359040 fbytes 6605647872 abytes 6605647872
10:42:14.063893 IP pooh64.mire.mine.nu.839 > deltree.mire.mine.nu.nfsd: . ack 205 win 5840
10:42:14.064319 IP pooh64.mire.mine.nu.1426034478 > deltree.mire.mine.nu.nfs: 116 access fh 0,1/218104576 001f
10:42:14.064677 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.1426034478: reply ok 124 access c 001f
10:42:14.064732 IP pooh64.mire.mine.nu.1442811694 > deltree.mire.mine.nu.nfs: 136 readdirplus fh 0,1/218104576 512 bytes @ 0
10:42:14.067193 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.1442811694: reply ok 1460 readdirplus
10:42:14.067203 IP pooh64.mire.mine.nu.839 > deltree.mire.mine.nu.nfsd: . ack 1789 win 8760
10:42:14.067208 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.841889326: reply ok 1324
10:42:14.067328 IP pooh64.mire.mine.nu.1459588910 > deltree.mire.mine.nu.nfs: 112 getattr fh Unknown/0103000D022401060003000D0200FE6F
10:42:14.068572 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.1459588910: reply ok 116 getattr DIR 40755 ids 1000/10 sz 448
10:42:14.068619 IP pooh64.mire.mine.nu.1476366126 > deltree.mire.mine.nu.nfs: 136 getattr fh Unknown/01060003000D0200676A020002000100
10:42:14.070615 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.1476366126: reply ok 116 getattr DIR 40755 ids 1000/10 sz 360
10:42:14.070652 IP pooh64.mire.mine.nu.1493143342 > deltree.mire.mine.nu.nfs: 140 access fh Unknown/01060003000D0200676A020002000100 001f
10:42:14.071576 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.1493143342: reply ok 124 access c 001f
10:42:14.071638 IP pooh64.mire.mine.nu.1509920558 > deltree.mire.mine.nu.nfs: 160 readdirplus fh Unknown/01060003000D0200676A020002000100 512 bytes @ 0
10:42:14.073382 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.1509920558: reply ok 1460 readdirplus
10:42:14.074096 IP deltree.mire.mine.nu.nfs > pooh64.mire.mine.nu.0: reply Unknown rpc response code=6 516
10:42:14.074102 IP pooh64.mire.mine.nu.839 > deltree.mire.mine.nu.nfsd: . ack 5445 win 17520
Sorry I forgot to do
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Sun, 2008-08-24 at 19:53 +0100, Ian Campbell wrote:
On Fri, 2008-08-22 at 14:56 -0700, Trond Myklebust wrote:
On Fri, 2008-08-22 at 22:37 +0100, Ian Campbell wrote:
I can ssh to the server fine. The same server also serves my NFS home directory to the box I'm writing this from, and I've seen no trouble with this box at all; it's a 2.6.18-xen box.

OK... Are you able to reproduce the problem reliably? If so, can you provide me with a binary tcpdump or wireshark dump? If using tcpdump, then please use something like

  tcpdump -w /tmp/dump.out -s 9 host myserver.foo.bar and port 2049

Please also try to provide a netstat dump of the current TCP connections as soon as the hang occurs:

  netstat -t

Aug 24 18:08:59 iranon kernel: [168839.556017] nfs: server hopkins not responding, still trying

but I wasn't around until 19:38 to spot it. netstat when I got to it was:

Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 localhost.localdo:50891 localhost.localdom:6543 ESTABLISHED
tcp        1      0 iranon.hellion.org.:ssh azathoth.hellion.:52682 CLOSE_WAIT
tcp        0      0 localhost.localdom:6543 localhost.localdo:50893 ESTABLISHED
tcp        0      0 iranon.hellion.org.:837 hopkins.hellion.org:nfs FIN_WAIT2
tcp        0      0 localhost.localdom:6543 localhost.localdo:41831 ESTABLISHED
tcp        0      0 localhost.localdo:13666 localhost.localdo:59482 ESTABLISHED
tcp        0      0 localhost.localdo:34288 localhost.localdom:6545 ESTABLISHED
tcp        0      0 iranon.hellion.org.:ssh azathoth.hellion.:48977 ESTABLISHED
tcp        0      0 iranon.hellion.org.:ssh azathoth.hellion.:52683 ESTABLISHED
tcp        0      0 localhost.localdom:6545 localhost.localdo:34288 ESTABLISHED
tcp        0      0 localhost.localdom:6543 localhost.localdo:50891 ESTABLISHED
tcp        0      0 localhost.localdo:50893 localhost.localdom:6543 ESTABLISHED
tcp        0      0 localhost.localdo:41831 localhost.localdom:6543 ESTABLISHED
tcp        0     87 localhost.localdo:59482 localhost.localdo:13666 ESTABLISHED
tcp        1      0 localhost.localdom:6543 localhost.localdo:41830 CLOSE_WAIT

(iranon is the problematic host .4, azathoth is my desktop machine .5, hopkins is the NFS server .6)

The tcpdumps are pretty big. I've attached the last 100 packets captured. If you need more I can put the full file up somewhere.

-rw-r--r-- 1 root root 1.3G Aug 24 17:57 dump.out0
-rw-r--r-- 1 root root 536M Aug 24 19:38 dump.out1

Ian.

From the tcpdump, it looks as if the NFS server is failing to close the socket when the client closes its side. You therefore end up stuck in the FIN_WAIT2 state (as the netstat above clearly shows). Is the server keeping the client in this state for a very long period?

Cheers,
Trond

- This SF.Net email is sponsored by the Moblin Your Move Developer's challenge. Build the coolest Linux based applications with Moblin SDK and win great prizes. Grand prize is a trip for two to an Open Source event anywhere in the world. http://moblin-contest.org/redirect.php?banner_id=100&url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
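Trond's diagnosis hinges on spotting that one FIN_WAIT2 entry among the healthy sockets in the netstat dump above. A minimal sketch of a filter for half-closed sockets (the function name and the sample line are made up for illustration; field positions assume `netstat -t`-style output):

```shell
# Filter netstat -t output for sockets stuck half-closed. FIN_WAIT2 means
# we sent our FIN and are still waiting for the peer's; CLOSE_WAIT means
# the peer sent a FIN that the local application has not close()d yet.
flag_stuck() {
    awk '$1 == "tcp" && ($6 == "FIN_WAIT2" || $6 == "CLOSE_WAIT") { print $4, $5, $6 }'
}

# Live use would be: netstat -t | flag_stuck
# Demo against a line shaped like the dump above (hypothetical sample):
printf 'tcp 0 0 iranon:837 hopkins:nfs FIN_WAIT2\n' | flag_stuck
```

Run periodically from cron, a filter like this would have flagged the wedged NFS connection long before anyone was around to notice the hang.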
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Sun, 2008-08-24 at 23:09 +0100, Ian Campbell wrote: (added some quoting from the previous mail to save replying twice)
On Sun, 2008-08-24 at 15:19 -0400, Trond Myklebust wrote:
On Sun, 2008-08-24 at 15:17 -0400, Trond Myklebust wrote:
From the tcpdump, it looks as if the NFS server is failing to close the socket when the client closes its side. You therefore end up stuck in the FIN_WAIT2 state (as netstat clearly shows above). Is the server keeping the client in this state for a very long period?

Well, it had been around an hour and a half on this occasion. Next time it happens I can wait longer, but I'm pretty sure I've come back from time away and found it wedged for at least a day. How long would you expect it to remain in this state?

The server should ideally start to close the socket as soon as it receives the FIN from the client. I'll have a look at the code.

Trond
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Fri, Aug 22, 2008 at 11:13 AM, Ian Campbell [EMAIL PROTECTED] wrote:
That's probably also a networking device driver issue candidate: your RPC task is queued up waiting to be sent. What networking card+device driver are you using here?

# ethtool -i eth0
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: N/A
bus-info: :01:0a.0

There is nothing indicating that the NIC/driver is causing any sort of problem here, at least not with what has been presented so far. When the NFS mount isn't working, is the networking still active and working?

-- Cheers, John
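Besides `ethtool -i`, the driver's error counters are the usual first check when a NIC is suspected. A sketch of scanning `ethtool -S eth0`-style "name: value" output for non-zero error/drop counters (counter names vary by driver, so the pattern and sample values here are assumptions):

```shell
# Report any error/drop/miss counter with a non-zero value from
# "name: value" statistics lines such as those printed by `ethtool -S`.
scan_nic_errors() {
    awk -F': *' '/err|drop|miss/ && $2 + 0 > 0 { print $1, $2 }'
}

# Live use would be: ethtool -S eth0 | scan_nic_errors
# Demo with hypothetical counters:
printf 'rx_errors: 0\ntx_dropped: 3\n' | scan_nic_errors
```

Silence from such a scan (all counters zero) would support John's conclusion that the NIC and driver are not at fault.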
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Fri, Aug 22, 2008 at 1:00 PM, Ian Campbell [EMAIL PROTECTED] wrote:
So far as I can tell, yes. I can log in via ssh as long as the user doesn't have an NFS $HOME. I haven't tried much else, and the box isn't locked up at the moment, but I'd bet it's fine.

Then it's highly unlikely that the NIC/driver is causing your problem. We'll continue to monitor this thread, waiting for any info that this is somehow NIC/driver related. Good luck.

-- Cheers, John
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Fri, 2008-08-22 at 21:00 +0100, Ian Campbell wrote:
On Fri, 2008-08-22 at 12:33 -0700, John Ronciak wrote:
On Fri, Aug 22, 2008 at 11:13 AM, Ian Campbell [EMAIL PROTECTED] wrote:
That's probably also a networking device driver issue candidate: your RPC task is queued up waiting to be sent. What networking card+device driver are you using here?

# ethtool -i eth0
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: N/A
bus-info: :01:0a.0

There is nothing indicating that the NIC/driver is causing any sort of problem here, at least not with what has been presented so far. When the NFS mount isn't working, is the networking still active and working?

So far as I can tell, yes. I can log in via ssh as long as the user doesn't have an NFS $HOME. I haven't tried much else, and the box isn't locked up at the moment, but I'd bet it's fine.

...and the server? Something is preventing that RPC payload from being delivered...

Trond
Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.
On Fri, 2008-08-22 at 22:37 +0100, Ian Campbell wrote:
I can ssh to the server fine. The same server also serves my NFS home directory to the box I'm writing this from, and I've seen no trouble with this box at all; it's a 2.6.18-xen box.

OK... Are you able to reproduce the problem reliably? If so, can you provide me with a binary tcpdump or wireshark dump? If using tcpdump, then please use something like

  tcpdump -w /tmp/dump.out -s 9 host myserver.foo.bar and port 2049

Please also try to provide a netstat dump of the current TCP connections as soon as the hang occurs:

  netstat -t

Cheers,
Trond
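As the thread later shows, an open-ended capture like the one suggested here grew to 1.3 GB. tcpdump can rotate its output instead (`-C` rotates at roughly the given number of MB, `-W` caps the file count, `-s 0` captures full frames). A hypothetical helper that only assembles such a command line, since actually running tcpdump needs root and a live interface; the function name and paths are made up:

```shell
# Build a rotating-capture tcpdump command for NFS traffic to a given
# server: at most 10 files of ~100 MB each, full-size packets.
build_capture_cmd() {
    server="$1"
    echo "tcpdump -w /tmp/dump.out -C 100 -W 10 -s 0 host $server and port 2049"
}

build_capture_cmd hopkins.hellion.org
```

With rotation in place the capture can be left running for days while waiting for an intermittent hang, without filling the disk.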