Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-09-12 Thread Tom Tucker
J. Bruce Fields wrote:
 On Wed, Sep 10, 2008 at 09:40:15AM +0100, Ian Campbell wrote:
   
 On Mon, 2008-09-01 at 12:20 -0500, Tom Tucker wrote: 
 
 Ian Campbell wrote:
   
 On Sun, 2008-08-31 at 14:51 -0500, Tom Tucker wrote:
   
 
 Looks like you ran this on the client. Sorry, Ian, I should have been
 more specific. You need to modify the rpc_debug file on the server.
 
   
 I tried this on the server but it's pretty crippling (the server is
 quite weedy, 300MHz K3 or something).

 I'll leave it logging overnight since things should be pretty quiet (and
 that's often when the problem occurs) but if there's a less aggressive
 setting than 256 but which would still be useful I could leave it on
 permanently until the problem happens.

   
 
 Thanks Ian. Unfortunately, that's as fine grained as it gets. 256 
 (0x100) is the bit for transport logging.
   
 Still no repro with the debugging enabled I'm afraid.

 I'm wondering if the performance hit makes it unlikely to trigger since
 everything slows to a crawl. I haven't been able to leave it on 24/7
 since it makes all the NFS clients nearly unusable but I've had it on
 overnight and when there's no one around etc, which is when the hang
 generally seems to occur.
 

 OK, thanks for the attempt.  We should figure out how to tone down those
 dprintk's.

 I'll try to take a closer look at this, but I'm travelling starting
 tomorrow, and realistically may not find time to look at this for a
 couple weeks.  If someone else gets to it before me, great

 --b.

   
 I forgot to re-enable it one evening and, perhaps inevitably, it crashed
 overnight. I turned the debugging on and captured the following. It's
 not much since syslog had given up for some reason, probably when the
 disk filled up with logs. 192.168.1.5 is a non-problematic client so I
 don't think the log will be useful since the problematic client
 (192.168.1.4) isn't mentioned..

 Ian.

 

Iain sadly yes. There's a thread stuck holding the BUSY bit or a thread 
failed to clear the bit properly
(maybe an error path). Data continues to arrives, but the transport 
never gets put back on the queue
because it's BUSY bit is set.  In other words, this is a different error 
than the one we've been chasing.

If I sent you a patch, could you rebuild the kernel and give it a whirl? 
Also, can you give me a
kernel.org relative commit-id or tag for the kernel that you're using?

Thanks,
Tom

 vc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
 [4009900.094587] svc: TCP record, 140 bytes
 [4009900.098604] svc: socket ee5fd600 recvfrom(e996b08c, 3956) = 140
 [4009900.102587] svc: TCP complete record (140 bytes)
 [4009900.106592] svc: transport ee5fd600 put into queue
 [4009900.110613] svc: got len=140
 [4009900.114645] svc: transport ee5fd600 busy, not enqueued
 [4009900.118752] svc: socket ee5fd600 sendto([dfbb 232... ], 232) = 232 
 (addr 192.168.1.5, port=910)
 [4009900.122636] svc: socket c1888960 TCP data ready (svsk ee5fd600)
 [4009900.126588] svc: transport ee5fd600 busy, not enqueued
 [4009900.130601] svc: transport ee5fd600 busy, not enqueued
 [4009900.134593] svc: server eefc4000 waiting for data (to = 90)
 [4009900.162711] svc: transport ee5fd600 dequeued, inuse=1
 [4009900.166598] svc: server eef4a000, pool 0, transport ee5fd600, inuse=2
 [4009900.170594] svc: tcp_recv ee5fd600 data 1 conn 0 close 0
 [4009900.174611] svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
 [4009900.178592] svc: TCP record, 136 bytes
 [4009900.182606] svc: socket ee5fd600 recvfrom(e975a088, 3960) = 136
 [4009900.186592] svc: TCP complete record (136 bytes)
 [4009900.190599] svc: transport ee5fd600 put into queue
 [4009900.194605] svc: got len=136
 [4009900.198632] svc: transport ee5fd600 busy, not enqueued
 [4009900.202717] svc: socket ee5fd600 sendto([d6464000 124... ], 124) = 124 
 (addr 192.168.1.5, port=910)
 [4009900.206664] svc: socket c1888960 TCP data ready (svsk ee5fd600)
 [4009900.210594] svc: transport ee5fd600 busy, not enqueued
 [4009900.214611] svc: transport ee5fd600 busy, not enqueued
 [4009900.218599] svc: server eef4a000 waiting for data (to = 90)
 [4009900.254640] svc: transport ee5fd600 dequeued, inuse=1
 [4009900.258603] svc: server ee978000, pool 0, transport ee5fd600, inuse=2
 [4009900.262601] svc: tcp_recv ee5fd600 data 1 conn 0 close 0
 [4009900.266625] svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
 [4009900.270598] svc: TCP record, 128 bytes
 [4009900.274618] svc: socket ee5fd600 recvfrom(dd54d080, 3968) = 128
 [4009900.278598] svc: TCP complete record (128 bytes)
 [4009900.282605] svc: transport ee5fd600 put into queue
 [4009900.286611] svc: got len=128
 [4009900.290642] svc: transport ee5fd600 busy, not enqueued
 [4009900.294788] svc: socket ee5fd600 sendto([efb5c000 116... ], 116) = 116 
 (addr 192.168.1.5, port=910)
 [4009900.298707] svc: socket c1888960 TCP data ready (svsk ee5fd600)
 [4009900.302600] svc: transport ee5fd600 busy, not 

Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-09-10 Thread Ian Campbell
On Mon, 2008-09-01 at 12:20 -0500, Tom Tucker wrote: 
 Ian Campbell wrote:
  On Sun, 2008-08-31 at 14:51 -0500, Tom Tucker wrote:

  Looks like you ran this on the client. Sorry, Ian, I should have been
  more specific. You need to modify the rpc_debug file on the server.
  
 
  I tried this on the server but it's pretty crippling (the server is
  quite weedy, 300MHz K3 or something).
 
  I'll leave it logging overnight since things should be pretty quiet (and
  that's often when the problem occurs) but if there's a less aggressive
  setting than 256 but which would still be useful I could leave it on
  permanently until the problem happens.
 

 Thanks Ian. Unfortunately, that's as fine grained as it gets. 256 
 (0x100) is the bit for transport logging.

Still no repro with the debugging enabled I'm afraid.

I'm wondering if the performance hit makes it unlikely to trigger since
everything slows to a crawl. I haven't been able to leave it on 24/7
since it makes all the NFS clients nearly unusable but I've had it on
overnight and when there's no one around etc, which is when the hang
generally seems to occur.

I forgot to re-enable it one evening and, perhaps inevitably, it crashed
overnight. I turned the debugging on and captured the following. It's
not much since syslog had given up for some reason, probably when the
disk filled up with logs. 192.168.1.5 is a non-problematic client so I
don't think the log will be useful since the problematic client
(192.168.1.4) isn't mentioned..

Ian.

vc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
[4009900.094587] svc: TCP record, 140 bytes
[4009900.098604] svc: socket ee5fd600 recvfrom(e996b08c, 3956) = 140
[4009900.102587] svc: TCP complete record (140 bytes)
[4009900.106592] svc: transport ee5fd600 put into queue
[4009900.110613] svc: got len=140
[4009900.114645] svc: transport ee5fd600 busy, not enqueued
[4009900.118752] svc: socket ee5fd600 sendto([dfbb 232... ], 232) = 232 
(addr 192.168.1.5, port=910)
[4009900.122636] svc: socket c1888960 TCP data ready (svsk ee5fd600)
[4009900.126588] svc: transport ee5fd600 busy, not enqueued
[4009900.130601] svc: transport ee5fd600 busy, not enqueued
[4009900.134593] svc: server eefc4000 waiting for data (to = 90)
[4009900.162711] svc: transport ee5fd600 dequeued, inuse=1
[4009900.166598] svc: server eef4a000, pool 0, transport ee5fd600, inuse=2
[4009900.170594] svc: tcp_recv ee5fd600 data 1 conn 0 close 0
[4009900.174611] svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
[4009900.178592] svc: TCP record, 136 bytes
[4009900.182606] svc: socket ee5fd600 recvfrom(e975a088, 3960) = 136
[4009900.186592] svc: TCP complete record (136 bytes)
[4009900.190599] svc: transport ee5fd600 put into queue
[4009900.194605] svc: got len=136
[4009900.198632] svc: transport ee5fd600 busy, not enqueued
[4009900.202717] svc: socket ee5fd600 sendto([d6464000 124... ], 124) = 124 
(addr 192.168.1.5, port=910)
[4009900.206664] svc: socket c1888960 TCP data ready (svsk ee5fd600)
[4009900.210594] svc: transport ee5fd600 busy, not enqueued
[4009900.214611] svc: transport ee5fd600 busy, not enqueued
[4009900.218599] svc: server eef4a000 waiting for data (to = 90)
[4009900.254640] svc: transport ee5fd600 dequeued, inuse=1
[4009900.258603] svc: server ee978000, pool 0, transport ee5fd600, inuse=2
[4009900.262601] svc: tcp_recv ee5fd600 data 1 conn 0 close 0
[4009900.266625] svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
[4009900.270598] svc: TCP record, 128 bytes
[4009900.274618] svc: socket ee5fd600 recvfrom(dd54d080, 3968) = 128
[4009900.278598] svc: TCP complete record (128 bytes)
[4009900.282605] svc: transport ee5fd600 put into queue
[4009900.286611] svc: got len=128
[4009900.290642] svc: transport ee5fd600 busy, not enqueued
[4009900.294788] svc: socket ee5fd600 sendto([efb5c000 116... ], 116) = 116 
(addr 192.168.1.5, port=910)
[4009900.298707] svc: socket c1888960 TCP data ready (svsk ee5fd600)
[4009900.302600] svc: transport ee5fd600 busy, not enqueued
[4009900.306622] svc: transport ee5fd600 busy, not enqueued
[4009900.310604] svc: server ee978000 waiting for data (to = 90)
[4009900.346682] svc: transport ee5fd600 dequeued, inuse=1
[4009900.350609] svc: server ee8fe000, pool 0, transport ee5fd600, inuse=2
[4009900.354606] svc: tcp_recv ee5fd600 data 1 conn 0 close 0
[4009900.358626] svc: socket ee5fd600 recvfrom(ee5fd764, 0) = 4
[4009900.362603] svc: TCP record, 148 bytes
[4009900.366622] svc: socket ee5fd600 recvfrom(eec92094, 3948) = 148
[4009900.370604] svc: TCP complete record (148 bytes)
[4009900.374612] svc: transport ee5fd600 put into queue
[4009900.378610] svc: got len=148
[4009900.382652] svc: transport ee5fd600 busy, not enqueued
[4009900.386829] svc: socket ee5fd600 sendto([e98bf000 232... ], 232) = 232 
(addr 192.168.1.5, port=910)
[4009900.390688] svc: socket c1888960 TCP data ready (svsk ee5fd600)
[4009900.394606] svc: transport ee5fd600 busy, not enqueued
[4009900.398621] svc: transport ee5fd600 busy, not 

Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-31 Thread J. Bruce Fields
On Sat, Aug 30, 2008 at 04:47:41PM +0100, Ian Campbell wrote:
 On Wed, 2008-08-27 at 09:43 -0500, Tom Tucker wrote:
  Sure. I've actually tried to reproduce it here unsuccessfully.
  
  As a starter, I would suggest turning on transport debugging:
  
  # echo 256  /proc/sys/sunrpc/rpc_debug
  [...]
  If Ian is willing to create the log (or already has one), I'm
  certainly willing to look at it.
 
 It produced only the following (is that what was expected?):
 
 [146866.448112] -pid- proc flgs status -client- -prog- --rqstp- -timeout 
 -rpcwait -action- ---ops--
 [146866.448112] 30576 0001 00a0  0 f77a1600 13 f790334015000 
 xprt_pending fa0ba88e fa0c9df4
 [146866.448112] 30577 0004 0080-11 f77a1600 13 f79030000 
 xprt_sending fa0ba88e fa0c9df4

It's normal to get something like that when you turn it on, yes (unless
someone else spots anything odd about that...) but what's really needed
is to turn this on and then reproduce the problem--it's the debugging
output that goes to the logs during the problem that'll be interesting.

--b.

-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-31 Thread J. Bruce Fields
On Sun, Aug 31, 2008 at 08:44:02PM +0100, Ian Campbell wrote:
 On Sun, 2008-08-31 at 15:30 -0400, J. Bruce Fields wrote:
  On Sat, Aug 30, 2008 at 04:47:41PM +0100, Ian Campbell wrote:
   On Wed, 2008-08-27 at 09:43 -0500, Tom Tucker wrote:
Sure. I've actually tried to reproduce it here unsuccessfully.

As a starter, I would suggest turning on transport debugging:

# echo 256  /proc/sys/sunrpc/rpc_debug
[...]
If Ian is willing to create the log (or already has one), I'm
certainly willing to look at it.
   
   It produced only the following (is that what was expected?):
   
   [146866.448112] -pid- proc flgs status -client- -prog- --rqstp- -timeout 
   -rpcwait -action- ---ops--
   [146866.448112] 30576 0001 00a0  0 f77a1600 13 f790334015000 
   xprt_pending fa0ba88e fa0c9df4
   [146866.448112] 30577 0004 0080-11 f77a1600 13 f79030000 
   xprt_sending fa0ba88e fa0c9df4
  
  It's normal to get something like that when you turn it on, yes (unless
  someone else spots anything odd about that...) but what's really needed
  is to turn this on and then reproduce the problem--it's the debugging
  output that goes to the logs during the problem that'll be interesting.
 
 That's what I did. The first time I did the echo I just got the header
 line, then I waited for the repro and since there had been no further
 logging I ran the echo again and got the three lines above.
 
 Sounds like you expected there to be more and ongoing logging?

Yes.  It's the server side we're interested in, by the way--are you
collecting this deubgging on the client?

--b.

-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-27 Thread Tom Tucker
J. Bruce Fields wrote:
 On Mon, Aug 25, 2008 at 11:04:08AM -0500, Tom Tucker wrote:
 Trond Myklebust wrote:
 On Sun, 2008-08-24 at 23:09 +0100, Ian Campbell wrote:
 (added some quoting from previous mail to save replying twice)

 On Sun, 2008-08-24 at 15:19 -0400, Trond Myklebust wrote:
 On Sun, 2008-08-24 at 15:17 -0400, Trond Myklebust wrote:
 From the tcpdump, it looks as if the NFS server is failing to close the
 socket, when the client closes its side. You therefore end up getting
 stuck in the FIN_WAIT2 state (as netstat clearly shows above).

 Is the server keeping the client in this state for a very long
 period?
 Well, it had been around an hour and a half on this occasion. Next time
 it happens I can wait longer but I'm pretty sure I've come back from
 time away and it's been wedged for at least a day. How long would you
 expect it to remain in this state for?
 The server should ideally start to close the socket as soon as it
 receives the FIN from the client. I'll have a look at the code.

 I don't think it should matter how long the connection stays in FIN WAIT,
 the client should reconnect anyway.

 Since the client seems to be the variable, I would think it might be an
 issue with the client reconnect logic?

 That said, 2.6.25 is when the server side transport switch logic went in.
 
 Any chance you could help Trond figure out why the server might be doing
 this?
 
 If not, I'll get to it, but not as soon as I should.
 

Sure. I've actually tried to reproduce it here unsuccessfully.

As a starter, I would suggest turning on transport debugging:

# echo 256  /proc/sys/sunrpc/rpc_debug

Here are my thoughts on how it is supposed to work. Trond if this doesn't match
your understanding, please let me know.

For the case where the client closes the connection first
because it's timeout is shorter (5min vs. 6), the sequence of events should be:


client  server

close
sends FIN goes to FIN-WAIT-1
receives FIN replies with ACK
and goes to CLOSE-WAIT
receives ACK goes to FIN-WAIT-2
calls tcp_state_change callback on 
socket
svc_tcp_state_change sets XPT_CLOSE on
the transport and enqueues transport for
servicing by svc thread.

thread gets dequeued, calls svc_recv
svc_recv sees XPT_CLOSE and calls
svc_xprt_destroy that closes the socket

TCP sends FIN/ACK after close
receives FIN/ACK goes to TIME-WAIT

since state is TIME-WAIT and
reuse port is set, we can re-connect

There's a couple places we could be getting stuck:

- The state-change callback never gets called.
- The state-change callback gets called but svsk is null and it ignores the 
event
- svc_tcp_state_change enqueues the transport, but due to a race bug, the 
transport
   doesn't actually get queued and since there is no activity it never closes
- something else

The message log with RPCDBG_SVCXPRT (256) enabled as shown above would help 
figure
out where it's happening.

If Ian is willing to create the log (or already has one), I'm certainly willing 
to
look at it.

Tom


 --b.
 --
 To unsubscribe from this list: send the line unsubscribe linux-nfs in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-26 Thread J. Bruce Fields
On Mon, Aug 25, 2008 at 11:04:08AM -0500, Tom Tucker wrote:
 Trond Myklebust wrote:
 On Sun, 2008-08-24 at 23:09 +0100, Ian Campbell wrote:
 (added some quoting from previous mail to save replying twice)

 On Sun, 2008-08-24 at 15:19 -0400, Trond Myklebust wrote:
 On Sun, 2008-08-24 at 15:17 -0400, Trond Myklebust wrote:
 From the tcpdump, it looks as if the NFS server is failing to close the
 socket, when the client closes its side. You therefore end up getting
 stuck in the FIN_WAIT2 state (as netstat clearly shows above).

 Is the server keeping the client in this state for a very long
 period?
 Well, it had been around an hour and a half on this occasion. Next time
 it happens I can wait longer but I'm pretty sure I've come back from
 time away and it's been wedged for at least a day. How long would you
 expect it to remain in this state for?

 The server should ideally start to close the socket as soon as it
 receives the FIN from the client. I'll have a look at the code.


 I don't think it should matter how long the connection stays in FIN WAIT,
 the client should reconnect anyway.

 Since the client seems to be the variable, I would think it might be an
 issue with the client reconnect logic?

 That said, 2.6.25 is when the server side transport switch logic went in.

Any chance you could help Trond figure out why the server might be doing
this?

If not, I'll get to it, but not as soon as I should.

--b.

-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-25 Thread Tom Tucker
Trond Myklebust wrote:
 On Sun, 2008-08-24 at 23:09 +0100, Ian Campbell wrote:
 (added some quoting from previous mail to save replying twice)

 On Sun, 2008-08-24 at 15:19 -0400, Trond Myklebust wrote:
 On Sun, 2008-08-24 at 15:17 -0400, Trond Myklebust wrote:
 From the tcpdump, it looks as if the NFS server is failing to close the
 socket, when the client closes its side. You therefore end up getting
 stuck in the FIN_WAIT2 state (as netstat clearly shows above).

 Is the server keeping the client in this state for a very long
 period?
 Well, it had been around an hour and a half on this occasion. Next time
 it happens I can wait longer but I'm pretty sure I've come back from
 time away and it's been wedged for at least a day. How long would you
 expect it to remain in this state for?
 
 The server should ideally start to close the socket as soon as it
 receives the FIN from the client. I'll have a look at the code.
 

I don't think it should matter how long the connection stays in FIN WAIT,
the client should reconnect anyway.

Since the client seems to be the variable, I would think it might be an
issue with the client reconnect logic?

That said, 2.6.25 is when the server side transport switch logic went in.

Tom

 Trond
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-nfs in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-25 Thread Trond Myklebust
On Mon, 2008-08-25 at 11:04 -0500, Tom Tucker wrote:
 I don't think it should matter how long the connection stays in FIN WAIT,
 the client should reconnect anyway.

Which would break the replay cache, since we can't reconnect using the
same port.

 Since the client seems to be the variable, I would think it might be an
 issue with the client reconnect logic?

No.

Trond


-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-25 Thread Trond Myklebust
On Tue, 2008-08-26 at 06:23 +1000, Grant Coady wrote:
 On Fri, 22 Aug 2008 14:56:53 -0700, Trond Myklebust [EMAIL PROTECTED] wrote:
 
 On Fri, 2008-08-22 at 22:37 +0100, Ian Campbell wrote:
  I can ssh to the server fine. The same server also serves my NFS home
  directory to the box I'm writing this from and I've not seen any trouble
  with this box at all, it's a 2.6.18-xen box.
 
 OK... Are you able to reproduce the problem reliably?
 
 If so, can you provide me with a binary tcpdump or wireshark dump? If
 using tcpdump, then please use something like
 
tcpdump -w /tmp/dump.out -s 9 host myserver.foo.bar and port 2049
  --  typo?

No. The intention was to record _all_ the info in the packet for
analysis, not just random header info.

Trond


-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-25 Thread Muntz, Daniel
Try '-s 0', from tcpdump(8): Setting snaplen to 0 means use the
required length to catch whole packets.

  -Dan

-Original Message-
From: Grant Coady [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 25, 2008 5:29 PM
To: Trond Myklebust
Cc: Grant Coady; Ian Campbell; John Ronciak;
[EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]; Jeff Kirsher; Jesse Brandeburg; Bruce Allan;
PJ Waskiewicz; John Ronciak; e1000-devel@lists.sourceforge.net
Subject: Re: NFS regression? Odd delays and lockups accessing an NFS
export.

On Mon, 25 Aug 2008 18:11:12 -0400, Trond Myklebust
[EMAIL PROTECTED] wrote:

On Tue, 2008-08-26 at 06:23 +1000, Grant Coady wrote:
 On Fri, 22 Aug 2008 14:56:53 -0700, Trond Myklebust
[EMAIL PROTECTED] wrote:
 
 On Fri, 2008-08-22 at 22:37 +0100, Ian Campbell wrote:
  I can ssh to the server fine. The same server also serves my NFS 
  home directory to the box I'm writing this from and I've not seen 
  any trouble with this box at all, it's a 2.6.18-xen box.
 
 OK... Are you able to reproduce the problem reliably?
 
 If so, can you provide me with a binary tcpdump or wireshark dump? 
 If using tcpdump, then please use something like
 
tcpdump -w /tmp/dump.out -s 9 host myserver.foo.bar and port 
  2049
  --  typo?

No. The intention was to record _all_ the info in the packet for 
analysis, not just random header info.

Hi Trond,

My tcpdump seems to have a 16 bit snaplen counter:

~# tcpdump -w /tmp/dump.out -s 65535 host deltree and port 2049
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size
65535 bytes ^C0 packets captured
4 packets received by filter
0 packets dropped by kernel

~# tcpdump -w /tmp/dump.out -s 65536 host deltree and port 2049
tcpdump: invalid snaplen 65536

~# tcpdump --version
tcpdump version 3.9.8
libpcap version 0.9.8


So I'm now using:
~# tcpdump -w /tmp/dump.out -s 65535 -C 10 -W 100 host deltree and port
2049
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size
65535 bytes

to get a 1GB round-robin trace buffer, I can stop the trace when problem
noticed, as it is so long between delay/stall happenings.  Then I'll try
to trigger the thing.

Is this the correct style of trace you are expecting?

~$ /usr/sbin/tcpdump -r /tmp/dump.out00
reading from file /tmp/dump.out00, link-type EN10MB (Ethernet)
10:13:49.719781 IP pooh64.mire.mine.nu.2156510591 
deltree.mire.mine.nu.nfs: 116 access fh 0,1/218104576 001f
10:13:49.720215 IP deltree.mire.mine.nu.nfs 
pooh64.mire.mine.nu.2156510591: reply ok 124 access c 001f
10:13:49.720225 IP pooh64.mire.mine.nu.984  deltree.mire.mine.nu.nfsd:
. ack 1649405551 win 5840
10:13:49.720288 IP pooh64.mire.mine.nu.2173287807 
deltree.mire.mine.nu.nfs: 136 readdirplus fh 0,1/218104576 512 bytes @ 0
10:13:49.742450 IP deltree.mire.mine.nu.nfs 
pooh64.mire.mine.nu.2173287807: reply ok 1460 readdirplus


Is there some test suite I can use?  Compiling kernels over NFS worked
fine yesterday, apart from the fastest box' make complaining about clock
skew.  The kernel booted though, so that was okay.  

Guess it's back to the interactive editing over NFS and see if the thing
manifest the delay/stalls again, I'm on the .27-rc4-git4 kernel as soon
as it compiles for the client, NFS server is 2.6.24.7 at the moment.

Grant.
--
To unsubscribe from this list: send the line unsubscribe linux-nfs in
the body of a message to [EMAIL PROTECTED] More majordomo info
at  http://vger.kernel.org/majordomo-info.html

-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-25 Thread Grant Coady
On Mon, 25 Aug 2008 18:11:12 -0400, Trond Myklebust [EMAIL PROTECTED] wrote:

On Tue, 2008-08-26 at 06:23 +1000, Grant Coady wrote:
 On Fri, 22 Aug 2008 14:56:53 -0700, Trond Myklebust [EMAIL PROTECTED] 
 wrote:
 
 On Fri, 2008-08-22 at 22:37 +0100, Ian Campbell wrote:
  I can ssh to the server fine. The same server also serves my NFS home
  directory to the box I'm writing this from and I've not seen any trouble
  with this box at all, it's a 2.6.18-xen box.
 
 OK... Are you able to reproduce the problem reliably?
 
 If so, can you provide me with a binary tcpdump or wireshark dump? If
 using tcpdump, then please use something like
 
tcpdump -w /tmp/dump.out -s 9 host myserver.foo.bar and port 2049
  --  typo?

No. The intention was to record _all_ the info in the packet for
analysis, not just random header info.

I booted 2.6.27-rc4-git4a and started the tcpdump trace as described in 
last email, got an immediate stall which cleared after 10-20 seconds, on 
running a small script exported from the NFS server as user:

~$ cat /home/common/bin/boot-report
#! /bin/bash
#
version=$(uname -r)

dmesg /home/common/dmesg-$version

# 2006-12-15 no more filter .config
#grep = /boot/config-$version /home/common/config-$version
cp /boot/config-$version /home/common


The script had not created the file on the server when the command prompt 
returned, during the stall I also opened another client terminal and ran 
'df', it stalled too, and both client terminals came back to prompt after 
the 10-20s delay.  Then I hit Ctrl-C on the tcpdump terminal to prepare 
this report.

The trace is only 27 lines, deltree is the server, pooh64 the client:
[EMAIL PROTECTED]:~$ /usr/sbin/tcpdump -r /tmp/dump.out00
reading from file /tmp/dump.out00, link-type EN10MB (Ethernet)
10:42:14.061410 IP pooh64.mire.mine.nu.0  deltree.mire.mine.nu.nfs: 0 null
10:42:14.062691 IP deltree.mire.mine.nu.nfsd  pooh64.mire.mine.nu.839: S 
2746115414:2746115414(0) ack 3659136038 win 5840 mss 1460,nop,nop,sackOK
10:42:14.062708 IP pooh64.mire.mine.nu.839  deltree.mire.mine.nu.nfsd: . ack 1 
win 5840
10:42:14.062750 IP pooh64.mire.mine.nu.1392480046  deltree.mire.mine.nu.nfs: 
112 getattr fh 0,1/218104576
10:42:14.062765 IP pooh64.mire.mine.nu.1409257262  deltree.mire.mine.nu.nfs: 
112 fsstat fh 0,1/218104576
10:42:14.062944 IP deltree.mire.mine.nu.nfsd  pooh64.mire.mine.nu.839: . ack 
113 win 5840
10:42:14.062951 IP deltree.mire.mine.nu.nfsd  pooh64.mire.mine.nu.839: . ack 
225 win 5840
10:42:14.063873 IP deltree.mire.mine.nu.nfs  pooh64.mire.mine.nu.1392480046: 
reply ok 116 getattr DIR 40755 ids 1000/10 sz 448
10:42:14.063881 IP pooh64.mire.mine.nu.839  deltree.mire.mine.nu.nfsd: . ack 
117 win 5840
10:42:14.063890 IP deltree.mire.mine.nu.nfs  pooh64.mire.mine.nu.1409257262: 
reply ok 88 fsstat tbytes 8422359040 fbytes 6605647872 abytes 6605647872
10:42:14.063893 IP pooh64.mire.mine.nu.839  deltree.mire.mine.nu.nfsd: . ack 
205 win 5840
10:42:14.064319 IP pooh64.mire.mine.nu.1426034478  deltree.mire.mine.nu.nfs: 
116 access fh 0,1/218104576 001f
10:42:14.064677 IP deltree.mire.mine.nu.nfs  pooh64.mire.mine.nu.1426034478: 
reply ok 124 access c 001f
10:42:14.064732 IP pooh64.mire.mine.nu.1442811694  deltree.mire.mine.nu.nfs: 
136 readdirplus fh 0,1/218104576 512 bytes @ 0
10:42:14.067193 IP deltree.mire.mine.nu.nfs  pooh64.mire.mine.nu.1442811694: 
reply ok 1460 readdirplus
10:42:14.067203 IP pooh64.mire.mine.nu.839  deltree.mire.mine.nu.nfsd: . ack 
1789 win 8760
10:42:14.067208 IP deltree.mire.mine.nu.nfs  pooh64.mire.mine.nu.841889326: 
reply ok 1324
10:42:14.067328 IP pooh64.mire.mine.nu.1459588910  deltree.mire.mine.nu.nfs: 
112 getattr fh 
Unknown/0103000D022401060003000D0200FE6F
10:42:14.068572 IP deltree.mire.mine.nu.nfs  pooh64.mire.mine.nu.1459588910: 
reply ok 116 getattr DIR 40755 ids 1000/10 sz 448
10:42:14.068619 IP pooh64.mire.mine.nu.1476366126  deltree.mire.mine.nu.nfs: 
136 getattr fh 
Unknown/01060003000D0200676A020002000100
10:42:14.070615 IP deltree.mire.mine.nu.nfs  pooh64.mire.mine.nu.1476366126: 
reply ok 116 getattr DIR 40755 ids 1000/10 sz 360
10:42:14.070652 IP pooh64.mire.mine.nu.1493143342  deltree.mire.mine.nu.nfs: 
140 access fh 
Unknown/01060003000D0200676A020002000100 001f
10:42:14.071576 IP deltree.mire.mine.nu.nfs  pooh64.mire.mine.nu.1493143342: 
reply ok 124 access c 001f
10:42:14.071638 IP pooh64.mire.mine.nu.1509920558  deltree.mire.mine.nu.nfs: 
160 readdirplus fh 
Unknown/01060003000D0200676A020002000100 512 
bytes @ 0
10:42:14.073382 IP deltree.mire.mine.nu.nfs  pooh64.mire.mine.nu.1509920558: 
reply ok 1460 readdirplus
10:42:14.074096 IP deltree.mire.mine.nu.nfs  pooh64.mire.mine.nu.0: reply 
Unknown rpc response code=6 516
10:42:14.074102 IP pooh64.mire.mine.nu.839  deltree.mire.mine.nu.nfsd: . ack 
5445 win 17520

Sorry I forgot to do 

Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-24 Thread Trond Myklebust
On Sun, 2008-08-24 at 19:53 +0100, Ian Campbell wrote:
 On Fri, 2008-08-22 at 14:56 -0700, Trond Myklebust wrote:
  On Fri, 2008-08-22 at 22:37 +0100, Ian Campbell wrote:
   I can ssh to the server fine. The same server also serves my NFS home
   directory to the box I'm writing this from and I've not seen any trouble
   with this box at all, it's a 2.6.18-xen box.
  
  OK... Are you able to reproduce the problem reliably?
  
  If so, can you provide me with a binary tcpdump or wireshark dump? If
  using tcpdump, then please use something like
  
 tcpdump -w /tmp/dump.out -s 9 host myserver.foo.bar and port 2049
  
  Please also try to provide a netstat dump of the current TCP connections
  as soon as the hang occurs:
  
netstat -t
 
 Aug 24 18:08:59 iranon kernel: [168839.556017] nfs: server hopkins not 
 responding, still trying
 but I wasn't around until 19:38 to spot it.
 
 netstat when I got to it was:
 
 Proto Recv-Q Send-Q Local Address   Foreign Address State 
  
 tcp0  0 localhost.localdo:50891 localhost.localdom:6543 
 ESTABLISHED
 tcp1  0 iranon.hellion.org.:ssh azathoth.hellion.:52682 
 CLOSE_WAIT 
 tcp0  0 localhost.localdom:6543 localhost.localdo:50893 
 ESTABLISHED
 tcp0  0 iranon.hellion.org.:837 hopkins.hellion.org:nfs FIN_WAIT2 
  
 tcp0  0 localhost.localdom:6543 localhost.localdo:41831 
 ESTABLISHED
 tcp0  0 localhost.localdo:13666 localhost.localdo:59482 
 ESTABLISHED
 tcp0  0 localhost.localdo:34288 localhost.localdom:6545 
 ESTABLISHED
 tcp0  0 iranon.hellion.org.:ssh azathoth.hellion.:48977 
 ESTABLISHED
 tcp0  0 iranon.hellion.org.:ssh azathoth.hellion.:52683 
 ESTABLISHED
 tcp0  0 localhost.localdom:6545 localhost.localdo:34288 
 ESTABLISHED
 tcp0  0 localhost.localdom:6543 localhost.localdo:50891 
 ESTABLISHED
 tcp0  0 localhost.localdo:50893 localhost.localdom:6543 
 ESTABLISHED
 tcp0  0 localhost.localdo:41831 localhost.localdom:6543 
 ESTABLISHED
 tcp0 87 localhost.localdo:59482 localhost.localdo:13666 
 ESTABLISHED
 tcp1  0 localhost.localdom:6543 localhost.localdo:41830 
 CLOSE_WAIT 
 
 (iranon is the problematic host .4, azathoth is my desktop machine .5, 
 hopkins is the NFS server .6)
 
 tcpdumps are pretty big. I've attached the last 100 packets captured. If
 you need more I can put the full file up somewhere.
 
 -rw-r--r-- 1 root   root   1.3G Aug 24 17:57 dump.out0
 -rw-r--r-- 1 root   root   536M Aug 24 19:38 dump.out1
 
 Ian.

From the tcpdump, it looks as if the NFS server is failing to close the
socket, when the client closes its side. You therefore end up getting
stuck in the FIN_WAIT2 state (as netstat clearly shows above).

Is the server keeping the client in this state for a very long period?

Cheers
  Trond


-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-24 Thread Trond Myklebust
On Sun, 2008-08-24 at 23:09 +0100, Ian Campbell wrote:
 (added some quoting from previous mail to save replying twice)
 
 On Sun, 2008-08-24 at 15:19 -0400, Trond Myklebust wrote:
  On Sun, 2008-08-24 at 15:17 -0400, Trond Myklebust wrote:
   From the tcpdump, it looks as if the NFS server is failing to close the
   socket, when the client closes its side. You therefore end up getting
   stuck in the FIN_WAIT2 state (as netstat clearly shows above).
  
   Is the server keeping the client in this state for a very long
   period?
 
 Well, it had been around an hour and a half on this occasion. Next time
 it happens I can wait longer but I'm pretty sure I've come back from
 time away and it's been wedged for at least a day. How long would you
 expect it to remain in this state for?

The server should ideally start to close the socket as soon as it
receives the FIN from the client. I'll have a look at the code.

Trond


-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-22 Thread John Ronciak
On Fri, Aug 22, 2008 at 11:13 AM, Ian Campbell [EMAIL PROTECTED] wrote:

 That's probably also a networking device driver issue candidate: your
 RPC task is queued up waiting to be sent.

 What networking card+device driver are you using here?

 # ethtool -i eth0
 driver: e1000
 version: 7.3.20-k2-NAPI
 firmware-version: N/A
 bus-info: :01:0a.0
There is nothing indicating that the NIC/driver is causing any sort of
problem here, at least not with what has been presented so far.  When
the NFS mount isn't working is the networking still active and
working?

-- 
Cheers,
John

-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-22 Thread John Ronciak
On Fri, Aug 22, 2008 at 1:00 PM, Ian Campbell [EMAIL PROTECTED] wrote:

 So far as I can tell, yes. I can login via ssh so long as the user
 doesn't have NFS $HOME, I haven't tried much else and the box isn't
 locked up at the moment, I'd bet it's fine though.
Then it's highly unlikely that the NIC/driver is causing your problem.
 We'll continue to monitor this thread waiting for any info that this
is somehow NIC/driver related.

Good luck.



-- 
Cheers,
John

-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-22 Thread Trond Myklebust
On Fri, 2008-08-22 at 21:00 +0100, Ian Campbell wrote:
 On Fri, 2008-08-22 at 12:33 -0700, John Ronciak wrote:
  On Fri, Aug 22, 2008 at 11:13 AM, Ian Campbell [EMAIL PROTECTED] wrote:
  
   That's probably also a networking device driver issue candidate: your
   RPC task is queued up waiting to be sent.
  
   What networking card+device driver are you using here?
  
   # ethtool -i eth0
   driver: e1000
   version: 7.3.20-k2-NAPI
   firmware-version: N/A
   bus-info: :01:0a.0
  There is nothing indicating that the NIC/driver is causing any sort of
  problem here, at least not with what has been presented so far.  When
  the NFS mount isn't working is the networking still active and
  working?
 
 So far as I can tell, yes. I can login via ssh so long as the user
 doesn't have NFS $HOME, I haven't tried much else and the box isn't
 locked up at the moment, I'd bet it's fine though.

...and the server? Something is preventing that RPC payload from being
delivered...

Trond


-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-22 Thread Trond Myklebust
On Fri, 2008-08-22 at 22:37 +0100, Ian Campbell wrote:
 I can ssh to the server fine. The same server also serves my NFS home
 directory to the box I'm writing this from and I've not seen any trouble
 with this box at all, it's a 2.6.18-xen box.

OK... Are you able to reproduce the problem reliably?

If so, can you provide me with a binary tcpdump or wireshark dump? If
using tcpdump, then please use something like

   tcpdump -w /tmp/dump.out -s 9 host myserver.foo.bar and port 2049

Please also try to provide a netstat dump of the current TCP connections
as soon as the hang occurs:

  netstat -t

Cheers
  Trond


-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK  win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100url=/
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel