Bug#695492: CIFS mount fails if I ctrl-c a long-running find process (Linux mounting Windows share)

2013-01-25 Thread John Darrah
On Sun, Jan 13, 2013 at 05:46:35AM -0500, Jeff Layton wrote:
 On Sat, 12 Jan 2013 10:28:01 -0800
 John Darrah xyl...@gmail.com wrote:
 
  
  Is there a command or kernel magic that can force a dump to
  see where the contention is that is causing the hang?
  
  Also, I just tried starting the VM and mounting the CIFS
  drives and then just letting it sit there without running
  anything to touch the drives, and they still hang. So this
  means the CTRL-C thing has nothing to do with it.
  
 
 Ok, so it sounds like the original bug is now fixed with the patch I
 proposed. This other thing sounds like it warrants a new bug. When you
 say it hangs, does the whole box hang or is it just processes that
 touch the cifs mount?
 
 If you know the pid of the hung process, you can look at
 /proc/pid/stack to see what it's doing. There are also things like
 sysrq-t. You can also set up kdump and force a crash on a machine to
 get a coredump, and then try to analyze it to figure out why it's hung.
 

I have looked at this several times, but all I can come
up with is the contents of /proc/pid/stack. There is an 'ls'
command that is waiting for something. I can see some CIFS
stuff but I have no idea what I'm looking at. This was taken
after about 30 minutes in the hung state.


[c11fee0d] kernel_setsockopt+0x34/0x46
[f86925c7] smb_send_rqst+0x107/0x170 [cifs]
[c1035b66] prepare_to_wait+0x12/0x37
[f86922d8] wait_for_response.isra.8+0x6d/0xc2 [cifs]
[c1035af9] autoremove_wake_function+0x0/0x29
[f8692c7a] SendReceive+0x141/0x1f1 [cifs]
[f867b793] CIFSSMBNegotiate+0x17c/0x6bf [cifs]
[f8697fc3] cifs_negotiate+0xb/0x31 [cifs]
[f8686557] cifs_negotiate_protocol+0x3b/0x62 [cifs]
[f867b471] cifs_reconnect_tcon+0x16f/0x235 [cifs]
[c10870ff] prep_new_page+0xac/0xe0
[f867b550] smb_init+0x19/0x58 [cifs]
[f867f815] CIFSSMBQPathInfo+0x4c/0x1e2 [cifs]
[f8697eb4] cifs_query_path_info+0x26/0x5a [cifs]
[f868e327] cifs_get_inode_info+0x10d/0x4a1 [cifs]
[c10a9e6a] __kmalloc+0x8d/0x99
[f86875e9] build_path_from_dentry+0xab/0x182 [cifs]
[f868761b] build_path_from_dentry+0xdd/0x182 [cifs]
[f868f8e2] cifs_revalidate_dentry_attr+0xd7/0x131 [cifs]
[f868f965] cifs_revalidate_dentry+0x9/0x1d [cifs]
[f8687497] cifs_d_revalidate+0x13/0x6e [cifs]
[c10b5c84] d_revalidate+0x5/0x6
[c10b6922] lookup_fast+0x169/0x1ed
[c10b6c13] walk_component+0x2e/0x144
[c10b7288] link_path_walk+0x32c/0x3ca
[c10b764f] path_lookupat+0x4d/0x251
[c10b7872] filename_lookup+0x1f/0x6c
[c10b93bf] user_path_at_empty+0x59/0x81
[c10b6201] vfs_readlink+0x2d/0x3c
[c10b6256] generic_readlink+0x46/0x6a
[c10b93f2] user_path_at+0xb/0xe
[c10b2d24] vfs_fstatat+0x33/0x61
[c10b2d77] vfs_stat+0x10/0x12
[c10b31e5] sys_stat64+0xe/0x21
[c10bd157] dput+0x16/0x96
[c10b31af] sys_readlinkat+0x82/0x93
[c10b31d3] sys_readlink+0x13/0x17
[c12a2a7f] syscall_call+0x7/0xb
[] 0x
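
One way to make sense of a dump like the one above is to pull out the function and module of each frame and look only at the [cifs] entries. The sketch below assumes the line layout seen in this trace (address in brackets, then symbol+offset/size, then an optional module name); parse_stack and cifs_frames are hypothetical helper names, not part of any existing tool.

```python
import re

# Matches lines such as "[f86925c7] smb_send_rqst+0x107/0x170 [cifs]".
# The optional "<" and ">" also allow the "[<addr>]" form.
FRAME_RE = re.compile(
    r"^\[<?([0-9a-f]+)>?\]\s+(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\S+)\])?\s*$"
)


def parse_stack(text):
    """Return (function, module) pairs; module is None for core kernel symbols."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:  # lines that do not fit the assumed layout are skipped
            frames.append((match.group(2), match.group(3)))
    return frames


def cifs_frames(text, module="cifs"):
    """Function names belonging to one module, in the order they appear."""
    return [func for func, mod in parse_stack(text) if mod == module]
```

Applied to the trace above, the cifs entries read, innermost first, as a wait for a response to a negotiate request issued from cifs_reconnect_tcon, which suggests the 'ls' is stuck in the reconnect path rather than in the original stat request.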


Sorry I can't be more help.

-- john


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#695492: CIFS mount fails if I ctrl-c a long-running find process (Linux mounting Windows share)

2013-01-13 Thread Jeff Layton
On Sat, 12 Jan 2013 10:28:01 -0800
John Darrah xyl...@gmail.com wrote:

 On Fri, Jan 11, 2013 at 08:27:16AM -0500, Jeff Layton wrote:
  On Thu, 10 Jan 2013 20:29:43 -0800
  John Darrah xyl...@gmail.com wrote:
  
   On Fri, Jan 04, 2013 at 07:09:33AM -0500, Jeff Layton wrote:
On Thu, 3 Jan 2013 21:29:22 -0800
John Darrah xyl...@gmail.com wrote:

 On Sat, Dec 29, 2012 at 12:26:07PM +0100, Ben Hutchings wrote:
  On Fri, 2012-12-28 at 22:01 -0500, Jeff Layton wrote:
   On Sat, 29 Dec 2012 01:24:36 +0100
   Ben Hutchings b...@decadent.org.uk wrote:
   
On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:
 On Sun, 23 Dec 2012 09:10:34 -0500
 Jeff Layton jlay...@redhat.com wrote:
[...]
  I had a look at the code today and suspect that I know what 
  the problem
  is. When the kernel goes to send a request, it first signs 
  it and then
  bumps the sequence numbers that it tracks. If the request 
  doesn't
  actually make it out onto the wire, like when the task 
  catches a
  signal, those sequence numbers remain high even though the 
  request
  didn't go out.
  
  Here's an untested patch that might help tell whether this 
  is the
  case. You may want to try it and see if it does. Note that 
  this fix is
  a bit of a kludge and is not suitable for merging!
  
  A better fix would involve changing when the sequence 
  number gets
  bumped in the first place. If this patch seems to help 
  things, then
  I'll look at coding that up.
[...]
 I was able to reproduce this, and I don't think the above 
 patch will
 fix it (at least not completely). The problem seems to be 
 that the NT
 cancel command is screwing up the sequence numbers. We'll 
 have to do
 some research to figure out why that's occurring.

Jeff, we got a bug report in Debian which seems to be the same 
problem:
http://bugs.debian.org/695492.  Please cc John Darrah and the 
bug
address as above.

Ben.

   
   You may want to try this patch. It seems to fix the problem for 
   me, but
   I think there is probably some more work to do in this area.
   
   http://www.spinics.net/lists/linux-cifs/msg07576.html
   
  
  John, you can test this patch by following instructions at
  http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official.
  
  Please reply-to-all to Jeff's message when you have a result.
  
  Ben.
  
 
   
   OK... I built a 3.2.35 kernel with the patch to transport.c 
   and also a 3.7.1 with the patch to smb1ops.c and loaded them 
   into my wheezy VM. I tested both by starting commands to 
   frob the CIFS mounts and then typing a CTRL-C to kill the 
   command, and they were stable (at least 50 attempts using 
   each kernel with the CTRL-C fired at random times into the 
   running command).
   
   But... now another issue affects both kernels. It seems that 
   after 10 to 15 minutes of non use, the mount hangs and the 
   command accessing the mount can only be killed with a 
   SIGKILL... but only sometimes. Sometimes only a reboot 
   would unwedge things.
   
   It seems when the mount would hang, I would get the:
 CIFS VFS: Server amifile01 has not responded in 300 seconds. 
   Reconnecting...
   error except the 3.7 kernel reported 120 seconds instead of 
   the 300 seconds noted above.
   
  
  Interesting, I haven't noticed that issue, but I'll try to reproduce it
  when I get a chance.
  
 
 Is there a command or kernel magic that can force a dump to
 see where the contention is that is causing the hang?
 
 Also, I just tried starting the VM and mounting the CIFS
 drives and then just letting it sit there without running
 anything to touch the drives, and they still hang. So this
 means the CTRL-C thing has nothing to do with it.
 

Ok, so it sounds like the original bug is now fixed with the patch I
proposed. This other thing sounds like it warrants a new bug. When you
say it hangs, does the whole box hang or is it just processes that
touch the cifs mount?

If you know the pid of the hung process, you can look at
/proc/pid/stack to see what it's doing. There are also things like
sysrq-t. You can also set up kdump and force a crash on a machine to
get a coredump, and then try to analyze it to figure out why it's hung.
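
The /proc/pid/stack suggestion can be tried without rebooting. A minimal sketch, assuming root privileges, a kernel that exposes /proc/<pid>/stack, and that the hung processes show up in state D; the helper names are hypothetical, and the proc_root parameter exists only so the code can be pointed at a test directory.

```python
from pathlib import Path


def read_kernel_stack(pid, proc_root="/proc"):
    """Return the lines of /proc/<pid>/stack, or an empty list if unreadable."""
    try:
        return (Path(proc_root) / str(pid) / "stack").read_text().splitlines()
    except OSError:
        # Process gone, insufficient privileges, or no stack file on this kernel.
        return []


def blocked_pids(proc_root="/proc"):
    """PIDs whose state field in /proc/<pid>/stat is 'D' (uninterruptible sleep)."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            stat = (entry / "stat").read_text()
            # The state is the first field after the parenthesised command name.
            state = stat.rsplit(")", 1)[1].split()[0]
        except (OSError, IndexError):
            continue
        if state == "D":
            pids.append(int(entry.name))
    return sorted(pids)
```

Calling read_kernel_stack(pid) for each pid returned by blocked_pids() gives one stack per stuck process. For sysrq-t, writing the letter t to /proc/sysrq-trigger as root dumps the state of all tasks to the kernel log, provided sysrq is enabled.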

-- 
Jeff Layton jlay...@redhat.com





Bug#695492: CIFS mount fails if I ctrl-c a long-running find process (Linux mounting Windows share)

2013-01-13 Thread John Darrah

On 1/13/2013 2:46 AM, Jeff Layton wrote:

On Sat, 12 Jan 2013 10:28:01 -0800
John Darrah xyl...@gmail.com wrote:


On Fri, Jan 11, 2013 at 08:27:16AM -0500, Jeff Layton wrote:

On Thu, 10 Jan 2013 20:29:43 -0800
John Darrah xyl...@gmail.com wrote:


On Fri, Jan 04, 2013 at 07:09:33AM -0500, Jeff Layton wrote:

On Thu, 3 Jan 2013 21:29:22 -0800
John Darrah xyl...@gmail.com wrote:


On Sat, Dec 29, 2012 at 12:26:07PM +0100, Ben Hutchings wrote:

On Fri, 2012-12-28 at 22:01 -0500, Jeff Layton wrote:

On Sat, 29 Dec 2012 01:24:36 +0100
Ben Hutchings b...@decadent.org.uk wrote:


On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:

On Sun, 23 Dec 2012 09:10:34 -0500
Jeff Layton jlay...@redhat.com wrote:

[...]

I had a look at the code today and suspect that I know what the problem
is. When the kernel goes to send a request, it first signs it and then
bumps the sequence numbers that it tracks. If the request doesn't
actually make it out onto the wire, like when the task catches a
signal, those sequence numbers remain high even though the request
didn't go out.

Here's an untested patch that might help tell whether this is the
case. You may want to try it and see if it does. Note that this fix is
a bit of a kludge and is not suitable for merging!

A better fix would involve changing when the sequence number gets
bumped in the first place. If this patch seems to help things, then
I'll look at coding that up.

[...]

I was able to reproduce this, and I don't think the above patch will
fix it (at least not completely). The problem seems to be that the NT
cancel command is screwing up the sequence numbers. We'll have to do
some research to figure out why that's occurring.

Jeff, we got a bug report in Debian which seems to be the same problem:
http://bugs.debian.org/695492.  Please cc John Darrah and the bug
address as above.

Ben.


You may want to try this patch. It seems to fix the problem for me, but
I think there is probably some more work to do in this area.

http://www.spinics.net/lists/linux-cifs/msg07576.html


John, you can test this patch by following instructions at
http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official.

Please reply-to-all to Jeff's message when you have a result.

Ben.


OK... I built a 3.2.35 kernel with the patch to transport.c
and also a 3.7.1 with the patch to smb1ops.c and loaded them
into my wheezy VM. I tested both by starting commands to
frob the CIFS mounts and then typing a CTRL-C to kill the
command, and they were stable (at least 50 attempts using
each kernel with the CTRL-C fired at random times into the
running command).

But... now another issue affects both kernels. It seems that
after 10 to 15 minutes of non use, the mount hangs and the
command accessing the mount can only be killed with a
SIGKILL... but only sometimes. Sometimes only a reboot
would unwedge things.

It seems when the mount would hang, I would get the:
   CIFS VFS: Server amifile01 has not responded in 300 seconds. Reconnecting...
error except the 3.7 kernel reported 120 seconds instead of
the 300 seconds noted above.


Interesting, I haven't noticed that issue, but I'll try to reproduce it
when I get a chance.


Is there a command or kernel magic that can force a dump to
see where the contention is that is causing the hang?

Also, I just tried starting the VM and mounting the CIFS
drives and then just letting it sit there without running
anything to touch the drives, and they still hang. So this
means the CTRL-C thing has nothing to do with it.


Ok, so it sounds like the original bug is now fixed with the patch I
proposed. This other thing sounds like it warrants a new bug. When you
say it hangs, does the whole box hang or is it just processes that
touch the cifs mount?

Yes, only the processes that touch the mount hang. If I make several
attempts at using SIGKILL, I can sometimes make the hung processes
die. Then I can unmount and remount the drives and they seem OK
until they hang again.



If you know the pid of the hung process, you can look at
/proc/pid/stack to see what it's doing. There are also things like
sysrq-t. You can also set up kdump and force a crash on a machine to
get a coredump, and then try to analyze it to figure out why it's hung.


I will attempt to get some useful info from one of the above suggestions.

-- john





Bug#695492: CIFS mount fails if I ctrl-c a long-running find process (Linux mounting Windows share)

2013-01-12 Thread John Darrah
On Fri, Jan 11, 2013 at 08:27:16AM -0500, Jeff Layton wrote:
 On Thu, 10 Jan 2013 20:29:43 -0800
 John Darrah xyl...@gmail.com wrote:
 
  On Fri, Jan 04, 2013 at 07:09:33AM -0500, Jeff Layton wrote:
   On Thu, 3 Jan 2013 21:29:22 -0800
   John Darrah xyl...@gmail.com wrote:
   
On Sat, Dec 29, 2012 at 12:26:07PM +0100, Ben Hutchings wrote:
 On Fri, 2012-12-28 at 22:01 -0500, Jeff Layton wrote:
  On Sat, 29 Dec 2012 01:24:36 +0100
  Ben Hutchings b...@decadent.org.uk wrote:
  
   On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:
On Sun, 23 Dec 2012 09:10:34 -0500
Jeff Layton jlay...@redhat.com wrote:
   [...]
 I had a look at the code today and suspect that I know what 
 the problem
 is. When the kernel goes to send a request, it first signs it 
 and then
 bumps the sequence numbers that it tracks. If the request 
 doesn't
 actually make it out onto the wire, like when the task 
 catches a
 signal, those sequence numbers remain high even though the 
 request
 didn't go out.
 
 Here's an untested patch that might help tell whether this is 
 the
 case. You may want to try it and see if it does. Note that 
 this fix is
 a bit of a kludge and is not suitable for merging!
 
 A better fix would involve changing when the sequence number 
 gets
 bumped in the first place. If this patch seems to help 
 things, then
 I'll look at coding that up.
   [...]
I was able to reproduce this, and I don't think the above patch 
will
fix it (at least not completely). The problem seems to be that 
the NT
cancel command is screwing up the sequence numbers. We'll have 
to do
some research to figure out why that's occurring.
   
   Jeff, we got a bug report in Debian which seems to be the same 
   problem:
   http://bugs.debian.org/695492.  Please cc John Darrah and the 
   bug
   address as above.
   
   Ben.
   
  
  You may want to try this patch. It seems to fix the problem for me, 
  but
  I think there is probably some more work to do in this area.
  
  http://www.spinics.net/lists/linux-cifs/msg07576.html
  
 
 John, you can test this patch by following instructions at
 http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official.
 
 Please reply-to-all to Jeff's message when you have a result.
 
 Ben.
 

  
  OK... I built a 3.2.35 kernel with the patch to transport.c 
  and also a 3.7.1 with the patch to smb1ops.c and loaded them 
  into my wheezy VM. I tested both by starting commands to 
  frob the CIFS mounts and then typing a CTRL-C to kill the 
  command, and they were stable (at least 50 attempts using 
  each kernel with the CTRL-C fired at random times into the 
  running command).
  
  But... now another issue affects both kernels. It seems that 
  after 10 to 15 minutes of non use, the mount hangs and the 
  command accessing the mount can only be killed with a 
  SIGKILL... but only sometimes. Sometimes only a reboot 
  would unwedge things.
  
  It seems when the mount would hang, I would get the:
CIFS VFS: Server amifile01 has not responded in 300 seconds. 
  Reconnecting...
  error except the 3.7 kernel reported 120 seconds instead of 
  the 300 seconds noted above.
  
 
 Interesting, I haven't noticed that issue, but I'll try to reproduce it
 when I get a chance.
 

Is there a command or kernel magic that can force a dump to
see where the contention is that is causing the hang?

Also, I just tried starting the VM and mounting the CIFS
drives and then just letting it sit there without running
anything to touch the drives, and they still hang. So this
means the CTRL-C thing has nothing to do with it.


-- john





Bug#695492: CIFS mount fails if I ctrl-c a long-running find process (Linux mounting Windows share)

2013-01-11 Thread Jeff Layton
On Thu, 10 Jan 2013 20:29:43 -0800
John Darrah xyl...@gmail.com wrote:

 On Fri, Jan 04, 2013 at 07:09:33AM -0500, Jeff Layton wrote:
  On Thu, 3 Jan 2013 21:29:22 -0800
  John Darrah xyl...@gmail.com wrote:
  
   On Sat, Dec 29, 2012 at 12:26:07PM +0100, Ben Hutchings wrote:
On Fri, 2012-12-28 at 22:01 -0500, Jeff Layton wrote:
 On Sat, 29 Dec 2012 01:24:36 +0100
 Ben Hutchings b...@decadent.org.uk wrote:
 
  On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:
   On Sun, 23 Dec 2012 09:10:34 -0500
   Jeff Layton jlay...@redhat.com wrote:
  [...]
I had a look at the code today and suspect that I know what the 
problem
is. When the kernel goes to send a request, it first signs it 
and then
bumps the sequence numbers that it tracks. If the request 
doesn't
actually make it out onto the wire, like when the task catches a
signal, those sequence numbers remain high even though the 
request
didn't go out.

Here's an untested patch that might help tell whether this is 
the
case. You may want to try it and see if it does. Note that this 
fix is
a bit of a kludge and is not suitable for merging!

A better fix would involve changing when the sequence number 
gets
bumped in the first place. If this patch seems to help things, 
then
I'll look at coding that up.
  [...]
   I was able to reproduce this, and I don't think the above patch 
   will
   fix it (at least not completely). The problem seems to be that 
   the NT
   cancel command is screwing up the sequence numbers. We'll have to 
   do
   some research to figure out why that's occurring.
  
  Jeff, we got a bug report in Debian which seems to be the same 
  problem:
  http://bugs.debian.org/695492.  Please cc John Darrah and the bug
  address as above.
  
  Ben.
  
 
 You may want to try this patch. It seems to fix the problem for me, 
 but
 I think there is probably some more work to do in this area.
 
 http://www.spinics.net/lists/linux-cifs/msg07576.html
 

John, you can test this patch by following instructions at
http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official.

Please reply-to-all to Jeff's message when you have a result.

Ben.

   
 
 OK... I built a 3.2.35 kernel with the patch to transport.c 
 and also a 3.7.1 with the patch to smb1ops.c and loaded them 
 into my wheezy VM. I tested both by starting commands to 
 frob the CIFS mounts and then typing a CTRL-C to kill the 
 command, and they were stable (at least 50 attempts using 
 each kernel with the CTRL-C fired at random times into the 
 running command).
 
 But... now another issue affects both kernels. It seems that 
 after 10 to 15 minutes of non use, the mount hangs and the 
 command accessing the mount can only be killed with a 
 SIGKILL... but only sometimes. Sometimes only a reboot 
 would unwedge things.
 
 It seems when the mount would hang, I would get the:
   CIFS VFS: Server amifile01 has not responded in 300 seconds. Reconnecting...
 error except the 3.7 kernel reported 120 seconds instead of 
 the 300 seconds noted above.
 

Interesting, I haven't noticed that issue, but I'll try to reproduce it
when I get a chance.

 Below is one of the kernel logs after I SIGKILL'd things...
 it looks like I triggered a fault of some kind. Maybe it has
 some meaning (this log only happened once).
 

Hmmm...

Looks like a problem in the virtualbox code. Certainly doesn't appear
to be cifs-related. It seems like we saw something similar when all of
the lockless dcache stuff went upstream, so it may be that the vbox
stuff needs to be forward-ported to handle that correctly.

 -- john
 
 
 Jan  7 07:06:34 jax kernel: imklog 5.8.11, log source = /proc/kmsg started.
 Jan  7 07:06:34 jax kernel: [0.00] Initializing cgroup subsys cpuset
 Jan  7 07:06:34 jax kernel: [0.00] Initializing cgroup subsys cpu
 Jan  7 07:06:34 jax kernel: [0.00] Linux version 3.2.0-4-486 
 (debian-ker...@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 
 Debian 3.2.35-2
 
 -a bunch removed-
 
 Jan  7 08:30:31 jax kernel: [   17.072068] eth0: no IPv6 routers present
 Jan  7 08:31:17 jax kernel: [   63.273900] FS-Cache: Netfs 'cifs' registered 
 for caching
 Jan  7 08:31:17 jax kernel: [   63.304164] CIFS VFS: default security 
 mechanism requested.  The default security mechanism will be upgraded from 
 ntlm to ntlmv2 in kernel release 3.3
 Jan  7 08:51:20 jax kernel: [ 1266.602096] CIFS VFS: Server amifile01 has not 
 responded in 300 seconds. Reconnecting...
 Jan  7 08:51:20 jax kernel: [ 1266.602347] CIFS VFS: Server amifile02 has not 
 responded in 300 seconds. Reconnecting...
 Jan  7 09:06:57 jax kernel: [ 2203.298637] 

Bug#695492: CIFS mount fails if I ctrl-c a long-running find process (Linux mounting Windows share)

2013-01-10 Thread John Darrah
On Fri, Jan 04, 2013 at 07:09:33AM -0500, Jeff Layton wrote:
 On Thu, 3 Jan 2013 21:29:22 -0800
 John Darrah xyl...@gmail.com wrote:
 
  On Sat, Dec 29, 2012 at 12:26:07PM +0100, Ben Hutchings wrote:
   On Fri, 2012-12-28 at 22:01 -0500, Jeff Layton wrote:
On Sat, 29 Dec 2012 01:24:36 +0100
Ben Hutchings b...@decadent.org.uk wrote:

 On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:
  On Sun, 23 Dec 2012 09:10:34 -0500
  Jeff Layton jlay...@redhat.com wrote:
 [...]
   I had a look at the code today and suspect that I know what the 
   problem
   is. When the kernel goes to send a request, it first signs it and 
   then
   bumps the sequence numbers that it tracks. If the request doesn't
   actually make it out onto the wire, like when the task catches a
   signal, those sequence numbers remain high even though the request
   didn't go out.
   
   Here's an untested patch that might help tell whether this is the
   case. You may want to try it and see if it does. Note that this 
   fix is
   a bit of a kludge and is not suitable for merging!
   
   A better fix would involve changing when the sequence number gets
   bumped in the first place. If this patch seems to help things, 
   then
   I'll look at coding that up.
 [...]
  I was able to reproduce this, and I don't think the above patch will
  fix it (at least not completely). The problem seems to be that the 
  NT
  cancel command is screwing up the sequence numbers. We'll have to do
  some research to figure out why that's occurring.
 
 Jeff, we got a bug report in Debian which seems to be the same 
 problem:
 http://bugs.debian.org/695492.  Please cc John Darrah and the bug
 address as above.
 
 Ben.
 

You may want to try this patch. It seems to fix the problem for me, but
I think there is probably some more work to do in this area.

http://www.spinics.net/lists/linux-cifs/msg07576.html

   
   John, you can test this patch by following instructions at
   http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official.
   
   Please reply-to-all to Jeff's message when you have a result.
   
   Ben.
   
  

OK... I built a 3.2.35 kernel with the patch to transport.c 
and also a 3.7.1 with the patch to smb1ops.c and loaded them 
into my wheezy VM. I tested both by starting commands to 
frob the CIFS mounts and then typing a CTRL-C to kill the 
command, and they were stable (at least 50 attempts using 
each kernel with the CTRL-C fired at random times into the 
running command).
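
The test loop described above could be automated roughly as follows. This is a sketch under assumptions, not the procedure actually used: interrupt_repeatedly is a hypothetical helper, and the command and delay range are whatever the caller chooses.

```python
import random
import signal
import subprocess
import time


def interrupt_repeatedly(cmd, attempts=50, max_delay=5.0, timeout=30.0, rng=None):
    """Start cmd, send it SIGINT after a random delay, and repeat.

    Returns one exit status per attempt; None means the process did not
    exit within the timeout and was left running for inspection.
    """
    rng = rng or random.Random()
    statuses = []
    for _ in range(attempts):
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        time.sleep(rng.uniform(0.0, max_delay))
        if proc.poll() is None:
            proc.send_signal(signal.SIGINT)  # the signal CTRL-C sends from a terminal
        try:
            statuses.append(proc.wait(timeout=timeout))
        except subprocess.TimeoutExpired:
            statuses.append(None)
    return statuses
```

A call such as interrupt_repeatedly(["find", "/mnt/share"]) would approximate the manual test; /mnt/share is a placeholder for the CIFS mount point. A None in the result marks an attempt where the command did not die after the signal.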

But... now another issue affects both kernels. It seems that
after 10 to 15 minutes of non-use, the mount hangs and the
command accessing the mount can only be killed with a
SIGKILL... but only sometimes. Sometimes only a reboot
would unwedge things.

When the mount hung, I would get this error:
  CIFS VFS: Server amifile01 has not responded in 300 seconds. Reconnecting...
except that the 3.7 kernel reported 120 seconds instead of
the 300 seconds noted above.

Below is one of the kernel logs after I SIGKILL'd things...
it looks like I triggered a fault of some kind. Maybe it has
some meaning (this log only happened once).

-- john


Jan  7 07:06:34 jax kernel: imklog 5.8.11, log source = /proc/kmsg started.
Jan  7 07:06:34 jax kernel: [0.00] Initializing cgroup subsys cpuset
Jan  7 07:06:34 jax kernel: [0.00] Initializing cgroup subsys cpu
Jan  7 07:06:34 jax kernel: [0.00] Linux version 3.2.0-4-486 
(debian-ker...@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 
Debian 3.2.35-2

-a bunch removed-

Jan  7 08:30:31 jax kernel: [   17.072068] eth0: no IPv6 routers present
Jan  7 08:31:17 jax kernel: [   63.273900] FS-Cache: Netfs 'cifs' registered 
for caching
Jan  7 08:31:17 jax kernel: [   63.304164] CIFS VFS: default security mechanism 
requested.  The default security mechanism will be upgraded from ntlm to ntlmv2 
in kernel release 3.3
Jan  7 08:51:20 jax kernel: [ 1266.602096] CIFS VFS: Server amifile01 has not 
responded in 300 seconds. Reconnecting...
Jan  7 08:51:20 jax kernel: [ 1266.602347] CIFS VFS: Server amifile02 has not 
responded in 300 seconds. Reconnecting...
Jan  7 09:06:57 jax kernel: [ 2203.298637] ------------[ cut here ]------------
Jan  7 09:06:57 jax kernel: [ 2203.298645] WARNING: at 
/root/linux-3.2.35/fs/dcache.c:1291 d_set_d_op+0x24/0x85()
Jan  7 09:06:57 jax kernel: [ 2203.298648] Hardware name: VirtualBox
Jan  7 09:06:57 jax kernel: [ 2203.298651] Modules linked in: des_generic ecb 
md4 hmac nls_utf8 cifs vboxsf(O) nfsd nfs nfs_acl auth_rpcgss fscache lockd 
sunrpc loop snd_intel8x0 snd_ac97_codec snd_pcsp snd_pcm snd_page_alloc 
snd_timer psmouse joydev parport_pc parport usbhid snd hid vboxguest(O) evdev 
serio_raw battery ac ac97_bus soundcore button ext4 crc16 jbd2 mbcache sg 
sr_mod sd_mod 

Bug#695492: CIFS mount fails if I ctrl-c a long-running find process (Linux mounting Windows share)

2012-12-28 Thread Ben Hutchings
On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:
 On Sun, 23 Dec 2012 09:10:34 -0500
 Jeff Layton jlay...@redhat.com wrote:
[...]
  I had a look at the code today and suspect that I know what the problem
  is. When the kernel goes to send a request, it first signs it and then
  bumps the sequence numbers that it tracks. If the request doesn't
  actually make it out onto the wire, like when the task catches a
  signal, those sequence numbers remain high even though the request
  didn't go out.
  
  Here's an untested patch that might help tell whether this is the
  case. You may want to try it and see if it does. Note that this fix is
  a bit of a kludge and is not suitable for merging!
  
  A better fix would involve changing when the sequence number gets
  bumped in the first place. If this patch seems to help things, then
  I'll look at coding that up.
[...]
 I was able to reproduce this, and I don't think the above patch will
 fix it (at least not completely). The problem seems to be that the NT
 cancel command is screwing up the sequence numbers. We'll have to do
 some research to figure out why that's occurring.
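
The failure mode described in the first quoted paragraph can be illustrated with a toy model. This is only a sketch of the idea, not the cifs code: the Client and Server classes are hypothetical, and the sequence number advances by one per request purely for simplicity.

```python
class Server:
    """Toy server that accepts a request only if it carries the expected number."""

    def __init__(self):
        self.expected = 0

    def receive(self, seq):
        if seq != self.expected:
            return False  # signature check would fail: numbers out of step
        self.expected += 1
        return True


class Client:
    """Toy client that can bump its sequence number before or after sending."""

    def __init__(self, server, bump_before_send):
        self.server = server
        self.seq = 0
        self.bump_before_send = bump_before_send

    def request(self, interrupted=False):
        seq = self.seq  # number used to "sign" this request
        if self.bump_before_send:
            self.seq += 1  # bumped even if the request never leaves
        if interrupted:
            return None  # signal caught; nothing went onto the wire
        accepted = self.server.receive(seq)
        if not self.bump_before_send:
            self.seq += 1  # bumped only once the request was really sent
        return accepted


if __name__ == "__main__":
    for bump_first in (True, False):
        client = Client(Server(), bump_before_send=bump_first)
        client.request()
        client.request(interrupted=True)
        print(bump_first, client.request())
```

With the bump done before sending, one interrupted request leaves the client a number ahead of the server and the following request is rejected; with the bump deferred until after the send, the two stay in step. The later finding in this thread, that the NT cancel command also disturbs the numbers, is not modelled here.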

Jeff, we got a bug report in Debian which seems to be the same problem:
http://bugs.debian.org/695492.  Please cc John Darrah and the bug
address as above.

Ben.

-- 
Ben Hutchings
It is easier to change the specification to fit the program than vice versa.




Bug#695492: CIFS mount fails if I ctrl-c a long-running find process (Linux mounting Windows share)

2012-12-28 Thread Jeff Layton
On Sat, 29 Dec 2012 01:24:36 +0100
Ben Hutchings b...@decadent.org.uk wrote:

 On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:
  On Sun, 23 Dec 2012 09:10:34 -0500
  Jeff Layton jlay...@redhat.com wrote:
 [...]
   I had a look at the code today and suspect that I know what the problem
   is. When the kernel goes to send a request, it first signs it and then
   bumps the sequence numbers that it tracks. If the request doesn't
   actually make it out onto the wire, like when the task catches a
   signal, those sequence numbers remain high even though the request
   didn't go out.
   
   Here's an untested patch that might help tell whether this is the
   case. You may want to try it and see if it does. Note that this fix is
   a bit of a kludge and is not suitable for merging!
   
   A better fix would involve changing when the sequence number gets
   bumped in the first place. If this patch seems to help things, then
   I'll look at coding that up.
 [...]
  I was able to reproduce this, and I don't think the above patch will
  fix it (at least not completely). The problem seems to be that the NT
  cancel command is screwing up the sequence numbers. We'll have to do
  some research to figure out why that's occurring.
 
 Jeff, we got a bug report in Debian which seems to be the same problem:
 http://bugs.debian.org/695492.  Please cc John Darrah and the bug
 address as above.
 
 Ben.
 

You may want to try this patch. It seems to fix the problem for me, but
I think there is probably some more work to do in this area.

http://www.spinics.net/lists/linux-cifs/msg07576.html

-- 
Jeff Layton jlay...@redhat.com

