On Sat, 12 Jan 2013 10:28:01 -0800
John Darrah <xyl...@gmail.com> wrote:

> On Fri, Jan 11, 2013 at 08:27:16AM -0500, Jeff Layton wrote:
> > On Thu, 10 Jan 2013 20:29:43 -0800
> > John Darrah <xyl...@gmail.com> wrote:
> > 
> > > On Fri, Jan 04, 2013 at 07:09:33AM -0500, Jeff Layton wrote:
> > > > On Thu, 3 Jan 2013 21:29:22 -0800
> > > > John Darrah <xyl...@gmail.com> wrote:
> > > > 
> > > > > On Sat, Dec 29, 2012 at 12:26:07PM +0100, Ben Hutchings wrote:
> > > > > > On Fri, 2012-12-28 at 22:01 -0500, Jeff Layton wrote:
> > > > > > > On Sat, 29 Dec 2012 01:24:36 +0100
> > > > > > > Ben Hutchings <b...@decadent.org.uk> wrote:
> > > > > > > 
> > > > > > > > On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:
> > > > > > > > > On Sun, 23 Dec 2012 09:10:34 -0500
> > > > > > > > > Jeff Layton <jlay...@redhat.com> wrote:
> > > > > > > > [...]
> > > > > > > > > > > I had a look at the code today and suspect that I know what the problem
> > > > > > > > > > > is. When the kernel goes to send a request, it first signs it and then
> > > > > > > > > > > bumps the sequence numbers that it tracks. If the request doesn't
> > > > > > > > > > > actually make it out onto the wire, like when the task catches a
> > > > > > > > > > > signal, those sequence numbers remain high even though the request
> > > > > > > > > > > didn't go out.
> > > > > > > > > > > 
> > > > > > > > > > > Here's an untested patch that might help tell whether this is the
> > > > > > > > > > > case. You may want to try it and see if it does. Note that this fix is
> > > > > > > > > > > a bit of a kludge and is not suitable for merging!
> > > > > > > > > > > 
> > > > > > > > > > > A better fix would involve changing when the sequence number gets
> > > > > > > > > > > bumped in the first place. If this patch seems to help things, then
> > > > > > > > > > > I'll look at coding that up.
> > > > > > > > [...]
> > > > > > > > > > I was able to reproduce this, and I don't think the above patch will
> > > > > > > > > > fix it (at least not completely). The problem seems to be that the NT
> > > > > > > > > > cancel command is screwing up the sequence numbers. We'll have to do
> > > > > > > > > > some research to figure out why that's occurring.
> > > > > > > > 
> > > > > > > > > Jeff, we got a bug report in Debian which seems to be the same problem:
> > > > > > > > > <http://bugs.debian.org/695492>.  Please cc John Darrah and the bug
> > > > > > > > > address as above.
> > > > > > > > 
> > > > > > > > Ben.
> > > > > > > > 
> > > > > > > 
> > > > > > > > You may want to try this patch. It seems to fix the problem for me, but
> > > > > > > > I think there is probably some more work to do in this area.
> > > > > > > 
> > > > > > > http://www.spinics.net/lists/linux-cifs/msg07576.html
> > > > > > > 
> > > > > > 
> > > > > > John, you can test this patch by following instructions at
> > > > > > <http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official>.
> > > > > > 
> > > > > > Please reply-to-all to Jeff's message when you have a result.
> > > > > > 
> > > > > > Ben.
> > > > > > 
> > > > > 
> > > 
> > > OK... I built a 3.2.35 kernel with the patch to transport.c 
> > > and also a 3.7.1 with the patch to smb1ops.c and loaded them 
> > > into my wheezy VM. I tested both by starting commands to 
> > > frob the CIFS mounts and then typing a CTRL-C to kill the 
> > > command, and they were stable (at least 50 attempts using 
> > > each kernel with the CTRL-C fired at random times into the 
> > > running command).
> > > 
> > > But... now another issue affects both kernels. It seems that 
> > > after 10 to 15 minutes of non use, the mount hangs and the 
> > > command accessing the mount can only be killed with a 
> > > SIGKILL... but only sometimes. Sometimes only a reboot 
> > > would unwedge things.
> > > 
> > > It seems when the mount would hang, I would get the:
> > >   CIFS VFS: Server amifile01 has not responded in 300 seconds. Reconnecting...
> > > error except the 3.7 kernel reported 120 seconds instead of 
> > > the 300 seconds noted above.
> > > 
> > 
> > Interesting, I haven't noticed that issue, but I'll try to reproduce it
> > when I get a chance.
> > 
> 
> Is there a command or kernel magic that can force a dump to 
> see where the contention is that is causing the hang?
> 
> Also, I just tried starting the VM and mounting the CIFS 
> drives and then just letting it sit there without running 
> anything to touch the drives.... they still hang. So this 
> means the CTRL-C thing has nothing to do with it.
> 

Ok, so it sounds like the original bug is now fixed with the patch I
proposed. This other thing sounds like it warrants a new bug. When you
say it hangs, does the whole box hang or is it just processes that
touch the cifs mount?

If you know the pid of the hung process, you can look at
/proc/<pid>/stack to see what it's doing. There are also things like
sysrq-t. You can also set up kdump and force a crash on a machine to
get a coredump, and then try to analyze it to figure out why it's hung.
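
If the pid isn't known up front, something along these lines might help
find it. This is only a rough sketch, not a tested tool: it assumes a
Linux /proc, that the hung tasks show up in uninterruptible sleep (state
"D" in /proc/<pid>/stat), and that /proc/<pid>/stack is readable, which
normally needs root and a kernel built with stack tracing. The function
names are made up for illustration.

```python
import os


def parse_stat_state(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = stat_line.index("(")
    rpar = stat_line.rindex(")")
    comm = stat_line[lpar + 1:rpar]
    state = stat_line[rpar + 1:].split()[0]
    return comm, state


def find_blocked_tasks(proc_root="/proc"):
    """Return [(pid, comm)] for tasks in uninterruptible sleep (state D)."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat_state(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            blocked.append((int(entry), comm))
    return blocked


def read_kernel_stack(pid, proc_root="/proc"):
    """Return the text of /proc/<pid>/stack, or None if it can't be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "stack")) as f:
            return f.read()
    except OSError:
        return None  # not root, process gone, or file not provided
```

Calling find_blocked_tasks() as root while the mount is wedged and then
read_kernel_stack(pid) for each result should show where each blocked
task is sitting in the kernel.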

-- 
Jeff Layton <jlay...@redhat.com>

