On Fri, 17 Apr 2009, Marc Dionne wrote:

On Fri, Apr 17, 2009 at 2:13 AM, Felix Frank <[email protected]> wrote:
With the fix above, my larger mmap test quickly runs into a deadlock
again. Looks like cache_write_pages is trying to lock the page that is
currently being written:

I think I just reproduced :/

Guess we're back to square one then. I posted a hack to RT #124627 yesterday
that does prevent deadlock, but apparently much data won't ever get written
to the cache and mmap_test reports corruptions (gets lots of 0s). So what to
do instead of osi_VM_StoreAllSegments() during partial writes?

Regards
 - Felix

I tried something almost identical yesterday (but I was lazier and
just used the AFS_VMSYNC_INVAL flag), and it does prevent the
deadlock.

I don't see any corruption however - could you post your test case
here (I had trouble getting it from AFS), and also, can you give a bit
more info - cache size, memory size, etc.  Kernel versions might make
a difference - I'm testing with latest 2.6.30-rc.

I had my little test program (misbehave.c) attached to my mail from 09:08:39 +0200. It's not the one that causes trouble with this workaround (or did I say it does? Sorry, I'm starting to get a little confused). On the test system I used, the problems start once mmap_test has written around 600MB or so (64MB disk cache). The same happens with vanilla 1.4.10, but I had a feeling that it happens "sooner" with the workaround in place.

If my theory is anywhere near correct, the kernel version should not be important (but that's a big if). The problem is that I'm in a RHEL environment and our kernels are pimped up 2.6.18s (that is, Red Hat backported lots of features from newer kernels). I can't easily test other versions.

I'm currently trying to home in on the reason why Derrick's anti-recursion patch still causes a deadlock. Here is what misbehave.c keeps doing:

Pid: 1543, comm: afs_background Tainted: P      2.6.18-128.1.6.el5xen #1
RIP: e030:[<ffffffff802181d0>]  [<ffffffff802181d0>] unlock_page+0xf/0x2f
...
Call Trace:
 [<ffffffff8021cf0e>] mpage_writepages+0x21d/0x34d
 [<ffffffff881adc32>] :libafs:afs_linux_writepage+0x0/0x83
 [<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff80315034>] avc_has_perm_noaudit+0x208/0x36b
 [<ffffffff8028ba68>] printk+0x52/0xc6
 [<ffffffff80286721>] enqueue_task+0x41/0x56
 [<ffffffff8028678c>] __activate_task+0x56/0x6d
 [<ffffffff881b28b0>] :libafs:afsd_launcher+0x0/0x2c
 [<ffffffff8025c9fb>] do_writepages+0x29/0x2f
 [<ffffffff80250ee9>] __filemap_fdatawrite_range+0x50/0x5b
...

That happens some time after afs_linux_writepage_sync() returns because of the CPageWrite flag.

 - Felix
