On Fri, 17 Apr 2009, Marc Dionne wrote:
On Fri, Apr 17, 2009 at 2:13 AM, Felix Frank <[email protected]> wrote:
With the fix above, my larger mmap test quickly runs into a deadlock
again. Looks like cache_write_pages is trying to lock the page that is
currently being written:
I think I just reproduced it :/
Guess we're back to square one then. I posted a hack to RT #124627 yesterday
that does prevent deadlock, but apparently much data won't ever get written
to the cache and mmap_test reports corruptions (gets lots of 0s). So what to
do instead of osi_VM_StoreAllSegments() during partial writes?
Regards
- Felix
I tried something almost identical yesterday (but I was lazier and
just used the AFS_VMSYNC_INVAL flag), and it does prevent the
deadlock.
I don't see any corruption however - could you post your test case
here (I had trouble getting it from AFS), and also, can you give a bit
more info - cache size, memory size, etc. Kernel versions might make
a difference - I'm testing with the latest 2.6.30-rc.
I had my little test program (misbehave.c) attached to my mail from
09:08:39 +0200. It's not the one that causes trouble with this workaround
(or did I say it does? Sorry, I'm starting to get a little confused.)
On the test system I used, the problems start with mmap_test writing
around 600MB or so (64MB disk cache).
It does the same with vanilla 1.4.10, but I had a feeling that it happens
"sooner" with the workaround in place.
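For anyone who cannot get at mmap_test, the kind of check that would expose the "lots of 0s" symptom can be sketched roughly as below. This is not mmap_test and not misbehave.c; the function names and the byte pattern are made up for illustration. The idea is only to write a pattern that never contains a zero byte through a shared mapping, then re-read the file with read(2) and count bytes that come back wrong or zero.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical pattern: the byte expected at a given file offset.
 * It is never zero, so zero-filled regions stand out as lost writes. */
unsigned char pattern_byte(size_t off)
{
    return (unsigned char)(1 + off % 251);
}

/* Fill `len` bytes of `path` with the pattern through a MAP_SHARED mapping.
 * Returns 0 on success, -1 on any error. */
int mmap_write_pattern(const char *path, size_t len)
{
    assert(path != NULL);
    if (len == 0)
        return -1;                      /* mmap rejects zero-length maps */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    for (size_t i = 0; i < len; i++)
        p[i] = pattern_byte(i);
    int rc = msync(p, len, MS_SYNC);    /* push dirty pages to the file */
    if (munmap(p, len) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Re-read the file with read(2). Returns the number of bytes that do not
 * match the pattern (or -1 on error) and stores the number of zero bytes
 * seen in *zeros. */
long count_bad_bytes(const char *path, size_t len, long *zeros)
{
    assert(path != NULL && zeros != NULL);
    unsigned char buf[4096];
    size_t off = 0;
    long bad = 0;
    *zeros = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while (off < len) {
        size_t want = len - off < sizeof buf ? len - off : sizeof buf;
        ssize_t n = read(fd, buf, want);
        if (n <= 0) {                   /* error or unexpected end of file */
            close(fd);
            return -1;
        }
        for (ssize_t i = 0; i < n; i++) {
            if (buf[i] == 0)
                (*zeros)++;
            if (buf[i] != pattern_byte(off + (size_t)i))
                bad++;
        }
        off += (size_t)n;
    }
    close(fd);
    return bad;
}
```

To get anywhere near the conditions described here, the file would presumably have to live in AFS and `len` would have to be well above the disk cache size (64MB on my test system). On a local filesystem the check should simply report no bad bytes.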
If my theory is anywhere near correct, kernel version should not be
important (but that's a big if).
The problem is that I'm in a RHEL environment and our kernels are pimped
up 2.6.18s (that is, RedHat ported lots of features from newer kernels
back). I can't easily test other versions.
I'm currently trying to home in on the reason for Derrik's antirecursion
patch still causing deadlock. Here is what misbehave.c keeps doing:
Pid: 1543, comm: afs_background Tainted: P 2.6.18-128.1.6.el5xen #1
RIP: e030:[<ffffffff802181d0>] [<ffffffff802181d0>] unlock_page+0xf/0x2f
...
Call Trace:
[<ffffffff8021cf0e>] mpage_writepages+0x21d/0x34d
[<ffffffff881adc32>] :libafs:afs_linux_writepage+0x0/0x83
[<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
[<ffffffff80315034>] avc_has_perm_noaudit+0x208/0x36b
[<ffffffff8028ba68>] printk+0x52/0xc6
[<ffffffff80286721>] enqueue_task+0x41/0x56
[<ffffffff8028678c>] __activate_task+0x56/0x6d
[<ffffffff881b28b0>] :libafs:afsd_launcher+0x0/0x2c
[<ffffffff8025c9fb>] do_writepages+0x29/0x2f
[<ffffffff80250ee9>] __filemap_fdatawrite_range+0x50/0x5b
...
That happens some time after afs_linux_writepage_sync() returns because of
the CPageWrite flag.
- Felix
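As a rough illustration of the early return mentioned above, a recursion guard of this general shape can be sketched in plain userspace C. This is not the OpenAFS code and not the actual patch: the struct, the flag name and both functions are hypothetical stand-ins, and only the pattern (set a flag while a page write is in progress, bail out if the flag is already set) is taken from the discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a CPageWrite-style state bit. */
enum { DEMO_PAGE_WRITE = 0x1 };

/* Hypothetical stand-in for the per-file state. */
struct demo_vnode {
    unsigned flags;       /* DEMO_PAGE_WRITE is set while a write is active */
    int stores_started;   /* how often the store path ran */
    int writes_skipped;   /* how often a re-entrant call bailed out */
};

int demo_writepage(struct demo_vnode *v);

/* Plays the role of a store operation that flushes dirty pages and
 * thereby calls back into the writepage path. */
static void demo_store_segments(struct demo_vnode *v)
{
    v->stores_started++;
    demo_writepage(v);    /* re-entry: the guard below must catch this */
}

/* Returns 0 if the write ran, 1 if the call was re-entrant and skipped. */
int demo_writepage(struct demo_vnode *v)
{
    assert(v != NULL);
    if (v->flags & DEMO_PAGE_WRITE) {
        v->writes_skipped++;
        return 1;         /* already writing a page: do not recurse */
    }
    v->flags |= DEMO_PAGE_WRITE;
    demo_store_segments(v);
    v->flags &= ~(unsigned)DEMO_PAGE_WRITE;
    return 0;
}
```

Calling `demo_writepage()` once on a zeroed `struct demo_vnode` runs the store path once and skips the nested call once. The guard only stops the re-entry; what becomes of the page whose write was skipped is outside this sketch.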