On Mon, 27 Apr 2009, Felix Frank wrote:
re-entry into either writepage or entry in osi_VM_StoreAllSegments for
the same file if it is set. This looks sound. The net effect differs
from Chaskiel's suggestion in that it 1) disables
osi_VM_StoreAllSegments on the same file for callers other than
doPartialWrite (probably a good idea), and 2) prevents concurrent
writepage calls within the same file (which might already be the
case).
Issues that remain:
- I think Felix still sees some deadlocks and data inconsistencies
with 2.6.18, but I can't reproduce with 2.6.29 or 2.6.30
There is reproducible data loss during the mmap test, its amount being
dependent (linearly, it appears) on the size of physical memory. Corruption
seems to start right above 1/3 memory size.
Deadlocks still appear to occur above 1/2 memory size.
I've done a bunch of tests. The results are too confusing for me to even
bother you with the plots. Bottom line:
- working with mmap on a file that is smaller than the disk cache always works
- as soon as the file size exceeds the cache size, the cache gets junked
(from earlier observations, I guess the mmap_test prog reads 0s where
data should be)
- larger physical memory seems to improve the chances of reading sound data,
but only statistically speaking. The numbers are quite inconsistent
in that respect.
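
For reference, the kind of check described above can be sketched roughly as
follows. This is not the actual mmap_test program: the function names and
the per-page tag scheme are assumptions made purely for illustration. It
writes a nonzero pattern through a shared mapping and then re-reads the file
with plain read(), counting pages that no longer hold their pattern (for
instance, pages that come back as 0s).

```python
import mmap

PAGE = mmap.PAGESIZE

def write_pattern(path, n_pages):
    """Fill the file through a shared mapping (hypothetical helper).

    Each page gets a nonzero tag byte, so zero-filled pages stand out later.
    """
    size = n_pages * PAGE
    with open(path, "w+b") as f:
        f.truncate(size)                      # extend file to the full size
        with mmap.mmap(f.fileno(), size) as m:
            for i in range(n_pages):
                tag = (i % 255) + 1           # 1..255, never 0
                m[i * PAGE:(i + 1) * PAGE] = bytes([tag]) * PAGE
            m.flush()                         # push dirty pages to the file

def count_bad_pages(path, n_pages):
    """Re-read with ordinary read() and count pages that lost their pattern."""
    bad = 0
    with open(path, "rb") as f:
        for i in range(n_pages):
            tag = (i % 255) + 1
            if f.read(PAGE) != bytes([tag]) * PAGE:
                bad += 1
    return bad
```

To exercise the case at hand, one would point path at a file in AFS and pick
n_pages so that the file is larger than the disk cache. A nonzero result from
count_bad_pages would then correspond to the corruption described here.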
All these tests were done with a 1.4.10 client patched with
http://rt.central.org/rt/Ticket/Attachment/414217/450599/antirec-fix.patch
I doubt that vanilla 1.4.10 would fare any differently (if not
statistically worse, but what does it matter?). Early attempts with it
yielded similar levels of corruption.
The test program will still deadlock, by the way. There appears to be no
fixed minimum file size to make it happen, but during tests, the range of
"safe" file sizes appeared to grow in relation to the physical memory of
the client.
Traces of the usual deadlocked suspects are attached. At that point, just
about any process can deadlock, I suppose. Apparently, the system ceases
to balance dirty pages (which appears plausible to me, but I have no
experience with virtual memory implementations whatsoever).
Leaving writepage prematurely seems to be unsound after all. Derrick
suggested earlier (RT 120491) that VM handling for Linux is crooked, hence
this whole issue.
Is that still true? Will this not be addressed in 1.4.x?
Cheers
- Felix
pdflush D ffff8800020c5460 0 80 7 81 24 (L-TLB)
ffff88003fa59510 0000000000000246 ffff880001e08340 ffff880001612380
000000000000000a ffff88003fa327e0 ffff88003ff20860 0000000000000984
ffff88003fa329c8 ffff88003fa20040
Call Trace:
[<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
[<ffffffff8026e733>] do_gettimeofday+0x1f8/0x213
[<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
[<ffffffff8023f00c>] lock_timer_base+0x1b/0x3c
[<ffffffff8021cb5e>] __mod_timer+0xb0/0xbe
[<ffffffff8026277c>] schedule_timeout+0x8a/0xad
[<ffffffff80291141>] process_timeout+0x0/0x5
[<ffffffff802620ed>] io_schedule_timeout+0x4b/0x78
[<ffffffff8023c55e>] blk_congestion_wait+0x67/0x81
[<ffffffff80299fec>] autoremove_wake_function+0x0/0x2e
[<ffffffff80252c9f>] writeback_inodes+0xa8/0xd8
[<ffffffff802bde60>] balance_dirty_pages_ratelimited_nr+0x17d/0x1fa
[<ffffffff8021090a>] generic_file_buffered_write+0x527/0x645
[<ffffffff8020e3d2>] current_fs_time+0x3b/0x40
[<ffffffff80315c2f>] avc_has_perm+0x43/0x55
[<ffffffff8021685c>] __generic_file_aio_write_nolock+0x36c/0x3b8
[<ffffffff80221a14>] generic_file_aio_write+0x65/0xc1
[<ffffffff8804b1a2>] :ext3:ext3_file_write+0x16/0x91
[<ffffffff80218184>] do_sync_write+0xc7/0x104
[<ffffffff80299fec>] autoremove_wake_function+0x0/0x2e
[<ffffffff802629d6>] mutex_lock+0xd/0x1d
[<ffffffff80214210>] generic_file_llseek+0x7f/0x8b
[<ffffffff881a9d76>] :libafs:osi_rdwr+0xeb/0x151
[<ffffffff8818c573>] :libafs:afs_UFSWrite+0x5d0/0x84b
[<ffffffff881abc2c>] :libafs:afs_linux_writepage_sync+0x253/0x3da
[<ffffffff881adbfa>] :libafs:afs_linux_writepage+0x61/0x8a
[<ffffffff8021ce9c>] mpage_writepages+0x1ab/0x34d
[<ffffffff881adb99>] :libafs:afs_linux_writepage+0x0/0x8a
[<ffffffff8025c9fb>] do_writepages+0x29/0x2f
[<ffffffff80230c5b>] __writeback_single_inode+0x1ae/0x328
[<ffffffff802b3a73>] delayacct_end+0x5d/0x86
[<ffffffff8022120d>] sync_sb_inodes+0x1a9/0x267
[<ffffffff80299dd4>] keventd_create_kthread+0x0/0xc4
[<ffffffff80252c79>] writeback_inodes+0x82/0xd8
[<ffffffff802bdf62>] background_writeout+0x85/0xb8
[<ffffffff8025801c>] pdflush+0x0/0x207
[<ffffffff80258175>] pdflush+0x159/0x207
[<ffffffff802bdedd>] background_writeout+0x0/0xb8
[<ffffffff80233483>] kthread+0xfe/0x132
[<ffffffff8025fb2c>] child_rip+0xa/0x12
[<ffffffff80299dd4>] keventd_create_kthread+0x0/0xc4
[<ffffffff8026df02>] monotonic_clock+0x35/0x7b
[<ffffffff80233385>] kthread+0x0/0x132
[<ffffffff8025fb22>] child_rip+0x0/0x12
afsd D ffff8800020c5460 0 1537 1 1539 1535 (L-TLB)
ffff880031013880 0000000000000246 000000000002be5d 0000000000000246
000000000000000a ffff88003fa20040 ffff88003fa327e0 0000000000000dbe
ffff88003fa20228 ffffffff804e0a80
Call Trace:
[<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
[<ffffffff8026e733>] do_gettimeofday+0x1f8/0x213
[<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
[<ffffffff8023f00c>] lock_timer_base+0x1b/0x3c
[<ffffffff8021cb5e>] __mod_timer+0xb0/0xbe
[<ffffffff8026277c>] schedule_timeout+0x8a/0xad
[<ffffffff80291141>] process_timeout+0x0/0x5
[<ffffffff802620ed>] io_schedule_timeout+0x4b/0x78
[<ffffffff8023c55e>] blk_congestion_wait+0x67/0x81
[<ffffffff80299fec>] autoremove_wake_function+0x0/0x2e
[<ffffffff80252c9f>] writeback_inodes+0xa8/0xd8
[<ffffffff802bde60>] balance_dirty_pages_ratelimited_nr+0x17d/0x1fa
[<ffffffff8021090a>] generic_file_buffered_write+0x527/0x645
[<ffffffff8022eca0>] __wake_up+0x38/0x4f
[<ffffffff80207141>] kmem_cache_free+0x80/0xd3
[<ffffffff880317ae>] :jbd:journal_stop+0x1f3/0x1ff
[<ffffffff8020e3d2>] current_fs_time+0x3b/0x40
[<ffffffff88054977>] :ext3:__ext3_journal_stop+0x1f/0x3d
[<ffffffff8021685c>] __generic_file_aio_write_nolock+0x36c/0x3b8
[<ffffffff80221a1f>] generic_file_aio_write+0x70/0xc1
[<ffffffff80221a14>] generic_file_aio_write+0x65/0xc1
[<ffffffff8804b1a2>] :ext3:ext3_file_write+0x16/0x91
[<ffffffff80218184>] do_sync_write+0xc7/0x104
[<ffffffff80299fec>] autoremove_wake_function+0x0/0x2e
[<ffffffff802629d6>] mutex_lock+0xd/0x1d
[<ffffffff80214210>] generic_file_llseek+0x7f/0x8b
[<ffffffff881a9d76>] :libafs:osi_rdwr+0xeb/0x151
[<ffffffff881a971b>] :libafs:afs_osi_Write+0xe2/0x16f
[<ffffffff881690cf>] :libafs:afs_WriteDCache+0x92/0xa7
[<ffffffff8816ab2c>] :libafs:afs_WriteThroughDSlots+0x1e2/0x309
[<ffffffff88168172>] :libafs:afs_Daemon+0x18a/0x474
[<ffffffff881b281c>] :libafs:afsd_launcher+0x0/0x2c
[<ffffffff881b2a56>] :libafs:afsd_thread+0x20e/0x6f7
[<ffffffff8025fb2c>] child_rip+0xa/0x12
[<ffffffff881b281c>] :libafs:afsd_launcher+0x0/0x2c
[<ffffffff8026df02>] monotonic_clock+0x35/0x7b
[<ffffffff881b2848>] :libafs:afsd_thread+0x0/0x6f7
[<ffffffff8025fb22>] child_rip+0x0/0x12
mmap_test_tem D ffff8800020c5460 0 2042 2041 (NOTLB)
ffff88003617dbe8 0000000000000282 0000000000000246 000000000000000a
0000000000000009 ffff88003ff20860 ffffffff804e0a80 0000000000000aaf
ffff88003ff20a48 ffff88003fa327e0
Call Trace:
[<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
[<ffffffff8026e733>] do_gettimeofday+0x1f8/0x213
[<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
[<ffffffff8023f00c>] lock_timer_base+0x1b/0x3c
[<ffffffff8021cb5e>] __mod_timer+0xb0/0xbe
[<ffffffff8026277c>] schedule_timeout+0x8a/0xad
[<ffffffff80291141>] process_timeout+0x0/0x5
[<ffffffff802620ed>] io_schedule_timeout+0x4b/0x78
[<ffffffff8023c55e>] blk_congestion_wait+0x67/0x81
[<ffffffff80299fec>] autoremove_wake_function+0x0/0x2e
[<ffffffff80252c9f>] writeback_inodes+0xa8/0xd8
[<ffffffff802bde60>] balance_dirty_pages_ratelimited_nr+0x17d/0x1fa
[<ffffffff80211ad2>] do_wp_page+0x66f/0x6a3
[<ffffffff80209ac4>] __handle_mm_fault+0x114b/0x11f6
[<ffffffff8020622a>] hypercall_page+0x22a/0x1000
[<ffffffff802639f9>] _spin_lock_irqsave+0x9/0x14
[<ffffffff802666ef>] do_page_fault+0xf7b/0x12e0
[<ffffffff8026df02>] monotonic_clock+0x35/0x7b
[<ffffffff80261e83>] thread_return+0x6c/0x113
[<ffffffff8025f82b>] error_exit+0x0/0x6e
[<ffffffff8025f82b>] error_exit+0x0/0x6e