Re: GPF in aio_migratepage
Kristian Nielsen writes: > Benjamin LaHaise writes: > >> Linus just pushed out 3.13-rc5 that has changes to aio_migratepage() that >> should make it much more robust, as well as other fixes. Can you please >> give it a spin as well and let me know if it works? Thanks a bunch! > > Ok, will do. JFYI, I have been running -rc5 (and later -rc7) for a month now without seeing anything like this again. So hopefully the issue is solved. - Kristian. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: GPF in aio_migratepage
Kristian Nielsen kniel...@knielsen-hq.org writes: Benjamin LaHaise b...@kvack.org writes: Linus just pushed out 3.13-rc5 that has changes to aio_migratepage() that should make it much more robust, as well as other fixes. Can you please give it a spin as well and let me know if it works? Thanks a bunch! Ok, will do. JFYI, I have been running -rc5 (and later -rc7) for a month now without seeing anything like this again. So hopefully the issue is solved. - Kristian. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: GPF in aio_migratepage
Benjamin LaHaise writes: > Linus just pushed out 3.13-rc5 that has changes to aio_migratepage() that > should make it much more robust, as well as other fixes. Can you please > give it a spin as well and let me know if it works? Thanks a bunch! Ok, will do. - Kristian. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: GPF in aio_migratepage
Gu Zheng writes: > This issue seems like a problem that has been fixed yet: > http://article.gmane.org/gmane.linux.kernel.aio.general/3741/match=potential+use+after+free+aio%5fmigratepage > commit 5e9ae2e5da0beb93f8557fc92a8f4fbc05ea448f > aio: fix use-after-free in aio_migratepage > So I think maybe you can run with latest Linus' tree or 3.13-rc4 to > check whether this issue still appears. Hm. I checked that thread, and as far as I can see, that patch was already included in the tree I hit the BUG in (3.13-rc1): http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5e9ae2e5da0beb93f8557fc92a8f4fbc05ea448f There are other changes in that area since 3.13-rc1 though. Anyway, I am now running with 3.13-rc4 and will report if I see anything. Given that I do not have any way to reproduce (I only ever saw this once), this seems the best that can be done for now. Thanks for following up on this! - Kristian. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: GPF in aio_migratepage
Gu Zheng guz.f...@cn.fujitsu.com writes: This issue seems like a problem that has been fixed yet: http://article.gmane.org/gmane.linux.kernel.aio.general/3741/match=potential+use+after+free+aio%5fmigratepage commit 5e9ae2e5da0beb93f8557fc92a8f4fbc05ea448f aio: fix use-after-free in aio_migratepage So I think maybe you can run with latest Linus' tree or 3.13-rc4 to check whether this issue still appears. Hm. I checked that thread, and as far as I can see, that patch was already included in the tree I hit the BUG in (3.13-rc1): http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5e9ae2e5da0beb93f8557fc92a8f4fbc05ea448f There are other changes in that area since 3.13-rc1 though. Anyway, I am now running with 3.13-rc4 and will report if I see anything. Given that I do not have any way to reproduce (I only ever saw this once), this seems the best that can be done for now. Thanks for following up on this! - Kristian. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: GPF in aio_migratepage
Benjamin LaHaise b...@kvack.org writes: Linus just pushed out 3.13-rc5 that has changes to aio_migratepage() that should make it much more robust, as well as other fixes. Can you please give it a spin as well and let me know if it works? Thanks a bunch! Ok, will do. - Kristian. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: GPF in aio_migratepage
What is the status of this? If I understand correctly, the crash I saw is different from what Dave saw. There was one patched scheduled for inclusion that fixes Dave's crash. But what about mine? I have been running 3.13-rc2 for a couple of weeks now with your other patch, without seeing it again, which suggests it has helped. But it seems that patch has a locking bug as described by Dave (sleeping under spinlock)? So this appears unsolved as of yet... So I just wanted to check that this was not forgotten. Is there something I can do to help get this sorted out? Should I try to run with unpatched -rc4 for some time to check if it appears again? Anything else? - Kristian. Dave Jones writes: > On Mon, Dec 02, 2013 at 06:10:46PM +0800, Gu Zheng wrote: > > Hi Kristian, Dave, > > > > Could you please help to check whether the following patch can fix this > issue? > > This introduces some locking bugs.. > > > [ 222.327950] BUG: sleeping function called from invalid context at > kernel/locking/mutex.c:616 > [ 222.328004] in_atomic(): 1, irqs_disabled(): 0, pid: 12794, name: > trinity-child1 > [ 222.328044] 1 lock held by trinity-child1/12794: > [ 222.328072] #0: (&(>private_lock)->rlock){+.+...}, at: > [] aio_free_ring+0x44/0x160 > [ 222.328147] CPU: 1 PID: 12794 Comm: trinity-child1 Not tainted 3.13.0-rc2+ > #12 > [ 222.328268] 0268 880229517d68 8173bc52 > > [ 222.328320] 880229517d90 8108ad95 880223b6acd0 > > [ 222.328370] 880229517e08 81741cf3 > 880229517dc0 > [ 222.328421] Call Trace: > [ 222.328443] [] dump_stack+0x4e/0x7a > [ 222.328475] [] __might_sleep+0x175/0x200 > [ 222.328510] [] mutex_lock_nested+0x33/0x400 > [ 222.328545] [] unmap_mapping_range+0x68/0x170 > [ 222.328582] [] truncate_pagecache+0x35/0x60 > [ 222.328617] [] truncate_setsize+0x12/0x20 > [ 222.328651] [] aio_free_ring+0x99/0x160 > [ 222.328684] [] SyS_io_setup+0xef1/0xf00 > [ 222.328717] [] tracesys+0xdd/0xe2 > > [ 222.328769] == > [ 222.328804] [ INFO: possible circular locking dependency detected ] > [ 222.328838] 3.13.0-rc2+ #12 Not tainted > [ 222.328862] --- > [ 222.328896] trinity-child1/12794 is trying to acquire lock: > [ 222.328928] (>i_mmap_mutex){+.+...}, at: [] > unmap_mapping_range+0x68/0x170 > [ 222.328987] > but task is already holding lock: > [ 222.329020] (&(>private_lock)->rlock){+.+...}, at: > [] aio_free_ring+0x44/0x160 > [ 222.329081] > which lock already depends on the new lock. > > [ 222.329125] > the existing dependency chain (in reverse order) is: > [ 222.329166] > -> #2 (&(>private_lock)->rlock){+.+...}: > [ 222.329211][] lock_acquire+0x93/0x1c0 > [ 222.329248][] _raw_spin_lock+0x40/0x80 > [ 222.329285][] __set_page_dirty_buffers+0x2d/0xb0 > [ 222.331243][] set_page_dirty+0x3a/0x60 > [ 222.334437][] unmap_single_vma+0x62f/0x830 > [ 222.337633][] unmap_vmas+0x49/0x90 > [ 222.340819][] unmap_region+0x9d/0x110 > [ 222.343968][] do_munmap+0x226/0x3b0 > [ 222.346689][] vm_munmap+0x44/0x60 > [ 222.349741][] SyS_munmap+0x22/0x30 > [ 222.352758][] tracesys+0xdd/0xe2 > [ 222.355735] > -> #1 (&(ptlock_ptr(page))->rlock#2){+.+...}: > [ 222.361611][] lock_acquire+0x93/0x1c0 > [ 222.364589][] _raw_spin_lock+0x40/0x80 > [ 222.367200][] __page_check_address+0x98/0x160 > [ 222.370168][] page_mkclean+0xfe/0x1c0 > [ 222.373120][] clear_page_dirty_for_io+0x60/0x100 > [ 222.376076][] mpage_submit_page+0x47/0x80 > [ 222.379015][] mpage_process_page_bufs+0x110/0x130 > [ 222.381955][] > mpage_prepare_extent_to_map+0x22b/0x2f0 > [ 222.384895][] ext4_writepages+0x4ef/0x1050 > [ 222.387839][] do_writepages+0x21/0x50 > [ 222.390786][] > __filemap_fdatawrite_range+0x59/0x60 > [ 222.393747][] > filemap_write_and_wait_range+0x2d/0x70 > [ 222.396729][] ext4_sync_file+0xba/0x4d0 > [ 222.399714][] do_fsync+0x51/0x80 > [ 222.402317][] SyS_fsync+0x10/0x20 > [ 222.405240][] tracesys+0xdd/0xe2 > [ 222.407760] > -> #0 (>i_mmap_mutex){+.+...}: > [ 222.413349][] __lock_acquire+0x1786/0x1af0 > [ 222.416127][] lock_acquire+0x93/0x1c0 > [ 222.418826][] mutex_lock_nested+0x77/0x400 > [ 222.421456][] unmap_mapping_range+0x68/0x170 > [ 222.424085][] truncate_pagecache+0x35/0x60 > [ 222.426696][] truncate_setsize+0x12/0x20 > [ 222.428955][] aio_free_ring+0x99/0x160 > [ 222.431509][] SyS_io_setup+0xef1/0xf00 > [ 222.434069][] tracesys+0xdd/0xe2 > [ 222.436308] > other info that might help us debug this: > > [ 222.443857] Chain exists of: >
Re: GPF in aio_migratepage
What is the status of this? If I understand correctly, the crash I saw is different from what Dave saw. There was one patched scheduled for inclusion that fixes Dave's crash. But what about mine? I have been running 3.13-rc2 for a couple of weeks now with your other patch, without seeing it again, which suggests it has helped. But it seems that patch has a locking bug as described by Dave (sleeping under spinlock)? So this appears unsolved as of yet... So I just wanted to check that this was not forgotten. Is there something I can do to help get this sorted out? Should I try to run with unpatched -rc4 for some time to check if it appears again? Anything else? - Kristian. Dave Jones da...@redhat.com writes: On Mon, Dec 02, 2013 at 06:10:46PM +0800, Gu Zheng wrote: Hi Kristian, Dave, Could you please help to check whether the following patch can fix this issue? This introduces some locking bugs.. [ 222.327950] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:616 [ 222.328004] in_atomic(): 1, irqs_disabled(): 0, pid: 12794, name: trinity-child1 [ 222.328044] 1 lock held by trinity-child1/12794: [ 222.328072] #0: ((mapping-private_lock)-rlock){+.+...}, at: [81210a64] aio_free_ring+0x44/0x160 [ 222.328147] CPU: 1 PID: 12794 Comm: trinity-child1 Not tainted 3.13.0-rc2+ #12 [ 222.328268] 0268 880229517d68 8173bc52 [ 222.328320] 880229517d90 8108ad95 880223b6acd0 [ 222.328370] 880229517e08 81741cf3 880229517dc0 [ 222.328421] Call Trace: [ 222.328443] [8173bc52] dump_stack+0x4e/0x7a [ 222.328475] [8108ad95] __might_sleep+0x175/0x200 [ 222.328510] [81741cf3] mutex_lock_nested+0x33/0x400 [ 222.328545] [81179e68] unmap_mapping_range+0x68/0x170 [ 222.328582] [81160a35] truncate_pagecache+0x35/0x60 [ 222.328617] [81160a72] truncate_setsize+0x12/0x20 [ 222.328651] [81210ab9] aio_free_ring+0x99/0x160 [ 222.328684] [81213071] SyS_io_setup+0xef1/0xf00 [ 222.328717] [8174eaa4] tracesys+0xdd/0xe2 [ 222.328769] == [ 222.328804] [ INFO: possible circular locking dependency detected ] [ 222.328838] 3.13.0-rc2+ #12 Not tainted [ 222.328862] --- [ 222.328896] trinity-child1/12794 is trying to acquire lock: [ 222.328928] (mapping-i_mmap_mutex){+.+...}, at: [81179e68] unmap_mapping_range+0x68/0x170 [ 222.328987] but task is already holding lock: [ 222.329020] ((mapping-private_lock)-rlock){+.+...}, at: [81210a64] aio_free_ring+0x44/0x160 [ 222.329081] which lock already depends on the new lock. [ 222.329125] the existing dependency chain (in reverse order) is: [ 222.329166] - #2 ((mapping-private_lock)-rlock){+.+...}: [ 222.329211][810af833] lock_acquire+0x93/0x1c0 [ 222.329248][817454f0] _raw_spin_lock+0x40/0x80 [ 222.329285][811f334d] __set_page_dirty_buffers+0x2d/0xb0 [ 222.331243][8115aada] set_page_dirty+0x3a/0x60 [ 222.334437][81179a7f] unmap_single_vma+0x62f/0x830 [ 222.337633][8117ad19] unmap_vmas+0x49/0x90 [ 222.340819][811804bd] unmap_region+0x9d/0x110 [ 222.343968][811829f6] do_munmap+0x226/0x3b0 [ 222.346689][81182bc4] vm_munmap+0x44/0x60 [ 222.349741][81183b42] SyS_munmap+0x22/0x30 [ 222.352758][8174eaa4] tracesys+0xdd/0xe2 [ 222.355735] - #1 ((ptlock_ptr(page))-rlock#2){+.+...}: [ 222.361611][810af833] lock_acquire+0x93/0x1c0 [ 222.364589][817454f0] _raw_spin_lock+0x40/0x80 [ 222.367200][81186338] __page_check_address+0x98/0x160 [ 222.370168][811864fe] page_mkclean+0xfe/0x1c0 [ 222.373120][8115ad60] clear_page_dirty_for_io+0x60/0x100 [ 222.376076][8124d207] mpage_submit_page+0x47/0x80 [ 222.379015][8124d350] mpage_process_page_bufs+0x110/0x130 [ 222.381955][8124d91b] mpage_prepare_extent_to_map+0x22b/0x2f0 [ 222.384895][8125318f] ext4_writepages+0x4ef/0x1050 [ 222.387839][8115cdf1] do_writepages+0x21/0x50 [ 222.390786][81150959] __filemap_fdatawrite_range+0x59/0x60 [ 222.393747][81150a5d] filemap_write_and_wait_range+0x2d/0x70 [ 222.396729][812498ca] ext4_sync_file+0xba/0x4d0 [ 222.399714][811f1691] do_fsync+0x51/0x80 [ 222.402317][811f1980] SyS_fsync+0x10/0x20 [ 222.405240][8174eaa4] tracesys+0xdd/0xe2 [ 222.407760] - #0 (mapping-i_mmap_mutex){+.+...}: [ 222.413349]
Re: GPF in aio_migratepage
Gu Zheng writes: > Hi Kristian, Dave, > > Could you please help to check whether the following patch can fix this issue? > Signed-off-by: Gu Zheng > --- > fs/aio.c | 28 ++-- > 1 files changed, 10 insertions(+), 18 deletions(-) > Ok. I've applied the patch to 3.13-rc2 and will give it a spin. I will let you know if I encounter the failure again with the patch. Thanks, - Kristian. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: GPF in aio_migratepage
Gu Zheng guz.f...@cn.fujitsu.com writes: Hi Kristian, Dave, Could you please help to check whether the following patch can fix this issue? Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com --- fs/aio.c | 28 ++-- 1 files changed, 10 insertions(+), 18 deletions(-) Ok. I've applied the patch to 3.13-rc2 and will give it a spin. I will let you know if I encounter the failure again with the patch. Thanks, - Kristian. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: GPF in aio_migratepage
Benjamin LaHaise writes: > For Dave: what line is this bug on? Is it the dereference of ctx when > doing spin_lock_irqsave(>completion_lock, flags); or is the > ctx->ring_pages[idx] = new; ? From the 64 bit splat, I'm thinking the > former, which is quite strange given that the clearing of > mapping->private_data is protected by mapping->private_lock. If it's > the latter, we might well need to check if ctx->ring_pages is NULL during > setup. I think I got the same BUG (at least it looks very similar, full details below). The bug is on this line: ctx->ring_pages[idx] = new; Disassembly: af7: 48 89 2c d1mov%rbp,(%rcx,%rdx,8) ctx->ring_pages is 0x (this is x86_64). idx is 13. RCX: RDX: 000d BUG: unable to handle kernel NULL pointer dereference at 0067 So we are de-referencing a pointer that is (page **)-1, causing the crash. If you look closer at the 32-bit dump that Dave gave, you can see that it is similar: 7a2: 89 34 82mov%esi,(%edx,%eax,4) RAX: 6b6b6b6b6b6b6b6b RDX: Though in this case ctx->ring_pages seems to be NULL and idx=old->index seems to be 6b6b6b6b6b6b6b6b, so not completely the same (or maybe I read his dump incorrectly). This is 3.13-rc1. Unfortunately, I do not have a way to reproduce (so far I only saw it this once). But I can see if it turns up again, or should I install -rc2 and see if it goes away? I was not doing anything special at the time, normal desktop load (I was using the evince pdf viewer). Let me know if there is anything else I can do to help track this down? - Kristian. Full details: I put my .config here: http://knielsen-hq.org/config-3.13-rc1-gpf-in-aio-migratepage.txt BUG output: BUG: unable to handle kernel NULL pointer dereference at 0067 IP: [] aio_migratepage+0xb3/0xe4 PGD 0 Oops: 0002 [#1] SMP Modules linked in: tun parport_pc ppdev lp parport bnep rfcomm bluetooth cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative binfmt_misc uinput fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc ext3 jbd loop snd_hda_codec_hdmi hid_generic usbhid hid joydev ums_realtek usb_storage snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support arc4 brcmsmac cordic brcmutil b43 mac80211 cfg80211 ssb mmc_core rfkill rng_core pcmcia pcmcia_core nouveau mxm_wmi wmi x86_pkg_temp_thermal coretemp snd_hda_intel kvm_intel snd_hda_codec snd_hwdep snd_pcm_oss kvm snd_mixer_oss snd_seq_midi snd_seq_midi_event snd_pcm crc32c_intel snd_rawmidi snd_page_alloc snd_seq ghash_clmulni_intel snd_timer snd_seq_device lpc_ich aesni_intel mfd_core ttm battery aes_x86_64 ablk_helper drm_kms_helper cryptd lrw gf128mul drm glue_helper psmouse snd pcspkr serio_raw i2c_i801 evdev ehci_pci soundcore ehci_hcd bcma ac acpi_cpufreq video button processor ext4 crc16 jbd2 mbc r_mod cdrom crc_t10dif crct10dif_common microcode ahci libahci xhci_hcd libata usbcore scsi_mod usb_common fan thermal thermal_sys r8169 mii CPU: 2 PID: 15596 Comm: evince Not tainted 3.13.0-rc1-kn #1 Hardware name: Compal PBL2021/Base Board Product Name, BIOS 2.40 08/26/2011 task: 88010322f7c0 ti: 880102b48000 task.ti: 880102b48000 RIP: 0010:[] [] aio_migratepage+0xb3/0xe4 RSP: 0018:880102b49798 EFLAGS: 00010213 RAX: 0286 RBX: ea00038f1640 RCX: RDX: 000d RSI: ea00038f1640 RDI: ea00038f1640 RBP: ea0007b6a800 R08: R09: 000d R10: 0038 R11: ea0007b6a800 R12: 880144a30d00 R13: R14: 88014ba5b1f8 R15: 880144a30ec4 FS: 7f68ecfe8960() GS:88024f48() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0067 CR3: 51ee8000 CR4: 000407e0 Stack: 000e 0286 88024f7f6d80 ea00038f1640 ea0007b6a800 88014ba5b170 0001 0001 810ffc68 88014ba5b1a8 Call Trace: [] ? move_to_new_page+0x84/0x1ab [] ? get_page+0x9/0x25 [] ? migrate_pages+0x330/0x524 [] ? isolate_freepages_block+0x237/0x237 [] ? compact_zone+0x13a/0x301 [] ? compact_zone_order+0x94/0xa7 [] ? try_to_compact_pages+0x98/0xec [] ? __alloc_pages_direct_compact+0xa9/0x19a [] ? __alloc_pages_nodemask+0x46f/0x7f3 [] ? __kmalloc_reserve.isra.42+0x2a/0x6d [] ? alloc_pages_current+0xac/0xc6 [] ? sock_alloc_send_pskb+0x1fc/0x345 [] ? memcpy_fromiovecend+0x48/0x6f [] ? skb_copy_datagram_from_iovec+0x128/0x1f2 [] ? sk_wake_async+0x19/0x3c [] ? unix_stream_sendmsg+0x12e/0x2e9 [] ? sock_aio_write+0xc0/0xd5 [] ? set_restore_sigmask+0x2d/0x2d [] ? do_sync_readv_writev+0x48/0x6b [] ? sock_alloc_file+0x119/0x119 [] ? do_readv_writev+0xb4/0x121 [] ? sock_alloc_file+0x119/0x119 [] ? __switch_to+0x1b1/0x3de [] ? fget_light+0x6b/0x7c [] ? fdget+0xe/0x17 []
Re: GPF in aio_migratepage
Benjamin LaHaise b...@kvack.org writes: For Dave: what line is this bug on? Is it the dereference of ctx when doing spin_lock_irqsave(ctx-completion_lock, flags); or is the ctx-ring_pages[idx] = new; ? From the 64 bit splat, I'm thinking the former, which is quite strange given that the clearing of mapping-private_data is protected by mapping-private_lock. If it's the latter, we might well need to check if ctx-ring_pages is NULL during setup. I think I got the same BUG (at least it looks very similar, full details below). The bug is on this line: ctx-ring_pages[idx] = new; Disassembly: af7: 48 89 2c d1mov%rbp,(%rcx,%rdx,8) ctx-ring_pages is 0x (this is x86_64). idx is 13. RCX: RDX: 000d BUG: unable to handle kernel NULL pointer dereference at 0067 So we are de-referencing a pointer that is (page **)-1, causing the crash. If you look closer at the 32-bit dump that Dave gave, you can see that it is similar: 7a2: 89 34 82mov%esi,(%edx,%eax,4) RAX: 6b6b6b6b6b6b6b6b RDX: Though in this case ctx-ring_pages seems to be NULL and idx=old-index seems to be 6b6b6b6b6b6b6b6b, so not completely the same (or maybe I read his dump incorrectly). This is 3.13-rc1. Unfortunately, I do not have a way to reproduce (so far I only saw it this once). But I can see if it turns up again, or should I install -rc2 and see if it goes away? I was not doing anything special at the time, normal desktop load (I was using the evince pdf viewer). Let me know if there is anything else I can do to help track this down? - Kristian. Full details: I put my .config here: http://knielsen-hq.org/config-3.13-rc1-gpf-in-aio-migratepage.txt BUG output: BUG: unable to handle kernel NULL pointer dereference at 0067 IP: [8113d73f] aio_migratepage+0xb3/0xe4 PGD 0 Oops: 0002 [#1] SMP Modules linked in: tun parport_pc ppdev lp parport bnep rfcomm bluetooth cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative binfmt_misc uinput fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc ext3 jbd loop snd_hda_codec_hdmi hid_generic usbhid hid joydev ums_realtek usb_storage snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support arc4 brcmsmac cordic brcmutil b43 mac80211 cfg80211 ssb mmc_core rfkill rng_core pcmcia pcmcia_core nouveau mxm_wmi wmi x86_pkg_temp_thermal coretemp snd_hda_intel kvm_intel snd_hda_codec snd_hwdep snd_pcm_oss kvm snd_mixer_oss snd_seq_midi snd_seq_midi_event snd_pcm crc32c_intel snd_rawmidi snd_page_alloc snd_seq ghash_clmulni_intel snd_timer snd_seq_device lpc_ich aesni_intel mfd_core ttm battery aes_x86_64 ablk_helper drm_kms_helper cryptd lrw gf128mul drm glue_helper psmouse snd pcspkr serio_raw i2c_i801 evdev ehci_pci soundcore ehci_hcd bcma ac acpi_cpufreq video button processor ext4 crc16 jbd2 mbc r_mod cdrom crc_t10dif crct10dif_common microcode ahci libahci xhci_hcd libata usbcore scsi_mod usb_common fan thermal thermal_sys r8169 mii CPU: 2 PID: 15596 Comm: evince Not tainted 3.13.0-rc1-kn #1 Hardware name: Compal PBL2021/Base Board Product Name, BIOS 2.40 08/26/2011 task: 88010322f7c0 ti: 880102b48000 task.ti: 880102b48000 RIP: 0010:[8113d73f] [8113d73f] aio_migratepage+0xb3/0xe4 RSP: 0018:880102b49798 EFLAGS: 00010213 RAX: 0286 RBX: ea00038f1640 RCX: RDX: 000d RSI: ea00038f1640 RDI: ea00038f1640 RBP: ea0007b6a800 R08: R09: 000d R10: 0038 R11: ea0007b6a800 R12: 880144a30d00 R13: R14: 88014ba5b1f8 R15: 880144a30ec4 FS: 7f68ecfe8960() GS:88024f48() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0067 CR3: 51ee8000 CR4: 000407e0 Stack: 000e 0286 88024f7f6d80 ea00038f1640 ea0007b6a800 88014ba5b170 0001 0001 810ffc68 88014ba5b1a8 Call Trace: [810ffc68] ? move_to_new_page+0x84/0x1ab [810cbcbd] ? get_page+0x9/0x25 [8110019e] ? migrate_pages+0x330/0x524 [810dac77] ? isolate_freepages_block+0x237/0x237 [810db651] ? compact_zone+0x13a/0x301 [810dba3e] ? compact_zone_order+0x94/0xa7 [810dbae9] ? try_to_compact_pages+0x98/0xec [8138ef42] ? __alloc_pages_direct_compact+0xa9/0x19a [810c8567] ? __alloc_pages_nodemask+0x46f/0x7f3 [812cf2bc] ? __kmalloc_reserve.isra.42+0x2a/0x6d [810f64df] ? alloc_pages_current+0xac/0xc6 [812cbd47] ? sock_alloc_send_pskb+0x1fc/0x345 [812d2625] ? memcpy_fromiovecend+0x48/0x6f [812d2ac5] ? skb_copy_datagram_from_iovec+0x128/0x1f2 [812ca529] ? sk_wake_async+0x19/0x3c [8134c605] ?
Re: [ext3/4] PROBLEM: fdatasync not syncing appended data (w/test program)
Jan Kara writes: > On Mon 03-09-12 10:45:15, Kristian Nielsen wrote: >> It appears that ext3 and ext4 fdatasync() does not fully sync data to >> disk. Specifically, when new data is written at the end (so that the file >> length is increased), not all of the new data is synced by fdatasync(). > The culprit is that we forget to update i_datasync_tid when we change only > inode size. Thus inode is not forced to disk during fdatasync(). I will send > fixes for this in a moment. Cool, thanks for the quick response! Glad that we get this fixed. - Kristian. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ext3/4] PROBLEM: fdatasync not syncing appended data (w/test program)
Jan Kara j...@suse.cz writes: On Mon 03-09-12 10:45:15, Kristian Nielsen wrote: It appears that ext3 and ext4 fdatasync() does not fully sync data to disk. Specifically, when new data is written at the end (so that the file length is increased), not all of the new data is synced by fdatasync(). The culprit is that we forget to update i_datasync_tid when we change only inode size. Thus inode is not forced to disk during fdatasync(). I will send fixes for this in a moment. Cool, thanks for the quick response! Glad that we get this fixed. - Kristian. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[ext3/4] PROBLEM: fdatasync not syncing appended data (w/test program)
It appears that ext3 and ext4 fdatasync() does not fully sync data to disk. Specifically, when new data is written at the end (so that the file length is increased), not all of the new data is synced by fdatasync(). It looks as if the problem is when the new data fits in the last allocated page for the file - then ext3/4 flushes the new data page to disk, but _not_ the new metadata with the longer file size. This seems in violation of how fdatasync() is supposed to perform, ie. from the Linux man page: fdatasync() ... does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled ... For example, changes to st_atime or st_mtime ... do not require flushing ... On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush. It also causes real problems - it was found because it causes silent data corruption on ext3/ext4 of MySQL/MariaDB in case of a crash. XFS appears not to be affected. To reproduce, I run code like this (full test program attached) with disk cache disabled, and pull the power plug: for (seq = 0; seq < COUNT; ++seq) { snprintf(buf, sizeof(buf), "%7d\n", seq); write(append_fd, buf, RECSIZE); fdatasync(append_fd); memcpy(buf2 + (seq * RECSIZE) % BLOCKSIZE, buf, RECSIZE); pwrite(inplace_fd, buf2, BLOCKSIZE, 0); fdatasync(inplace_fd); } The append_fd is a new file that is appended to in each write; the inplace_fd is a pre-allocated file that does not change size. After a crash the two files should always be in sync (or append_fd can be one record ahead). But on ext3 and ext4, append_fd typically lacks several records. This does not happen on XFS, or if fdatasync(append_fd) is replaced with fsync(append_fd). I also tested this in kvm, and used strace on the host process to see exactly which writes the guest does. Each fdatasync(append_fd) causes a flush to disk of the data page of append_fd, but not of the inode (or journal) page. So even if fdatasync() is by design not suitable for appending data to a file, there still seems to be a bug, though then a performance rather than a correctness bug: there is no reason to flush the new data block if the increased file length is not flushed also. I tested the following kernels, all show the problem: Linux vm-lucid-amd64 2.6.32-39-server #86-Ubuntu SMP Mon Feb 13 23:15:11 UTC 2012 x86_64 GNU/Linux Linux thor 2.6.32-42-generic #95-Ubuntu SMP Wed Jul 25 15:57:54 UTC 2012 i686 GNU/Linux Linux archie 3.6.0-rc2-DEB #175 SMP PREEMPT Fri Aug 17 05:00:46 IST 2012 x86_64 GNU/Linux Googling turned up this patch, which may be relevant (but I am not sufficiently familiar with the ext3/ext4 code to really know): http://osdir.com/ml/file-systems.ext4/2008-02/msg00128.html To run the test program, just `gcc ext4bug-minimal.c && ./a.out`. After crash, the last record in each file can be compared with eg. tail -1 append.log perl -lne '$x = $_ if $_ > $x; END {print $x}' inplace.log Please let me know if there is any additional information that I can supply to help. (BTW, the problem is fixed/worked-around in MariaDB by using fsync() instead of fdatasync() when a file is extended.) - Kristian. #include #include #include #include #include #include #include #include #define BLOCKSIZE 4096 #define RECSIZE 8 #define COUNT 100 static void do_test(int append_fd, int inplace_fd, int use_fdatasync) { char buf[RECSIZE+1]; char buf2[BLOCKSIZE]; int seq; int res; memset(buf2, 0, sizeof(buf2)); for (seq = 0; seq < COUNT; ++seq) { printf("Adding log entry %d...\n", seq); snprintf(buf, sizeof(buf), "%7d\n", seq); res = write(append_fd, buf, RECSIZE); if (res != RECSIZE) { fprintf(stderr, "Unexpected return %d from write() (errno=%d: %s)\n", res, errno, strerror(errno)); exit(1); } res = use_fdatasync ? fdatasync(append_fd) : fsync(append_fd); if (res) { fprintf(stderr, "Error in fdatasync() of append file (errno=%d: %s)\n", errno, strerror(errno)); exit(1); } memcpy(buf2 + (seq * RECSIZE) % BLOCKSIZE, buf, RECSIZE); res = pwrite(inplace_fd, buf2, BLOCKSIZE, 0); if (res != BLOCKSIZE) { fprintf(stderr, "Unexpected return %d from pwrite() (errno=%d: %s)\n", res, errno, strerror(errno)); exit(1); } res = fdatasync(inplace_fd); if (res) { fprintf(stderr, "Error in fdatasync() of inplace file (errno=%d: %s)\n", errno, strerror(errno)); exit(1); } } } int main(int argc, char *argv[]) { int append_fd, inplace_fd; int res; int use_fdatasync = 1; unsigned char buf[BLOCKSIZE]; if (argc > 1 && !strcmp("--fsync", argv[1])) use_fdatasync = 0; append_fd = open("append.log", O_WRONLY|O_CREAT|O_TRUNC, 0666); if (append_fd < 0) { fprintf(stderr, "Failed to open
[ext3/4] PROBLEM: fdatasync not syncing appended data (w/test program)
It appears that ext3 and ext4 fdatasync() does not fully sync data to disk. Specifically, when new data is written at the end (so that the file length is increased), not all of the new data is synced by fdatasync(). It looks as if the problem is when the new data fits in the last allocated page for the file - then ext3/4 flushes the new data page to disk, but _not_ the new metadata with the longer file size. This seems in violation of how fdatasync() is supposed to perform, ie. from the Linux man page: fdatasync() ... does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled ... For example, changes to st_atime or st_mtime ... do not require flushing ... On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush. It also causes real problems - it was found because it causes silent data corruption on ext3/ext4 of MySQL/MariaDB in case of a crash. XFS appears not to be affected. To reproduce, I run code like this (full test program attached) with disk cache disabled, and pull the power plug: for (seq = 0; seq COUNT; ++seq) { snprintf(buf, sizeof(buf), %7d\n, seq); write(append_fd, buf, RECSIZE); fdatasync(append_fd); memcpy(buf2 + (seq * RECSIZE) % BLOCKSIZE, buf, RECSIZE); pwrite(inplace_fd, buf2, BLOCKSIZE, 0); fdatasync(inplace_fd); } The append_fd is a new file that is appended to in each write; the inplace_fd is a pre-allocated file that does not change size. After a crash the two files should always be in sync (or append_fd can be one record ahead). But on ext3 and ext4, append_fd typically lacks several records. This does not happen on XFS, or if fdatasync(append_fd) is replaced with fsync(append_fd). I also tested this in kvm, and used strace on the host process to see exactly which writes the guest does. Each fdatasync(append_fd) causes a flush to disk of the data page of append_fd, but not of the inode (or journal) page. So even if fdatasync() is by design not suitable for appending data to a file, there still seems to be a bug, though then a performance rather than a correctness bug: there is no reason to flush the new data block if the increased file length is not flushed also. I tested the following kernels, all show the problem: Linux vm-lucid-amd64 2.6.32-39-server #86-Ubuntu SMP Mon Feb 13 23:15:11 UTC 2012 x86_64 GNU/Linux Linux thor 2.6.32-42-generic #95-Ubuntu SMP Wed Jul 25 15:57:54 UTC 2012 i686 GNU/Linux Linux archie 3.6.0-rc2-DEB #175 SMP PREEMPT Fri Aug 17 05:00:46 IST 2012 x86_64 GNU/Linux Googling turned up this patch, which may be relevant (but I am not sufficiently familiar with the ext3/ext4 code to really know): http://osdir.com/ml/file-systems.ext4/2008-02/msg00128.html To run the test program, just `gcc ext4bug-minimal.c ./a.out`. After crash, the last record in each file can be compared with eg. tail -1 append.log perl -lne '$x = $_ if $_ $x; END {print $x}' inplace.log Please let me know if there is any additional information that I can supply to help. (BTW, the problem is fixed/worked-around in MariaDB by using fsync() instead of fdatasync() when a file is extended.) - Kristian. #include string.h #include stdlib.h #include stdio.h #include errno.h #include unistd.h #include sys/types.h #include sys/stat.h #include fcntl.h #define BLOCKSIZE 4096 #define RECSIZE 8 #define COUNT 100 static void do_test(int append_fd, int inplace_fd, int use_fdatasync) { char buf[RECSIZE+1]; char buf2[BLOCKSIZE]; int seq; int res; memset(buf2, 0, sizeof(buf2)); for (seq = 0; seq COUNT; ++seq) { printf(Adding log entry %d...\n, seq); snprintf(buf, sizeof(buf), %7d\n, seq); res = write(append_fd, buf, RECSIZE); if (res != RECSIZE) { fprintf(stderr, Unexpected return %d from write() (errno=%d: %s)\n, res, errno, strerror(errno)); exit(1); } res = use_fdatasync ? fdatasync(append_fd) : fsync(append_fd); if (res) { fprintf(stderr, Error in fdatasync() of append file (errno=%d: %s)\n, errno, strerror(errno)); exit(1); } memcpy(buf2 + (seq * RECSIZE) % BLOCKSIZE, buf, RECSIZE); res = pwrite(inplace_fd, buf2, BLOCKSIZE, 0); if (res != BLOCKSIZE) { fprintf(stderr, Unexpected return %d from pwrite() (errno=%d: %s)\n, res, errno, strerror(errno)); exit(1); } res = fdatasync(inplace_fd); if (res) { fprintf(stderr, Error in fdatasync() of inplace file (errno=%d: %s)\n, errno, strerror(errno)); exit(1); } } } int main(int argc, char *argv[]) { int append_fd, inplace_fd; int res; int use_fdatasync = 1; unsigned char buf[BLOCKSIZE]; if (argc 1 !strcmp(--fsync, argv[1])) use_fdatasync = 0; append_fd = open(append.log, O_WRONLY|O_CREAT|O_TRUNC, 0666); if (append_fd 0)