Re: GPF in aio_migratepage

2014-01-21 Thread Kristian Nielsen
Kristian Nielsen  writes:

> Benjamin LaHaise  writes:
>
>> Linus just pushed out 3.13-rc5 that has changes to aio_migratepage() that 
>> should make it much more robust, as well as other fixes.  Can you please 
>> give it a spin as well and let me know if it works?  Thanks a bunch!
>
> Ok, will do.

JFYI, I have been running -rc5 (and later -rc7) for a month now without seeing
anything like this again. So hopefully the issue is solved.

 - Kristian.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPF in aio_migratepage

2014-01-21 Thread Kristian Nielsen
Kristian Nielsen kniel...@knielsen-hq.org writes:

 Benjamin LaHaise b...@kvack.org writes:

 Linus just pushed out 3.13-rc5 that has changes to aio_migratepage() that 
 should make it much more robust, as well as other fixes.  Can you please 
 give it a spin as well and let me know if it works?  Thanks a bunch!

 Ok, will do.

JFYI, I have been running -rc5 (and later -rc7) for a month now without seeing
anything like this again. So hopefully the issue is solved.

 - Kristian.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPF in aio_migratepage

2013-12-22 Thread Kristian Nielsen
Benjamin LaHaise  writes:

> Linus just pushed out 3.13-rc5 that has changes to aio_migratepage() that 
> should make it much more robust, as well as other fixes.  Can you please 
> give it a spin as well and let me know if it works?  Thanks a bunch!

Ok, will do.

 - Kristian.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPF in aio_migratepage

2013-12-22 Thread Kristian Nielsen
Gu Zheng  writes:

> This issue seems like a problem that has been fixed yet:
> http://article.gmane.org/gmane.linux.kernel.aio.general/3741/match=potential+use+after+free+aio%5fmigratepage
> commit 5e9ae2e5da0beb93f8557fc92a8f4fbc05ea448f
> aio: fix use-after-free in aio_migratepage
> So I think maybe you can run with latest Linus' tree or 3.13-rc4 to
> check whether this issue still appears.

Hm. I checked that thread, and as far as I can see, that patch was already
included in the tree I hit the BUG in (3.13-rc1):


http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5e9ae2e5da0beb93f8557fc92a8f4fbc05ea448f

There are other changes in that area since 3.13-rc1 though.

Anyway, I am now running with 3.13-rc4 and will report if I see anything.
Given that I do not have any way to reproduce (I only ever saw this once),
this seems the best that can be done for now.

Thanks for following up on this!

 - Kristian.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPF in aio_migratepage

2013-12-22 Thread Kristian Nielsen
Gu Zheng guz.f...@cn.fujitsu.com writes:

 This issue seems like a problem that has been fixed yet:
 http://article.gmane.org/gmane.linux.kernel.aio.general/3741/match=potential+use+after+free+aio%5fmigratepage
 commit 5e9ae2e5da0beb93f8557fc92a8f4fbc05ea448f
 aio: fix use-after-free in aio_migratepage
 So I think maybe you can run with latest Linus' tree or 3.13-rc4 to
 check whether this issue still appears.

Hm. I checked that thread, and as far as I can see, that patch was already
included in the tree I hit the BUG in (3.13-rc1):


http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5e9ae2e5da0beb93f8557fc92a8f4fbc05ea448f

There are other changes in that area since 3.13-rc1 though.

Anyway, I am now running with 3.13-rc4 and will report if I see anything.
Given that I do not have any way to reproduce (I only ever saw this once),
this seems the best that can be done for now.

Thanks for following up on this!

 - Kristian.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPF in aio_migratepage

2013-12-22 Thread Kristian Nielsen
Benjamin LaHaise b...@kvack.org writes:

 Linus just pushed out 3.13-rc5 that has changes to aio_migratepage() that 
 should make it much more robust, as well as other fixes.  Can you please 
 give it a spin as well and let me know if it works?  Thanks a bunch!

Ok, will do.

 - Kristian.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPF in aio_migratepage

2013-12-15 Thread Kristian Nielsen
What is the status of this?

If I understand correctly, the crash I saw is different from what Dave
saw.

There was one patched scheduled for inclusion that fixes Dave's crash. But
what about mine? I have been running 3.13-rc2 for a couple of weeks now with
your other patch, without seeing it again, which suggests it has helped. But
it seems that patch has a locking bug as described by Dave (sleeping under
spinlock)? So this appears unsolved as of yet...

So I just wanted to check that this was not forgotten. Is there something I
can do to help get this sorted out? Should I try to run with unpatched -rc4
for some time to check if it appears again? Anything else?

 - Kristian.

Dave Jones  writes:

> On Mon, Dec 02, 2013 at 06:10:46PM +0800, Gu Zheng wrote:
>  > Hi Kristian, Dave,
>  > 
>  > Could you please help to check whether the following patch can fix this 
> issue?
>
> This introduces some locking bugs..
>
>
> [  222.327950] BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:616
> [  222.328004] in_atomic(): 1, irqs_disabled(): 0, pid: 12794, name: 
> trinity-child1
> [  222.328044] 1 lock held by trinity-child1/12794:
> [  222.328072]  #0:  (&(>private_lock)->rlock){+.+...}, at: 
> [] aio_free_ring+0x44/0x160
> [  222.328147] CPU: 1 PID: 12794 Comm: trinity-child1 Not tainted 3.13.0-rc2+ 
> #12 
> [  222.328268]  0268 880229517d68 8173bc52 
> 
> [  222.328320]  880229517d90 8108ad95 880223b6acd0 
> 
> [  222.328370]   880229517e08 81741cf3 
> 880229517dc0
> [  222.328421] Call Trace:
> [  222.328443]  [] dump_stack+0x4e/0x7a
> [  222.328475]  [] __might_sleep+0x175/0x200
> [  222.328510]  [] mutex_lock_nested+0x33/0x400
> [  222.328545]  [] unmap_mapping_range+0x68/0x170
> [  222.328582]  [] truncate_pagecache+0x35/0x60
> [  222.328617]  [] truncate_setsize+0x12/0x20
> [  222.328651]  [] aio_free_ring+0x99/0x160
> [  222.328684]  [] SyS_io_setup+0xef1/0xf00
> [  222.328717]  [] tracesys+0xdd/0xe2
>
> [  222.328769] ==
> [  222.328804] [ INFO: possible circular locking dependency detected ]
> [  222.328838] 3.13.0-rc2+ #12 Not tainted
> [  222.328862] ---
> [  222.328896] trinity-child1/12794 is trying to acquire lock:
> [  222.328928]  (>i_mmap_mutex){+.+...}, at: [] 
> unmap_mapping_range+0x68/0x170
> [  222.328987] 
> but task is already holding lock:
> [  222.329020]  (&(>private_lock)->rlock){+.+...}, at: 
> [] aio_free_ring+0x44/0x160
> [  222.329081] 
> which lock already depends on the new lock.
>
> [  222.329125] 
> the existing dependency chain (in reverse order) is:
> [  222.329166] 
> -> #2 (&(>private_lock)->rlock){+.+...}:
> [  222.329211][] lock_acquire+0x93/0x1c0
> [  222.329248][] _raw_spin_lock+0x40/0x80
> [  222.329285][] __set_page_dirty_buffers+0x2d/0xb0
> [  222.331243][] set_page_dirty+0x3a/0x60
> [  222.334437][] unmap_single_vma+0x62f/0x830
> [  222.337633][] unmap_vmas+0x49/0x90
> [  222.340819][] unmap_region+0x9d/0x110
> [  222.343968][] do_munmap+0x226/0x3b0
> [  222.346689][] vm_munmap+0x44/0x60
> [  222.349741][] SyS_munmap+0x22/0x30
> [  222.352758][] tracesys+0xdd/0xe2
> [  222.355735] 
> -> #1 (&(ptlock_ptr(page))->rlock#2){+.+...}:
> [  222.361611][] lock_acquire+0x93/0x1c0
> [  222.364589][] _raw_spin_lock+0x40/0x80
> [  222.367200][] __page_check_address+0x98/0x160
> [  222.370168][] page_mkclean+0xfe/0x1c0
> [  222.373120][] clear_page_dirty_for_io+0x60/0x100
> [  222.376076][] mpage_submit_page+0x47/0x80
> [  222.379015][] mpage_process_page_bufs+0x110/0x130
> [  222.381955][] 
> mpage_prepare_extent_to_map+0x22b/0x2f0
> [  222.384895][] ext4_writepages+0x4ef/0x1050
> [  222.387839][] do_writepages+0x21/0x50
> [  222.390786][] 
> __filemap_fdatawrite_range+0x59/0x60
> [  222.393747][] 
> filemap_write_and_wait_range+0x2d/0x70
> [  222.396729][] ext4_sync_file+0xba/0x4d0
> [  222.399714][] do_fsync+0x51/0x80
> [  222.402317][] SyS_fsync+0x10/0x20
> [  222.405240][] tracesys+0xdd/0xe2
> [  222.407760] 
> -> #0 (>i_mmap_mutex){+.+...}:
> [  222.413349][] __lock_acquire+0x1786/0x1af0
> [  222.416127][] lock_acquire+0x93/0x1c0
> [  222.418826][] mutex_lock_nested+0x77/0x400
> [  222.421456][] unmap_mapping_range+0x68/0x170
> [  222.424085][] truncate_pagecache+0x35/0x60
> [  222.426696][] truncate_setsize+0x12/0x20
> [  222.428955][] aio_free_ring+0x99/0x160
> [  222.431509][] SyS_io_setup+0xef1/0xf00
> [  222.434069][] tracesys+0xdd/0xe2
> [  222.436308] 
> other info that might help us debug this:
>
> [  222.443857] Chain exists of:
>   

Re: GPF in aio_migratepage

2013-12-15 Thread Kristian Nielsen
What is the status of this?

If I understand correctly, the crash I saw is different from what Dave
saw.

There was one patched scheduled for inclusion that fixes Dave's crash. But
what about mine? I have been running 3.13-rc2 for a couple of weeks now with
your other patch, without seeing it again, which suggests it has helped. But
it seems that patch has a locking bug as described by Dave (sleeping under
spinlock)? So this appears unsolved as of yet...

So I just wanted to check that this was not forgotten. Is there something I
can do to help get this sorted out? Should I try to run with unpatched -rc4
for some time to check if it appears again? Anything else?

 - Kristian.

Dave Jones da...@redhat.com writes:

 On Mon, Dec 02, 2013 at 06:10:46PM +0800, Gu Zheng wrote:
   Hi Kristian, Dave,
   
   Could you please help to check whether the following patch can fix this 
 issue?

 This introduces some locking bugs..


 [  222.327950] BUG: sleeping function called from invalid context at 
 kernel/locking/mutex.c:616
 [  222.328004] in_atomic(): 1, irqs_disabled(): 0, pid: 12794, name: 
 trinity-child1
 [  222.328044] 1 lock held by trinity-child1/12794:
 [  222.328072]  #0:  ((mapping-private_lock)-rlock){+.+...}, at: 
 [81210a64] aio_free_ring+0x44/0x160
 [  222.328147] CPU: 1 PID: 12794 Comm: trinity-child1 Not tainted 3.13.0-rc2+ 
 #12 
 [  222.328268]  0268 880229517d68 8173bc52 
 
 [  222.328320]  880229517d90 8108ad95 880223b6acd0 
 
 [  222.328370]   880229517e08 81741cf3 
 880229517dc0
 [  222.328421] Call Trace:
 [  222.328443]  [8173bc52] dump_stack+0x4e/0x7a
 [  222.328475]  [8108ad95] __might_sleep+0x175/0x200
 [  222.328510]  [81741cf3] mutex_lock_nested+0x33/0x400
 [  222.328545]  [81179e68] unmap_mapping_range+0x68/0x170
 [  222.328582]  [81160a35] truncate_pagecache+0x35/0x60
 [  222.328617]  [81160a72] truncate_setsize+0x12/0x20
 [  222.328651]  [81210ab9] aio_free_ring+0x99/0x160
 [  222.328684]  [81213071] SyS_io_setup+0xef1/0xf00
 [  222.328717]  [8174eaa4] tracesys+0xdd/0xe2

 [  222.328769] ==
 [  222.328804] [ INFO: possible circular locking dependency detected ]
 [  222.328838] 3.13.0-rc2+ #12 Not tainted
 [  222.328862] ---
 [  222.328896] trinity-child1/12794 is trying to acquire lock:
 [  222.328928]  (mapping-i_mmap_mutex){+.+...}, at: [81179e68] 
 unmap_mapping_range+0x68/0x170
 [  222.328987] 
 but task is already holding lock:
 [  222.329020]  ((mapping-private_lock)-rlock){+.+...}, at: 
 [81210a64] aio_free_ring+0x44/0x160
 [  222.329081] 
 which lock already depends on the new lock.

 [  222.329125] 
 the existing dependency chain (in reverse order) is:
 [  222.329166] 
 - #2 ((mapping-private_lock)-rlock){+.+...}:
 [  222.329211][810af833] lock_acquire+0x93/0x1c0
 [  222.329248][817454f0] _raw_spin_lock+0x40/0x80
 [  222.329285][811f334d] __set_page_dirty_buffers+0x2d/0xb0
 [  222.331243][8115aada] set_page_dirty+0x3a/0x60
 [  222.334437][81179a7f] unmap_single_vma+0x62f/0x830
 [  222.337633][8117ad19] unmap_vmas+0x49/0x90
 [  222.340819][811804bd] unmap_region+0x9d/0x110
 [  222.343968][811829f6] do_munmap+0x226/0x3b0
 [  222.346689][81182bc4] vm_munmap+0x44/0x60
 [  222.349741][81183b42] SyS_munmap+0x22/0x30
 [  222.352758][8174eaa4] tracesys+0xdd/0xe2
 [  222.355735] 
 - #1 ((ptlock_ptr(page))-rlock#2){+.+...}:
 [  222.361611][810af833] lock_acquire+0x93/0x1c0
 [  222.364589][817454f0] _raw_spin_lock+0x40/0x80
 [  222.367200][81186338] __page_check_address+0x98/0x160
 [  222.370168][811864fe] page_mkclean+0xfe/0x1c0
 [  222.373120][8115ad60] clear_page_dirty_for_io+0x60/0x100
 [  222.376076][8124d207] mpage_submit_page+0x47/0x80
 [  222.379015][8124d350] mpage_process_page_bufs+0x110/0x130
 [  222.381955][8124d91b] 
 mpage_prepare_extent_to_map+0x22b/0x2f0
 [  222.384895][8125318f] ext4_writepages+0x4ef/0x1050
 [  222.387839][8115cdf1] do_writepages+0x21/0x50
 [  222.390786][81150959] 
 __filemap_fdatawrite_range+0x59/0x60
 [  222.393747][81150a5d] 
 filemap_write_and_wait_range+0x2d/0x70
 [  222.396729][812498ca] ext4_sync_file+0xba/0x4d0
 [  222.399714][811f1691] do_fsync+0x51/0x80
 [  222.402317][811f1980] SyS_fsync+0x10/0x20
 [  222.405240][8174eaa4] tracesys+0xdd/0xe2
 [  222.407760] 
 - #0 (mapping-i_mmap_mutex){+.+...}:
 [  222.413349]   

Re: GPF in aio_migratepage

2013-12-02 Thread Kristian Nielsen
Gu Zheng  writes:

> Hi Kristian, Dave,
>
> Could you please help to check whether the following patch can fix this issue?

> Signed-off-by: Gu Zheng 
> ---
>  fs/aio.c |   28 ++--
>  1 files changed, 10 insertions(+), 18 deletions(-)
>

Ok. I've applied the patch to 3.13-rc2 and will give it a spin. I will let you
know if I encounter the failure again with the patch.

Thanks,

 - Kristian.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPF in aio_migratepage

2013-12-02 Thread Kristian Nielsen
Gu Zheng guz.f...@cn.fujitsu.com writes:

 Hi Kristian, Dave,

 Could you please help to check whether the following patch can fix this issue?

 Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
 ---
  fs/aio.c |   28 ++--
  1 files changed, 10 insertions(+), 18 deletions(-)


Ok. I've applied the patch to 3.13-rc2 and will give it a spin. I will let you
know if I encounter the failure again with the patch.

Thanks,

 - Kristian.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: GPF in aio_migratepage

2013-11-30 Thread Kristian Nielsen
Benjamin LaHaise  writes:

> For Dave: what line is this bug on?  Is it the dereference of ctx when 
> doing spin_lock_irqsave(>completion_lock, flags); or is the 
> ctx->ring_pages[idx] = new; ?  From the 64 bit splat, I'm thinking the 
> former, which is quite strange given that the clearing of 
> mapping->private_data is protected by mapping->private_lock.  If it's 
> the latter, we might well need to check if ctx->ring_pages is NULL during 
> setup. 

I think I got the same BUG (at least it looks very similar, full details
below).

The bug is on this line:

ctx->ring_pages[idx] = new;

Disassembly:

af7:   48 89 2c d1mov%rbp,(%rcx,%rdx,8)

ctx->ring_pages is 0x (this is x86_64). idx is 13.

  RCX:   RDX: 000d
  BUG: unable to handle kernel NULL pointer dereference at 0067

So we are de-referencing a pointer that is (page **)-1, causing the crash.

If you look closer at the 32-bit dump that Dave gave, you can see that it is
similar:

 7a2:   89 34 82mov%esi,(%edx,%eax,4)

  RAX: 6b6b6b6b6b6b6b6b  RDX: 

Though in this case ctx->ring_pages seems to be NULL and idx=old->index seems
to be 6b6b6b6b6b6b6b6b, so not completely the same (or maybe I read his dump
incorrectly).

This is 3.13-rc1. Unfortunately, I do not have a way to reproduce (so far I
only saw it this once). But I can see if it turns up again, or should I
install -rc2 and see if it goes away?

I was not doing anything special at the time, normal desktop load (I was using
the evince pdf viewer).

Let me know if there is anything else I can do to help track this down?

 - Kristian.

Full details:

I put my .config here:

http://knielsen-hq.org/config-3.13-rc1-gpf-in-aio-migratepage.txt

BUG output:

BUG: unable to handle kernel NULL pointer dereference at 0067
IP: [] aio_migratepage+0xb3/0xe4
PGD 0 
Oops: 0002 [#1] SMP 
Modules linked in: tun parport_pc ppdev lp parport bnep rfcomm bluetooth 
cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative 
binfmt_misc uinput fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache 
sunrpc ext3 jbd loop snd_hda_codec_hdmi hid_generic usbhid hid joydev 
ums_realtek usb_storage snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support arc4 
brcmsmac cordic brcmutil b43 mac80211 cfg80211 ssb mmc_core rfkill rng_core 
pcmcia pcmcia_core nouveau mxm_wmi wmi x86_pkg_temp_thermal coretemp 
snd_hda_intel kvm_intel snd_hda_codec snd_hwdep snd_pcm_oss kvm snd_mixer_oss 
snd_seq_midi snd_seq_midi_event snd_pcm crc32c_intel snd_rawmidi snd_page_alloc 
snd_seq ghash_clmulni_intel snd_timer snd_seq_device lpc_ich aesni_intel 
mfd_core ttm battery aes_x86_64 ablk_helper drm_kms_helper cryptd lrw gf128mul 
drm glue_helper psmouse snd pcspkr serio_raw i2c_i801 evdev ehci_pci soundcore 
ehci_hcd bcma ac acpi_cpufreq video button processor ext4 crc16 jbd2 mbc
r_mod cdrom crc_t10dif crct10dif_common microcode ahci libahci xhci_hcd libata 
usbcore scsi_mod usb_common fan thermal thermal_sys r8169 mii
CPU: 2 PID: 15596 Comm: evince Not tainted 3.13.0-rc1-kn #1
Hardware name: Compal PBL2021/Base Board Product Name, BIOS 2.40 08/26/2011
task: 88010322f7c0 ti: 880102b48000 task.ti: 880102b48000
RIP: 0010:[]  [] aio_migratepage+0xb3/0xe4
RSP: 0018:880102b49798  EFLAGS: 00010213
RAX: 0286 RBX: ea00038f1640 RCX: 
RDX: 000d RSI: ea00038f1640 RDI: ea00038f1640
RBP: ea0007b6a800 R08:  R09: 000d
R10: 0038 R11: ea0007b6a800 R12: 880144a30d00
R13:  R14: 88014ba5b1f8 R15: 880144a30ec4
FS:  7f68ecfe8960() GS:88024f48() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0067 CR3: 51ee8000 CR4: 000407e0
Stack:
 000e 0286 88024f7f6d80 ea00038f1640
 ea0007b6a800  88014ba5b170 0001
 0001 810ffc68 88014ba5b1a8 
Call Trace:
 [] ? move_to_new_page+0x84/0x1ab
 [] ? get_page+0x9/0x25
 [] ? migrate_pages+0x330/0x524
 [] ? isolate_freepages_block+0x237/0x237
 [] ? compact_zone+0x13a/0x301
 [] ? compact_zone_order+0x94/0xa7
 [] ? try_to_compact_pages+0x98/0xec
 [] ? __alloc_pages_direct_compact+0xa9/0x19a
 [] ? __alloc_pages_nodemask+0x46f/0x7f3
 [] ? __kmalloc_reserve.isra.42+0x2a/0x6d
 [] ? alloc_pages_current+0xac/0xc6
 [] ? sock_alloc_send_pskb+0x1fc/0x345
 [] ? memcpy_fromiovecend+0x48/0x6f
 [] ? skb_copy_datagram_from_iovec+0x128/0x1f2
 [] ? sk_wake_async+0x19/0x3c
 [] ? unix_stream_sendmsg+0x12e/0x2e9
 [] ? sock_aio_write+0xc0/0xd5
 [] ? set_restore_sigmask+0x2d/0x2d
 [] ? do_sync_readv_writev+0x48/0x6b
 [] ? sock_alloc_file+0x119/0x119
 [] ? do_readv_writev+0xb4/0x121
 [] ? sock_alloc_file+0x119/0x119
 [] ? __switch_to+0x1b1/0x3de
 [] ? fget_light+0x6b/0x7c
 [] ? fdget+0xe/0x17
 [] 

Re: GPF in aio_migratepage

2013-11-30 Thread Kristian Nielsen
Benjamin LaHaise b...@kvack.org writes:

 For Dave: what line is this bug on?  Is it the dereference of ctx when 
 doing spin_lock_irqsave(ctx-completion_lock, flags); or is the 
 ctx-ring_pages[idx] = new; ?  From the 64 bit splat, I'm thinking the 
 former, which is quite strange given that the clearing of 
 mapping-private_data is protected by mapping-private_lock.  If it's 
 the latter, we might well need to check if ctx-ring_pages is NULL during 
 setup. 

I think I got the same BUG (at least it looks very similar, full details
below).

The bug is on this line:

ctx-ring_pages[idx] = new;

Disassembly:

af7:   48 89 2c d1mov%rbp,(%rcx,%rdx,8)

ctx-ring_pages is 0x (this is x86_64). idx is 13.

  RCX:   RDX: 000d
  BUG: unable to handle kernel NULL pointer dereference at 0067

So we are de-referencing a pointer that is (page **)-1, causing the crash.

If you look closer at the 32-bit dump that Dave gave, you can see that it is
similar:

 7a2:   89 34 82mov%esi,(%edx,%eax,4)

  RAX: 6b6b6b6b6b6b6b6b  RDX: 

Though in this case ctx-ring_pages seems to be NULL and idx=old-index seems
to be 6b6b6b6b6b6b6b6b, so not completely the same (or maybe I read his dump
incorrectly).

This is 3.13-rc1. Unfortunately, I do not have a way to reproduce (so far I
only saw it this once). But I can see if it turns up again, or should I
install -rc2 and see if it goes away?

I was not doing anything special at the time, normal desktop load (I was using
the evince pdf viewer).

Let me know if there is anything else I can do to help track this down?

 - Kristian.

Full details:

I put my .config here:

http://knielsen-hq.org/config-3.13-rc1-gpf-in-aio-migratepage.txt

BUG output:

BUG: unable to handle kernel NULL pointer dereference at 0067
IP: [8113d73f] aio_migratepage+0xb3/0xe4
PGD 0 
Oops: 0002 [#1] SMP 
Modules linked in: tun parport_pc ppdev lp parport bnep rfcomm bluetooth 
cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative 
binfmt_misc uinput fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache 
sunrpc ext3 jbd loop snd_hda_codec_hdmi hid_generic usbhid hid joydev 
ums_realtek usb_storage snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support arc4 
brcmsmac cordic brcmutil b43 mac80211 cfg80211 ssb mmc_core rfkill rng_core 
pcmcia pcmcia_core nouveau mxm_wmi wmi x86_pkg_temp_thermal coretemp 
snd_hda_intel kvm_intel snd_hda_codec snd_hwdep snd_pcm_oss kvm snd_mixer_oss 
snd_seq_midi snd_seq_midi_event snd_pcm crc32c_intel snd_rawmidi snd_page_alloc 
snd_seq ghash_clmulni_intel snd_timer snd_seq_device lpc_ich aesni_intel 
mfd_core ttm battery aes_x86_64 ablk_helper drm_kms_helper cryptd lrw gf128mul 
drm glue_helper psmouse snd pcspkr serio_raw i2c_i801 evdev ehci_pci soundcore 
ehci_hcd bcma ac acpi_cpufreq video button processor ext4 crc16 jbd2 mbc
r_mod cdrom crc_t10dif crct10dif_common microcode ahci libahci xhci_hcd libata 
usbcore scsi_mod usb_common fan thermal thermal_sys r8169 mii
CPU: 2 PID: 15596 Comm: evince Not tainted 3.13.0-rc1-kn #1
Hardware name: Compal PBL2021/Base Board Product Name, BIOS 2.40 08/26/2011
task: 88010322f7c0 ti: 880102b48000 task.ti: 880102b48000
RIP: 0010:[8113d73f]  [8113d73f] aio_migratepage+0xb3/0xe4
RSP: 0018:880102b49798  EFLAGS: 00010213
RAX: 0286 RBX: ea00038f1640 RCX: 
RDX: 000d RSI: ea00038f1640 RDI: ea00038f1640
RBP: ea0007b6a800 R08:  R09: 000d
R10: 0038 R11: ea0007b6a800 R12: 880144a30d00
R13:  R14: 88014ba5b1f8 R15: 880144a30ec4
FS:  7f68ecfe8960() GS:88024f48() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0067 CR3: 51ee8000 CR4: 000407e0
Stack:
 000e 0286 88024f7f6d80 ea00038f1640
 ea0007b6a800  88014ba5b170 0001
 0001 810ffc68 88014ba5b1a8 
Call Trace:
 [810ffc68] ? move_to_new_page+0x84/0x1ab
 [810cbcbd] ? get_page+0x9/0x25
 [8110019e] ? migrate_pages+0x330/0x524
 [810dac77] ? isolate_freepages_block+0x237/0x237
 [810db651] ? compact_zone+0x13a/0x301
 [810dba3e] ? compact_zone_order+0x94/0xa7
 [810dbae9] ? try_to_compact_pages+0x98/0xec
 [8138ef42] ? __alloc_pages_direct_compact+0xa9/0x19a
 [810c8567] ? __alloc_pages_nodemask+0x46f/0x7f3
 [812cf2bc] ? __kmalloc_reserve.isra.42+0x2a/0x6d
 [810f64df] ? alloc_pages_current+0xac/0xc6
 [812cbd47] ? sock_alloc_send_pskb+0x1fc/0x345
 [812d2625] ? memcpy_fromiovecend+0x48/0x6f
 [812d2ac5] ? skb_copy_datagram_from_iovec+0x128/0x1f2
 [812ca529] ? sk_wake_async+0x19/0x3c
 [8134c605] ? 

Re: [ext3/4] PROBLEM: fdatasync not syncing appended data (w/test program)

2012-09-04 Thread Kristian Nielsen
Jan Kara  writes:

> On Mon 03-09-12 10:45:15, Kristian Nielsen wrote:
>> It appears that ext3 and ext4 fdatasync() does not fully sync data to
>> disk. Specifically, when new data is written at the end (so that the file
>> length is increased), not all of the new data is synced by fdatasync().

> The culprit is that we forget to update i_datasync_tid when we change only
> inode size. Thus inode is not forced to disk during fdatasync(). I will send
> fixes for this in a moment.

Cool, thanks for the quick response! Glad that we get this fixed.

 - Kristian.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ext3/4] PROBLEM: fdatasync not syncing appended data (w/test program)

2012-09-04 Thread Kristian Nielsen
Jan Kara j...@suse.cz writes:

 On Mon 03-09-12 10:45:15, Kristian Nielsen wrote:
 It appears that ext3 and ext4 fdatasync() does not fully sync data to
 disk. Specifically, when new data is written at the end (so that the file
 length is increased), not all of the new data is synced by fdatasync().

 The culprit is that we forget to update i_datasync_tid when we change only
 inode size. Thus inode is not forced to disk during fdatasync(). I will send
 fixes for this in a moment.

Cool, thanks for the quick response! Glad that we get this fixed.

 - Kristian.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[ext3/4] PROBLEM: fdatasync not syncing appended data (w/test program)

2012-09-03 Thread Kristian Nielsen
It appears that ext3 and ext4 fdatasync() does not fully sync data to
disk. Specifically, when new data is written at the end (so that the file
length is increased), not all of the new data is synced by fdatasync().

It looks as if the problem is when the new data fits in the last allocated
page for the file - then ext3/4 flushes the new data page to disk, but _not_
the new metadata with the longer file size.

This seems in violation of how fdatasync() is supposed to perform, ie. from
the Linux man page:

  fdatasync() ... does not flush modified metadata unless that metadata is
  needed in order to allow a subsequent data retrieval to be correctly
  handled ...

  For example, changes to st_atime or st_mtime ... do not require flushing ...

  On the other hand, a change to the file size (st_size, as made by say
  ftruncate(2)), would require a metadata flush.

It also causes real problems - it was found because it causes silent data
corruption on ext3/ext4 of MySQL/MariaDB in case of a crash. XFS appears not
to be affected.

To reproduce, I run code like this (full test program attached) with disk
cache disabled, and pull the power plug:

  for (seq = 0; seq < COUNT; ++seq)
  {
snprintf(buf, sizeof(buf), "%7d\n", seq);
write(append_fd, buf, RECSIZE);
fdatasync(append_fd);
memcpy(buf2 + (seq * RECSIZE) % BLOCKSIZE, buf, RECSIZE);
pwrite(inplace_fd, buf2, BLOCKSIZE, 0);
fdatasync(inplace_fd);
  }

The append_fd is a new file that is appended to in each write; the inplace_fd
is a pre-allocated file that does not change size. After a crash the two files
should always be in sync (or append_fd can be one record ahead). But on ext3
and ext4, append_fd typically lacks several records. This does not happen on
XFS, or if fdatasync(append_fd) is replaced with fsync(append_fd).

I also tested this in kvm, and used strace on the host process to see exactly
which writes the guest does. Each fdatasync(append_fd) causes a flush to disk
of the data page of append_fd, but not of the inode (or journal) page. So even
if fdatasync() is by design not suitable for appending data to a file, there
still seems to be a bug, though then a performance rather than a correctness
bug: there is no reason to flush the new data block if the increased file
length is not flushed also.

I tested the following kernels, all show the problem:

  Linux vm-lucid-amd64 2.6.32-39-server #86-Ubuntu SMP Mon Feb 13 23:15:11 UTC 
2012 x86_64 GNU/Linux
  Linux thor 2.6.32-42-generic #95-Ubuntu SMP Wed Jul 25 15:57:54 UTC 2012 i686 
GNU/Linux
  Linux archie 3.6.0-rc2-DEB #175 SMP PREEMPT Fri Aug 17 05:00:46 IST 2012 
x86_64 GNU/Linux

Googling turned up this patch, which may be relevant (but I am not
sufficiently familiar with the ext3/ext4 code to really know):

http://osdir.com/ml/file-systems.ext4/2008-02/msg00128.html

To run the test program, just `gcc ext4bug-minimal.c && ./a.out`. After crash,
the last record in each file can be compared with eg.

tail -1 append.log
perl -lne '$x = $_ if $_ > $x; END {print $x}' inplace.log 

Please let me know if there is any additional information that I can supply to
help.

(BTW, the problem is fixed/worked-around in MariaDB by using fsync() instead
of fdatasync() when a file is extended.)

 - Kristian.

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define BLOCKSIZE 4096
#define RECSIZE 8
#define COUNT 100

static void
do_test(int append_fd, int inplace_fd, int use_fdatasync)
{
  char buf[RECSIZE+1];
  char buf2[BLOCKSIZE];
  int seq;
  int res;

  memset(buf2, 0, sizeof(buf2));
  for (seq = 0; seq < COUNT; ++seq)
  {
printf("Adding log entry %d...\n", seq);
snprintf(buf, sizeof(buf), "%7d\n", seq);

res = write(append_fd, buf, RECSIZE);
if (res != RECSIZE)
{
  fprintf(stderr, "Unexpected return %d from write() (errno=%d: %s)\n",
  res, errno, strerror(errno));
  exit(1);
}
res = use_fdatasync ? fdatasync(append_fd) : fsync(append_fd);
if (res)
{
  fprintf(stderr, "Error in fdatasync() of append file (errno=%d: %s)\n",
  errno, strerror(errno));
  exit(1);
}

memcpy(buf2 + (seq * RECSIZE) % BLOCKSIZE, buf, RECSIZE);
res = pwrite(inplace_fd, buf2, BLOCKSIZE, 0);
if (res != BLOCKSIZE)
{
  fprintf(stderr, "Unexpected return %d from pwrite() (errno=%d: %s)\n",
  res, errno, strerror(errno));
  exit(1);
}
res = fdatasync(inplace_fd);
if (res)
{
  fprintf(stderr, "Error in fdatasync() of inplace file (errno=%d: %s)\n",
  errno, strerror(errno));
  exit(1);
}
  }
}

int
main(int argc, char *argv[])
{
  int append_fd, inplace_fd;
  int res;
  int use_fdatasync = 1;
  unsigned char buf[BLOCKSIZE];

  if (argc > 1 && !strcmp("--fsync", argv[1]))
use_fdatasync = 0;

  append_fd = open("append.log", O_WRONLY|O_CREAT|O_TRUNC, 0666);
  if (append_fd < 0)
  {
fprintf(stderr, "Failed to open 

[ext3/4] PROBLEM: fdatasync not syncing appended data (w/test program)

2012-09-03 Thread Kristian Nielsen
It appears that ext3 and ext4 fdatasync() does not fully sync data to
disk. Specifically, when new data is written at the end (so that the file
length is increased), not all of the new data is synced by fdatasync().

It looks as if the problem is when the new data fits in the last allocated
page for the file - then ext3/4 flushes the new data page to disk, but _not_
the new metadata with the longer file size.

This seems in violation of how fdatasync() is supposed to perform, ie. from
the Linux man page:

  fdatasync() ... does not flush modified metadata unless that metadata is
  needed in order to allow a subsequent data retrieval to be correctly
  handled ...

  For example, changes to st_atime or st_mtime ... do not require flushing ...

  On the other hand, a change to the file size (st_size, as made by say
  ftruncate(2)), would require a metadata flush.

It also causes real problems - it was found because it causes silent data
corruption on ext3/ext4 of MySQL/MariaDB in case of a crash. XFS appears not
to be affected.

To reproduce, I run code like this (full test program attached) with disk
cache disabled, and pull the power plug:

  for (seq = 0; seq  COUNT; ++seq)
  {
snprintf(buf, sizeof(buf), %7d\n, seq);
write(append_fd, buf, RECSIZE);
fdatasync(append_fd);
memcpy(buf2 + (seq * RECSIZE) % BLOCKSIZE, buf, RECSIZE);
pwrite(inplace_fd, buf2, BLOCKSIZE, 0);
fdatasync(inplace_fd);
  }

The append_fd is a new file that is appended to in each write; the inplace_fd
is a pre-allocated file that does not change size. After a crash the two files
should always be in sync (or append_fd can be one record ahead). But on ext3
and ext4, append_fd typically lacks several records. This does not happen on
XFS, or if fdatasync(append_fd) is replaced with fsync(append_fd).

I also tested this in kvm, and used strace on the host process to see exactly
which writes the guest does. Each fdatasync(append_fd) causes a flush to disk
of the data page of append_fd, but not of the inode (or journal) page. So even
if fdatasync() is by design not suitable for appending data to a file, there
still seems to be a bug, though then a performance rather than a correctness
bug: there is no reason to flush the new data block if the increased file
length is not flushed also.

I tested the following kernels, all show the problem:

  Linux vm-lucid-amd64 2.6.32-39-server #86-Ubuntu SMP Mon Feb 13 23:15:11 UTC 
2012 x86_64 GNU/Linux
  Linux thor 2.6.32-42-generic #95-Ubuntu SMP Wed Jul 25 15:57:54 UTC 2012 i686 
GNU/Linux
  Linux archie 3.6.0-rc2-DEB #175 SMP PREEMPT Fri Aug 17 05:00:46 IST 2012 
x86_64 GNU/Linux

Googling turned up this patch, which may be relevant (but I am not
sufficiently familiar with the ext3/ext4 code to really know):

http://osdir.com/ml/file-systems.ext4/2008-02/msg00128.html

To run the test program, just `gcc ext4bug-minimal.c  ./a.out`. After crash,
the last record in each file can be compared with eg.

tail -1 append.log
perl -lne '$x = $_ if $_  $x; END {print $x}' inplace.log 

Please let me know if there is any additional information that I can supply to
help.

(BTW, the problem is fixed/worked-around in MariaDB by using fsync() instead
of fdatasync() when a file is extended.)

 - Kristian.

#include string.h
#include stdlib.h
#include stdio.h
#include errno.h
#include unistd.h
#include sys/types.h
#include sys/stat.h
#include fcntl.h

#define BLOCKSIZE 4096
#define RECSIZE 8
#define COUNT 100

static void
do_test(int append_fd, int inplace_fd, int use_fdatasync)
{
  char buf[RECSIZE+1];
  char buf2[BLOCKSIZE];
  int seq;
  int res;

  memset(buf2, 0, sizeof(buf2));
  for (seq = 0; seq  COUNT; ++seq)
  {
printf(Adding log entry %d...\n, seq);
snprintf(buf, sizeof(buf), %7d\n, seq);

res = write(append_fd, buf, RECSIZE);
if (res != RECSIZE)
{
  fprintf(stderr, Unexpected return %d from write() (errno=%d: %s)\n,
  res, errno, strerror(errno));
  exit(1);
}
res = use_fdatasync ? fdatasync(append_fd) : fsync(append_fd);
if (res)
{
  fprintf(stderr, Error in fdatasync() of append file (errno=%d: %s)\n,
  errno, strerror(errno));
  exit(1);
}

memcpy(buf2 + (seq * RECSIZE) % BLOCKSIZE, buf, RECSIZE);
res = pwrite(inplace_fd, buf2, BLOCKSIZE, 0);
if (res != BLOCKSIZE)
{
  fprintf(stderr, Unexpected return %d from pwrite() (errno=%d: %s)\n,
  res, errno, strerror(errno));
  exit(1);
}
res = fdatasync(inplace_fd);
if (res)
{
  fprintf(stderr, Error in fdatasync() of inplace file (errno=%d: %s)\n,
  errno, strerror(errno));
  exit(1);
}
  }
}

int
main(int argc, char *argv[])
{
  int append_fd, inplace_fd;
  int res;
  int use_fdatasync = 1;
  unsigned char buf[BLOCKSIZE];

  if (argc  1  !strcmp(--fsync, argv[1]))
use_fdatasync = 0;

  append_fd = open(append.log, O_WRONLY|O_CREAT|O_TRUNC, 0666);
  if (append_fd  0)