On Wed, Dec 24, 2008 at 04:28:44PM +0100, Andrea Arcangeli wrote:
> On Wed, Dec 24, 2008 at 02:50:57PM +0200, Avi Kivity wrote:
> > Marcelo Tosatti wrote:
> >> The destructor for huge pages uses the backing inode for adjusting
> >> hugetlbfs accounting.
> >>
> >> Hugepage mappings are destroyed by exit_mmap, after
> >> mmu_notifier_release, so there are no notifications through
> >> unmap_hugepage_range at this point.
> >>
> >> The hugetlbfs inode can be freed with pages backed by it referenced
> >> by the shadow. When the shadow releases its reference, the huge page
> >> destructor will access a now freed inode.
> >>
> >> Implement the release operation for kvm mmu notifiers to release page
> >> refs before the hugetlbfs inode is gone.
> >>
> >>
> >
> > I see this isn't it. Andrea, comments?
>
> Yeah, the patch looks good, I talked a bit with Marcelo about this by
> PM. The issue is that it's not as straightforward as it seems,
> basically when I implemented the ->release handlers and had sptes
> teardown running before the files were closed (instead of waiting the
> kvm anon inode release handler to fire) I was getting bugchecks from
> debug options including preempt=y (certain debug checks only become
> functional with preempt enabled unfortunately), so eventually I
> removed ->release because for kvm ->release wasn't useful because no
> guest mode can run any more by the time mmu notifier ->release is
> invoked, and that avoided the issues with the bugchecks.
>
> We'll be using the mmu notifiers ->release because it's always called
> just before the filehandles are destroyed, it's not really about the
> guest mode or secondary mmu but just an ordering issue with hugetlbfs
> internals.
>
> So in short if no bugcheck triggers this is fine (at least until
> hugetlbfs provides a way to register some callback to invoke at the
> start of the hugetlbfs->release handler).

The only bugcheck I see, which triggers on vanilla kvm upstream with
CONFIG_DEBUG_PREEMPT=y and CONFIG_PREEMPT_RCU=y, is:

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
<4>ttyS1: 1 input overrun(s)
last sysfs file: /sys/class/net/tap0/address
CPU 0
Modules linked in: tun ipt_MASQUERADE iptable_nat nf_nat bridge stp llc
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT
iptable_filter ip_tables x_tables dm_multipath kvm_intel kvm scsi_wait_scan
ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod
shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd
ehci_hcd
Pid: 4768, comm: qemu-system-x86 Not tainted 2.6.28-00165-g4f27e3e-dirty #164
RIP: 0010:[<ffffffff8028a5b6>] [<ffffffff8028a5b6>]
__purge_vmap_area_lazy+0x12c/0x163
RSP: 0018:ffff88021e1f9a38 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000003
RDX: ffffffff80a1dae0 RSI: ffff880028083980 RDI: 0000000000000001
RBP: ffff88021e1f9a78 R08: 0000000000000286 R09: ffffffff80a1bf50
R10: ffff880119c270f8 R11: ffff88021e1f99b8 R12: ffff88021e1f9a38
R13: ffff88021e1f9a90 R14: ffff88021e1f9a98 R15: 000000000000813a
FS: 0000000000000000(0000) GS:ffffffff8080d900(0000) knlGS:0000000000000000
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000008d9828 CR3: 0000000000201000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 4768, threadinfo ffff88021e1f8000, task
ffff880119c270f8)
Stack:
ffff88022bdfd840 ffff880119da11b8 ffffc20011c30000 000000000000813a
0000000000000000 0000000000000001 ffff88022ec11c18 ffff88022f061838
ffff88021e1f9aa8 ffffffff8028ab1d ffff88021e1f9aa8 ffffc20021976000
Call Trace:
[<ffffffff8028ab1d>] free_unmap_vmap_area_noflush+0x69/0x70
[<ffffffff8028ab49>] remove_vm_area+0x25/0x71
[<ffffffff8028ac54>] __vunmap+0x3a/0xca
[<ffffffff8028ad35>] vfree+0x29/0x2b
[<ffffffffa00f98a3>] kvm_free_physmem_slot+0x25/0x7c [kvm]
[<ffffffffa00f9d75>] kvm_free_physmem+0x27/0x36 [kvm]
[<ffffffffa00fccb4>] kvm_arch_destroy_vm+0xa6/0xda [kvm]
[<ffffffffa00f9e11>] kvm_put_kvm+0x8d/0xa7 [kvm]
[<ffffffffa00fa0e2>] kvm_vcpu_release+0x13/0x17 [kvm]
[<ffffffff802a1c07>] __fput+0xeb/0x1a3
[<ffffffff802a1cd4>] fput+0x15/0x17
[<ffffffff8029f26c>] filp_close+0x67/0x72
[<ffffffff802378a8>] put_files_struct+0x74/0xc8
[<ffffffff80237943>] exit_files+0x47/0x4f
[<ffffffff80238fe5>] do_exit+0x1eb/0x7a7
[<ffffffff80587edf>] ? _spin_unlock_irq+0x2b/0x51
[<ffffffff80239614>] do_group_exit+0x73/0xa0
[<ffffffff80242b10>] get_signal_to_deliver+0x30c/0x32c
[<ffffffff8020b4d5>] ? sysret_signal+0x19/0x29
[<ffffffff8020a80f>] do_notify_resume+0x8c/0x851
[<ffffffff8025b811>] ? do_futex+0x90/0x92a
[<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
[<ffffffff80587f51>] ? _spin_unlock_irqrestore+0x4c/0x68
[<ffffffff8026be5c>] ? __rcu_read_unlock+0x92/0x9e
[<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
[<ffffffff80256c08>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff8024f300>] ? getnstimeofday+0x3a/0x96
[<ffffffff8024c4f0>] ? ktime_get_ts+0x49/0x4e
[<ffffffff8020b4c1>] ? sysret_signal+0x5/0x29
[<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
[<ffffffff8020b4d5>] ? sysret_signal+0x19/0x29
[<ffffffff8020b7b7>] ptregscall_common+0x67/0xb0
Code: 46 48 c7 c7 c0 d1 74 80 4c 8d 65 c0 e8 0c db 2f 00 48 8b 45 c0 48 8d 58
c0 eb 10 48 89 df e8 74 fe ff ff 48 8b 43 40 48 8d 58 c0 <48> 8b 43 40 0f 18 08
48 8d 43 40 4c 39 e0 75 e0 48 c7 c7 c0 d1
RIP [<ffffffff8028a5b6>] __purge_vmap_area_lazy+0x12c/0x163
RSP <ffff88021e1f9a38>
---[ end trace fde3e64ebe4bbca2 ]---
Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: qemu-system-x86/4768/0x00000003
Modules linked in: tun ipt_MASQUERADE iptable_nat nf_nat bridge stp llc
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT
iptable_filter ip_tables x_tables dm_multipath kvm_intel kvm scsi_wait_scan
ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod
shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd
ehci_hcd
Pid: 4768, comm: qemu-system-x86 Tainted: G D
2.6.28-00165-g4f27e3e-dirty #164
Call Trace:
[<ffffffff8025585e>] ? __debug_show_held_locks+0x1b/0x24
[<ffffffff8023187b>] __schedule_bug+0x8c/0x95
[<ffffffff805851e1>] schedule+0xd3/0x902
[<ffffffff80256c08>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff8037a938>] ? put_io_context+0x67/0x72
[<ffffffff80238ed4>] do_exit+0xda/0x7a7
[<ffffffff805892c9>] oops_begin+0x0/0x90
[<ffffffff8020e3c9>] die+0x5d/0x66
[<ffffffff80588ff7>] do_general_protection+0x128/0x130
[<ffffffff80588ecf>] ? do_general_protection+0x0/0x130
[<ffffffff80588702>] error_exit+0x0/0xa9
[<ffffffff8028a5b6>] ? __purge_vmap_area_lazy+0x12c/0x163
[<ffffffff8028a5ae>] ? __purge_vmap_area_lazy+0x124/0x163
[<ffffffff8028ab1d>] free_unmap_vmap_area_noflush+0x69/0x70
[<ffffffff8028ab49>] remove_vm_area+0x25/0x71
[<ffffffff8028ac54>] __vunmap+0x3a/0xca
[<ffffffff8028ad35>] vfree+0x29/0x2b
[<ffffffffa00f98a3>] kvm_free_physmem_slot+0x25/0x7c [kvm]
[<ffffffffa00f9d75>] kvm_free_physmem+0x27/0x36 [kvm]
[<ffffffffa00fccb4>] kvm_arch_destroy_vm+0xa6/0xda [kvm]
[<ffffffffa00f9e11>] kvm_put_kvm+0x8d/0xa7 [kvm]
[<ffffffffa00fa0e2>] kvm_vcpu_release+0x13/0x17 [kvm]
[<ffffffff802a1c07>] __fput+0xeb/0x1a3
[<ffffffff802a1cd4>] fput+0x15/0x17
[<ffffffff8029f26c>] filp_close+0x67/0x72
[<ffffffff802378a8>] put_files_struct+0x74/0xc8
[<ffffffff80237943>] exit_files+0x47/0x4f
[<ffffffff80238fe5>] do_exit+0x1eb/0x7a7
[<ffffffff80587edf>] ? _spin_unlock_irq+0x2b/0x51
[<ffffffff80239614>] do_group_exit+0x73/0xa0
[<ffffffff80242b10>] get_signal_to_deliver+0x30c/0x32c
[<ffffffff8020b4d5>] ? sysret_signal+0x19/0x29
[<ffffffff8020a80f>] do_notify_resume+0x8c/0x851
[<ffffffff8025b811>] ? do_futex+0x90/0x92a
[<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
[<ffffffff80587f51>] ? _spin_unlock_irqrestore+0x4c/0x68
[<ffffffff8026be5c>] ? __rcu_read_unlock+0x92/0x9e
[<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
[<ffffffff80256c08>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff8024f300>] ? getnstimeofday+0x3a/0x96
[<ffffffff8024c4f0>] ? ktime_get_ts+0x49/0x4e
[<ffffffff8020b4c1>] ? sysret_signal+0x5/0x29
[<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
[<ffffffff8020b4d5>] ? sysret_signal+0x19/0x29
[<ffffffff8020b7b7>] ptregscall_common+0x67/0xb0
ttyS1: 26 input overrun(s)

And it's not specific to the VM shutdown path. Another instance:

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/class/net/tap0/address
CPU 5
Modules linked in: ipt_REJECT xt_state xt_tcpudp iptable_filter ipt_MASQUERADE
iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables
x_tables tun kvm_intel kvm bridge stp llc dm_multipath scsi_wait_scan ata_piix
libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp
pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd
ehci_hcd [last unloaded: x_tables]
Pid: 4440, comm: qemu-system-x86 Not tainted 2.6.28-00165-g4f27e3e-dirty #163
RIP: 0010:[<ffffffff8028a5b6>] [<ffffffff8028a5b6>]
__purge_vmap_area_lazy+0x12c/0x163
RSP: 0018:ffff88011f4c7be8 EFLAGS: 00010246
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000003
RDX: ffffffff80a1dae0 RSI: ffff880028083980 RDI: 0000000000000001
RBP: ffff88011f4c7c28 R08: 0000000000000282 R09: ffffffff80a1bf50
R10: ffff88022e9dc0f8 R11: ffff88011f4c7b68 R12: ffff88011f4c7be8
R13: ffff88011f4c7c40 R14: ffff88011f4c7c48 R15: 0000000000008001
FS: 0000000040abf950(0063) GS:ffff88022f25ed18(0000) knlGS:0000000000000000
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000229d34000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 4440, threadinfo ffff88011f4c6000, task
ffff88022e9dc0f8)
Stack:
ffff8802291a14b0 ffff880229003d58 ffffc20021000000 0000000000008001
ffff880229526000 0000000000000000 ffff88022d073000 ffff88011f58c0c0
ffff88011f4c7c58 ffffffff8028ab1d ffff88011f4c7c58 ffffffffa015c000
Call Trace:
[<ffffffff8028ab1d>] free_unmap_vmap_area_noflush+0x69/0x70
[<ffffffff8028ab49>] remove_vm_area+0x25/0x71
[<ffffffff8028ac54>] __vunmap+0x3a/0xca
[<ffffffff8028ad0a>] vunmap+0x26/0x28
[<ffffffffa01be092>] pio_copy_data+0xcf/0x113 [kvm]
[<ffffffff80256c08>] ? trace_hardirqs_on+0xd/0xf
[<ffffffffa01be16f>] complete_pio+0x99/0x1ef [kvm]
[<ffffffff8023fcd2>] ? sigprocmask+0xc6/0xd0
[<ffffffffa01c0295>] kvm_arch_vcpu_ioctl_run+0x9a/0x889 [kvm]
[<ffffffffa01b84f4>] kvm_vcpu_ioctl+0xfc/0x48b [kvm]
[<ffffffff802ac760>] vfs_ioctl+0x2a/0x78
[<ffffffff8026be5c>] ? __rcu_read_unlock+0x92/0x9e
[<ffffffff802acb46>] do_vfs_ioctl+0x398/0x3c6
[<ffffffff80256c08>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff802acbb6>] sys_ioctl+0x42/0x65
[<ffffffff8020b43b>] system_call_fastpath+0x16/0x1b
Code: 46 48 c7 c7 c0 d1 74 80 4c 8d 65 c0 e8 8c da 2f 00 48 8b 45 c0 48 8d 58
c0 eb 10 48 89 df e8 74 fe ff ff 48 8b 43 40 48 8d 58 c0 <48> 8b 43 40 0f 18 08
48 8d 43 40 4c 39 e0 75 e0 48 c7 c7 c0 d1
RIP [<ffffffff8028a5b6>] __purge_vmap_area_lazy+0x12c/0x163
RSP <ffff88011f4c7be8>
---[ end trace 31811279a2e983e8 ]---
note: qemu-system-x86[4440] exited with preempt_count 2
(gdb) l *(__purge_vmap_area_lazy + 0x12c)
0xffffffff80289ca2 is in __purge_vmap_area_lazy (mm/vmalloc.c:516).
511             if (nr || force_flush)
512                     flush_tlb_kernel_range(*start, *end);
513
514             if (nr) {
515                     spin_lock(&vmap_area_lock);
516                     list_for_each_entry(va, &valist, purge_list)
517                             __free_vmap_area(va);
518                     spin_unlock(&vmap_area_lock);
519             }
520             spin_unlock(&purge_lock);
0xffffffff80289c9a <__purge_vmap_area_lazy+292>: mov 0x40(%rbx),%rax
0xffffffff80289c9e <__purge_vmap_area_lazy+296>: lea -0x40(%rax),%rbx
0xffffffff80289ca2 <__purge_vmap_area_lazy+300>: mov 0x40(%rbx),%rax
^^^^^^^^^^^^^^^^^^^
0xffffffff80289ca6 <__purge_vmap_area_lazy+304>: prefetcht0 (%rax)
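
For reference, the register values in both oopses line up with the
disassembly. RAX is the 0x6b byte repeated, which is the slab free poison
(POISON_FREE), and RBX is that value minus 0x40, i.e. what the list iterator
computes when it turns a poisoned ->next pointer back into an entry. A small
userspace sketch of the arithmetic follows. The helper names are made up for
illustration, and the 0x40 offset of purge_list is read off the disassembly
rather than taken from the struct definition:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte the slab debug code writes into freed objects (POISON_FREE). */
#define POISON_FREE_BYTE 0x6b

/* Offset of purge_list inside struct vmap_area, inferred from
 * "mov 0x40(%rbx),%rax" / "lea -0x40(%rax),%rbx" above (an assumption). */
#define PURGE_LIST_OFFSET 0x40

/* Hypothetical helper: a 64-bit word filled with the poison byte. */
static uint64_t poison_word(void)
{
    uint64_t w;

    memset(&w, POISON_FREE_BYTE, sizeof(w));
    return w;
}

/* Hypothetical helper: the entry pointer the iterator derives from a
 * poisoned ->next, i.e. the poisoned value minus the member offset. */
static uint64_t entry_from_poisoned_next(void)
{
    return poison_word() - PURGE_LIST_OFFSET;
}

/* Compare against the values printed in the oops. */
static void check_against_oops(void)
{
    assert(poison_word() == UINT64_C(0x6b6b6b6b6b6b6b6b));              /* RAX */
    assert(entry_from_poisoned_next() == UINT64_C(0x6b6b6b6b6b6b6b2b)); /* RBX */
}
```

The faulting instruction then loads from RBX + 0x40, which is again
0x6b6b6b6b6b6b6b6b. That is a non-canonical address on x86-64, so the access
raises a general protection fault rather than a page fault.
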
The fault vanishes once PREEMPT_RCU is disabled.

Nick, any ideas? KVM does not make direct use of RCU. The same issue happens
if the entire __purge_vmap_area_lazy runs with vmap_area_lock held.
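
To illustrate why holding the lock across the whole function might not
matter: if an entry's memory can be recycled before the iterator reads its
->next, the walk trips over poison regardless of locking. I have not
confirmed that this is what happens here, so treat it as an assumption. The
userspace sketch below models "free" as an immediate poison fill, and all
names in it (struct area, fake_free, purge_plain, purge_safe) are
hypothetical stand-ins, not kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POISON_FREE_BYTE 0x6b

struct list_head { struct list_head *next, *prev; };

/* Hypothetical stand-in for struct vmap_area. */
struct area { int id; struct list_head purge_list; };

#define area_of(p) \
    ((struct area *)((char *)(p) - offsetof(struct area, purge_list)))

static void add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Model an immediate free: overwrite the object with the poison byte. */
static void fake_free(struct area *a)
{
    memset(a, POISON_FREE_BYTE, sizeof(*a));
}

static uintptr_t poison_ptr(void)
{
    uintptr_t w;

    memset(&w, POISON_FREE_BYTE, sizeof(w));
    return w;
}

/* Plain walk: ->next is read after the entry was released.
 * Returns -1 when it would follow a poisoned pointer. */
static int purge_plain(struct list_head *head)
{
    struct list_head *pos = head->next;
    int n = 0;

    while (pos != head) {
        uintptr_t next;

        fake_free(area_of(pos));
        n++;
        memcpy(&next, &pos->next, sizeof(next)); /* read after release */
        if (next == poison_ptr())
            return -1;                           /* the kernel would fault here */
        pos = (struct list_head *)next;
    }
    return n;
}

/* Safe walk: the successor is cached before the entry is released. */
static int purge_safe(struct list_head *head)
{
    struct list_head *pos = head->next, *next;
    int n = 0;

    while (pos != head) {
        next = pos->next;
        fake_free(area_of(pos));
        n++;
        pos = next;
    }
    return n;
}

/* Build a three-entry list and run one of the walks over it. */
static int run_demo(int (*walk)(struct list_head *))
{
    struct list_head head = { &head, &head };
    struct area pool[3];
    int i;

    for (i = 0; i < 3; i++) {
        pool[i].id = i;
        add_tail(&pool[i].purge_list, &head);
    }
    return walk(&head);
}
```

In this model run_demo(purge_plain) stops at the first entry with -1, while
run_demo(purge_safe) releases all three entries. Taking a lock around either
loop would not change the outcome.
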