On 04/24/2012 12:16 AM, Avi Kivity wrote:
> Using RCU for lockless shadow walking can increase the amount of memory
> in use by the system, since RCU grace periods are unpredictable. We also
> have an unconditional write to a shared variable (reader_counter), which
> isn't good for scaling.
>
> Replace that with a scheme similar to x86's get_user_pages_fast(): disable
> interrupts during lockless shadow walk to force the freer
> (kvm_mmu_commit_zap_page()) to wait for the TLB flush IPI to find the
> processor with interrupts enabled.
>
> We also add a new vcpu->mode, READING_SHADOW_PAGE_TABLES, to prevent
> kvm_flush_remote_tlbs() from avoiding the IPI.
>
> Signed-off-by: Avi Kivity <[email protected]>
> ---
>
> Turned out to be simpler than expected. However, I think there's a problem
> with make_all_cpus_request() possibly reading an incorrect vcpu->cpu.
It seems possible.
Can we fix it by reading vcpu->cpu only when the vcpu is in IN_GUEST_MODE or
EXITING_GUEST_MODE (IIRC, interrupts are disabled in these modes)?
Something like:
	if (kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
		cpumask_set_cpu(vcpu->cpu, cpus);
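To make the idea concrete, here is a minimal userspace sketch of the filtering, not kernel code. The struct, the 64-bit mask and the function name are hypothetical stand-ins for struct kvm_vcpu, the cpumask and the loop in make_all_cpus_request(). It also leaves out the cmpxchg that kvm_vcpu_exiting_guest_mode() performs, and it assumes READING_SHADOW_PAGE_TABLES should be treated like the in-guest modes:

```c
#include <stdint.h>

/* Mode names follow the patch; the values here are only for this model. */
enum vcpu_mode {
	OUTSIDE_GUEST_MODE,
	IN_GUEST_MODE,
	EXITING_GUEST_MODE,
	READING_SHADOW_PAGE_TABLES,
};

/* Hypothetical stand-in for the two struct kvm_vcpu fields of interest. */
struct model_vcpu {
	enum vcpu_mode mode;
	int cpu;	/* only trusted while mode != OUTSIDE_GUEST_MODE */
};

/*
 * Build a bitmask of CPUs that need an IPI. vcpu->cpu is read only for
 * vcpus that are not OUTSIDE_GUEST_MODE, which is the proposed fix.
 */
static uint64_t model_collect_ipi_targets(const struct model_vcpu *vcpus, int n)
{
	uint64_t cpus = 0;

	for (int i = 0; i < n; i++) {
		if (vcpus[i].mode == OUTSIDE_GUEST_MODE)
			continue;	/* vcpu->cpu may be stale; skip it */
		if (vcpus[i].cpu >= 0 && vcpus[i].cpu < 64)
			cpus |= UINT64_C(1) << vcpus[i].cpu;
	}
	return cpus;
}
```

With three vcpus in OUTSIDE_GUEST_MODE, IN_GUEST_MODE and READING_SHADOW_PAGE_TABLES, only the CPUs of the last two end up in the mask.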
>
> arch/x86/include/asm/kvm_host.h | 4 ---
> arch/x86/kvm/mmu.c | 61 +++++++++++----------------------------
> include/linux/kvm_host.h | 3 +-
> 3 files changed, 19 insertions(+), 49 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f624ca7..67e66e6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -237,8 +237,6 @@ struct kvm_mmu_page {
> #endif
>
> int write_flooding_count;
> -
> - struct rcu_head rcu;
> };
>
> struct kvm_pio_request {
> @@ -536,8 +534,6 @@ struct kvm_arch {
> u64 hv_guest_os_id;
> u64 hv_hypercall;
>
> - atomic_t reader_counter;
> -
> #ifdef CONFIG_KVM_MMU_AUDIT
> int audit_point;
> #endif
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 07424cf..903af5e 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -551,19 +551,23 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
>
> static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
> {
> - rcu_read_lock();
> - atomic_inc(&vcpu->kvm->arch.reader_counter);
> -
> - /* Increase the counter before walking shadow page table */
> - smp_mb__after_atomic_inc();
> + /*
> + * Prevent page table teardown by making any free-er wait during
> + * kvm_flush_remote_tlbs() IPI to all active vcpus.
> + */
> + local_irq_disable();
> + vcpu->mode = READING_SHADOW_PAGE_TABLES;
> + /*
> + * wmb: advertise vcpu->mode change
> + * rmb: make sure we see updated sptes
> + */
> + smp_mb();
> }
>
> static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> {
> - /* Decrease the counter after walking shadow page table finished */
> - smp_mb__before_atomic_dec();
> - atomic_dec(&vcpu->kvm->arch.reader_counter);
> - rcu_read_unlock();
Do we need an mb here, so that the store to vcpu->mode cannot be reordered
ahead of the preceding spte reads/writes? (It is safe on x86, but it needs a
comment at least.)

Otherwise it looks good to me; I will measure it later.
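
For illustration, the ordering I have in mind looks roughly like the userspace model below. It uses C11 atomics, with a seq_cst fence standing in for smp_mb(). The walk_* names are hypothetical, and the local_irq_disable()/local_irq_enable() calls of the real functions have no equivalent here, so this is only a sketch of where the barriers sit:

```c
#include <stdatomic.h>

/* Hypothetical model of the two vcpu->mode values involved in the walk. */
enum walk_mode {
	WALK_OUTSIDE_GUEST_MODE,
	WALK_READING_SHADOW_PAGE_TABLES,
};

struct walk_vcpu {
	atomic_int mode;
};

static void model_walk_begin(struct walk_vcpu *v)
{
	/* The kernel version disables local interrupts before this store. */
	atomic_store_explicit(&v->mode, WALK_READING_SHADOW_PAGE_TABLES,
			      memory_order_relaxed);
	/* Advertise the mode change before any spte is read. */
	atomic_thread_fence(memory_order_seq_cst);
}

static void model_walk_end(struct walk_vcpu *v)
{
	/*
	 * Keep the mode store below from being reordered ahead of the
	 * spte reads/writes done during the walk.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&v->mode, WALK_OUTSIDE_GUEST_MODE,
			      memory_order_relaxed);
	/* The kernel version re-enables local interrupts after this store. */
}
```

The fence in model_walk_end() is the one I am asking about; without it, a weakly ordered CPU could let the freer see the vcpu as outside the walk while spte accesses are still in flight.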