On Fri, Feb 10, 2012 at 03:28:31PM +0900, Takuya Yoshikawa wrote:
> Other threads may process the same page in that small window and skip
> TLB flush and then return before these functions do flush.

It's correct to flush the shadow MMU TLB without holding the mmu_lock
only in the context of the mmu notifier methods. So while the change
below won't hurt correctness, it's a performance regression and
shouldn't be applied (and it obfuscates the code by no longer being
strict).

On the contrary, every other place that does a shadow/secondary MMU smp
TLB flush _must_ do it inside the mmu_lock, otherwise the serialization
against the mmu_lock taken by
kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start
in the quoted patch below is no longer correct.

The explanation is in commit 4539b35881ae9664b0e2953438dd83f5ee02c0b4.

I'll try to explain it more clearly: the moment you drop the mmu_lock,
pages can be freed. So if you invalidate a spte anywhere in the KVM
code (except in the mmu notifier methods, where a reference on the page
is implicitly held by the caller, so by design the page can't go away
under a mmu notifier method), then
kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start
below won't get their need_tlb_flush set anymore, and they won't run
the TLB flush before freeing the page.

So every other place (except the mmu notifier methods) must flush the
secondary MMU smp TLB _before_ releasing the mmu_lock.

Only the mmu notifier methods can safely flush the secondary MMU TLB
_after_ releasing the mmu_lock.

If we changed the mmu notifier methods to flush the shadow TLB
unconditionally (regardless of whether a spte was present), we might
not need to hold the mmu_lock for every TLB flush outside the context
of the mmu notifier methods. But then the mmu notifier methods would
become more expensive, and I haven't fully evaluated the side effects
of such a change. Also note, only kvm_mmu_notifier_invalidate_page and
kvm_mmu_notifier_invalidate_range_start would need to do that, because
they're the only two where the page reference gets dropped.

Even shorter: in the mmu notifier methods an implicit reference on the
page exists and is held by the caller, so they can flush outside the
mmu_lock. Everywhere else in KVM, an implicit valid reference on the
page is held only as long as you hold the mmu_lock, or while a spte is
still established.

Well it's not easy logic so it's not surprising it wasn't totally
clear.

It's probably not heavily documented, and it's still good that you
raised it so we refresh our minds on the exact rules of mmu notifier
locking, thanks!

Andrea

> 
> Signed-off-by: Takuya Yoshikawa <yoshikawa.tak...@oss.ntt.co.jp>
> ---
>  virt/kvm/kvm_main.c |   19 ++++++++++---------
>  1 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 470e305..2b4bc77 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -289,15 +289,15 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>        */
>       idx = srcu_read_lock(&kvm->srcu);
>       spin_lock(&kvm->mmu_lock);
> +
>       kvm->mmu_notifier_seq++;
>       need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
> -     spin_unlock(&kvm->mmu_lock);
> -     srcu_read_unlock(&kvm->srcu, idx);
> -
>       /* we've to flush the tlb before the pages can be freed */
>       if (need_tlb_flush)
>               kvm_flush_remote_tlbs(kvm);
>  
> +     spin_unlock(&kvm->mmu_lock);
> +     srcu_read_unlock(&kvm->srcu, idx);
>  }
>  
>  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> @@ -335,12 +335,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>       for (; start < end; start += PAGE_SIZE)
>               need_tlb_flush |= kvm_unmap_hva(kvm, start);
>       need_tlb_flush |= kvm->tlbs_dirty;
> -     spin_unlock(&kvm->mmu_lock);
> -     srcu_read_unlock(&kvm->srcu, idx);
> -
>       /* we've to flush the tlb before the pages can be freed */
>       if (need_tlb_flush)
>               kvm_flush_remote_tlbs(kvm);
> +
> +     spin_unlock(&kvm->mmu_lock);
> +     srcu_read_unlock(&kvm->srcu, idx);
>  }
>  
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> @@ -378,13 +378,14 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
>  
>       idx = srcu_read_lock(&kvm->srcu);
>       spin_lock(&kvm->mmu_lock);
> -     young = kvm_age_hva(kvm, address);
> -     spin_unlock(&kvm->mmu_lock);
> -     srcu_read_unlock(&kvm->srcu, idx);
>  
> +     young = kvm_age_hva(kvm, address);
>       if (young)
>               kvm_flush_remote_tlbs(kvm);
>  
> +     spin_unlock(&kvm->mmu_lock);
> +     srcu_read_unlock(&kvm->srcu, idx);
> +
>       return young;
>  }
>  
> -- 
> 1.7.5.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html