On Fri, Aug 10, 2012 at 10:34:39AM +1000, Paul Mackerras wrote:
> On Thu, Aug 09, 2012 at 03:16:12PM -0300, Marcelo Tosatti wrote:
>
> > The !memslot->npages case is handled in __kvm_set_memory_region
> > (please read that part, before kvm_arch_prepare_memory_region() call).
> >
> > kvm_arch_flush_shadow should be implemented.
>
> Book3S HV doesn't have shadow page tables per se, rather the hardware
> page table is under the control of the hypervisor (i.e. KVM), and
> entries are added and removed by the guest using hypercalls. On
> recent machines (POWER7) the hypervisor can choose whether or not to
> have the hardware PTE point to a real page of memory; if it doesn't,
> access by the guest will trap to the hypervisor. On older machines
> (PPC970) we don't have that flexibility, and we have to provide a real
> page of memory (i.e. RAM or I/O) behind every hardware PTE. (This is
> because PPC970 provides no way for page faults in the guest to go to
> the hypervisor.)
>
> I could implement kvm_arch_flush_shadow to remove the backing pages
> behind every hardware PTE, but that would be very slow and inefficient
> on POWER7, and would break the guest on PPC970, particularly in the
> case where userspace is removing a small memory slot containing some
> I/O device and leaving the memory slot for system RAM untouched.
>
> So the reason for unmapping the hardware PTEs in
> kvm_arch_prepare_memory_region rather than kvm_arch_flush_shadow is
> that that way we know which memslot is going away.
>
> What exactly are the semantics of kvm_arch_flush_shadow?
It removes all translations mapped via memslots. It's used in cases where
the translations become stale, or during shutdown.
> I presume that on x86 with NPT/EPT it basically does nothing - is that right?
It does do something: it removes all NPT/EPT ptes (named "sptes" in
arch/x86/kvm/). The translations are rebuilt on demand (when accesses
by the guest fault into the HV).
> > > + if (old->npages) {
> > > + /* modifying guest_phys or flags */
> > > + if (old->base_gfn != memslot->base_gfn)
> > > + kvmppc_unmap_memslot(kvm, old);
> >
> > This case is also handled generically by the last kvm_arch_flush_shadow
> > call in __kvm_set_memory_region.
>
> Again, to use this we would need to know which memslot we're
> flushing. If we could change __kvm_set_memory_region to pass the
> memslot for these kvm_arch_flush_shadow calls, then I could do as you
> suggest. (Though I would need to think carefully about what would
> happen with guest invalidations of hardware PTEs in the interval
> between the rcu_assign_pointer(kvm->memslots, slots) and the
> kvm_arch_flush_shadow, and whether the invalidation would find the
> correct location in the rmap array, given that we have updated the
> base_gfn in the memslot without first getting rid of any references to
> those pages in the hardware page table.)
That can be done.
I'll send a patch to flush per memslot in the next few days; you can
work out the PPC details in the meantime.
To be clear: this is necessary to have consistent behaviour across
arches in the kvm_set_memory codepath, which is tricky (not nitpicking).
Alternatively, kvm_arch_flush_shadow can be split into two methods (but
that's not necessary if memslot information is sufficient for PPC).
> > > + if (memslot->dirty_bitmap &&
> > > + old->dirty_bitmap != memslot->dirty_bitmap)
> > > + kvmppc_hv_get_dirty_log(kvm, old);
> > > + return 0;
> > > + }
> >
> > Better clear dirty log unconditionally on kvm_arch_commit_memory_region,
> > similarly to x86 (just so its consistent).
>
> OK.
>
> > > --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > > +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > > @@ -81,7 +81,7 @@ static void remove_revmap_chain(struct kvm *kvm, long
> > > pte_index,
> > > ptel = rev->guest_rpte |= rcbits;
> > > gfn = hpte_rpn(ptel, hpte_page_size(hpte_v, ptel));
> > > memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > > - if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
> > > + if (!memslot)
> > > return;
> >
> > Why remove this check? (i don't know why it was there in the first
> > place, just checking).
>
> This is where we are removing the page backing a hardware PTE and thus
> removing the hardware PTE from the reverse-mapping list for the page.
> We want to be able to do that properly even if the memslot is in the
> process of going away. I had the flags check in there originally
> because other places that used a memslot had that check, but when I
> read __kvm_set_memory_region more carefully I realized that the
> KVM_MEMSLOT_INVALID flag indicates that we should not create any more
> references to pages in the memslot, but we do still need to be able to
> handle references going away, i.e. pages in the memslot getting
> unmapped.
>
> Paul.
Yes, that's it. kvm_arch_flush_shadow requires a functional memslot
lookup, for example.