On Tue, Sep 10, 2019 at 12:26 PM Thomas Hellström (VMware)
<[email protected]> wrote:
>
> On 9/10/19 6:11 PM, Andy Lutomirski wrote:
> >
> >> On Sep 5, 2019, at 8:24 AM, Christoph Hellwig <[email protected]> wrote:
> >>
> >>> On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware) wrote:
> >>>> On 9/5/19 4:15 PM, Dave Hansen wrote:
> >>>> Hi Thomas,
> >>>>
> >>>> Thanks for the second batch of patches! These look much improved on all
> >>>> fronts.
> >>> Yes, although the TTM functionality isn't in yet. Hopefully we won't have
> >>> to
> >>> bother you with those though, since this assumes TTM will be using the dma
> >>> API.
> >> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
> >> branch:
> >>
> >>
> >> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements
> >>
> >> they should allow to fault dma api pages in the page fault handler. But
> >> this is totally hot off the press and not actually tested for the last
> >> few patches. Note that I've also included your two patches from this
> >> series to handle SEV.
> > I read that patch, and it seems like you’ve built in the assumption that
> > all pages in the mapping use identical protection or, if not, that the same
> > fake vma hack that TTM already has is used to fudge around it. Could it be
> > reworked slightly to avoid this?
> >
> > I wonder if it’s a mistake to put the encryption bits in vm_page_prot at
> > all.
>
> From my POW, the encryption bits behave quite similar in behaviour to
> the caching mode bits, and they're also in vm_page_prot. They're the
> reason TTM needs to modify the page protection in the fault handler in
> the first place.
>
> The problem seen in TTM is that we want to be able to change the
> vm_page_prot from the fault handler, but it's problematic since we have
> the mmap_sem typically only in read mode. Hence the fake vma hack. From
> what I can tell it's reasonably well-behaved, since pte_modify() skips
> the bits TTM updates, so mprotect() and mremap() works OK. I think
> split_huge_pmd may run into trouble, but we don't support it (yet) with
> TTM.
One thing I'm confused about: does TTM move individual pages between
main memory and device memory or does it move whole mappings? If it
moves individual pages, then a single mapping could have PTEs from
dma_alloc_coherent() space and from PCI space. How can this work with
vm_page_prot? I guess you might get lucky and have both have the same
protection bits, but that seems like an unfortunate thing to rely on.
As a for-real example, take a look at arch/x86/entry/vdso/vma.c. The
"vvar" VMA contains multiple pages that are backed by different types
of memory. There's a page of ordinary kernel memory. Historically
there was a page of genuine MMIO memory, but that's gone now. If the
system is a SEV guest with pvclock enabled, then there's a page of
decrypted memory. They all share a VMA, they're instantiated in
.fault, and there is no hackery involved. Look at vvar_fault().
So, Christoph, can't you restructure your code a bit to compute the
caching and encryption state per page instead of once in
dma_mmap_prepare() and insert the pages accordingly? You might need
to revert 6d958546ff611c9ae09b181e628c1c5d5da5ebda depending on
exactly how you structure it.
>
> We could probably get away with a WRITE_ONCE() update of the
> vm_page_prot before taking the page table lock since
>
> a) We're locking out all other writers.
> b) We can't race with another fault to the same vma since we hold an
> address space lock ("buffer object reservation")
> c) When we need to update there are no valid page table entries in the
> vma, since it only happens directly after mmap(), or after an
> unmap_mapping_range() with the same address space lock. When another
> reader (for example split_huge_pmd()) sees a valid page table entry, it
> also sees the new page protection and things are fine.
>
> But that would really be a special case. To solve this properly we'd
> probably need an additional lock to protect the vm_flags and
> vm_page_prot, taken after mmap_sem and i_mmap_lock.
>
This is all horrible IMO.