Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages

2021-08-20 Thread Jerome Glisse
On Thu, Aug 19, 2021 at 10:05 PM Christoph Hellwig  wrote:
>
> On Tue, Aug 17, 2021 at 11:44:54AM -0400, Felix Kuehling wrote:
> > >> That's a good catch. Existing drivers shouldn't need a page_free
> > >> callback if they didn't have one before. That means we need to add a
> > >> NULL-pointer check in free_device_page.
> > > Also the other state clearing (__ClearPageWaiters/mem_cgroup_uncharge/
> > > ->mapping = NULL).
> > >
> > > In many ways this seems like you want to bring back the DEVICE_PUBLIC
> > > pgmap type that was removed a while ago due to the lack of users
> > > instead of overloading the generic type.
> >
> > I think so. I'm not clear about how DEVICE_PUBLIC differed from what
> > DEVICE_GENERIC is today. As I understand it, DEVICE_PUBLIC was removed
> > because it was unused and also known to be broken in some ways.
> > DEVICE_GENERIC seemed close enough to what we need, other than not being
> > supported in the migration helpers.
> >
> > Would you see benefit in re-introducing DEVICE_PUBLIC as a distinct
> > memory type from DEVICE_GENERIC? What would be the benefits of making
> > that distinction?
>
> The old DEVICE_PUBLIC mostly different in that it allowed the page
> to be returned from vm_normal_page, which I think was horribly buggy.

Why was that buggy? If I were to do it now, I would return
DEVICE_PUBLIC pages from vm_normal_page, but I would ban pinning, as
pinning is exceptionally wrong for a GPU. If you migrate some random
anonymous/file page back to your GPU memory and it gets pinned there,
then there is no way for the GPU to migrate the page out. You quickly
run out of physically contiguous memory, and things like big graphics
buffer allocations (anything that needs physically contiguous memory)
will fail. It is less of an issue on hardware that relies less and
less on physically contiguous memory, but I do not think the problem is
completely gone from all hardware.
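
A minimal sketch of that kind of pin ban, hedged: is_zone_device_page() and the
FOLL_PIN/FOLL_GET flags are real kernel symbols, but the helper below and the
place it would be called from are assumptions for illustration only, not the
actual change being proposed.

#include <linux/mm.h>

/*
 * Illustrative sketch only: refuse user pins on coherent device pages so
 * the GPU driver can always migrate them back out of GPU memory.
 */
static inline bool page_pinnable_sketch(struct page *page, unsigned int gup_flags)
{
	/* Plain short-lived references (FOLL_GET) may be acceptable. */
	if (!(gup_flags & FOLL_PIN))
		return true;

	/*
	 * Never allow pinning ZONE_DEVICE (e.g. DEVICE_PUBLIC) pages:
	 * a pinned page cannot be migrated out again.
	 */
	if (is_zone_device_page(page))
		return false;

	return true;
}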

> But the point is not to bring back these old semantics.  The idea
> is to be able to differeniate between your new coherent on-device
> memory and the existing DEVICE_GENERIC.  That is call the
> code in free_devmap_managed_page that is currently only used
> for device private pages also for your new public device pages without
> affecting the devdax and xen use cases.

Yes, I would rather bring back DEVICE_PUBLIC than try to use
DEVICE_GENERIC. The GENERIC type was added for users that closely
match DAX semantics, and that is not the case here, at least not from
my point of view.
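
For concreteness, a rough sketch of the free-path dispatch discussed above; the
MEMORY_DEVICE_PUBLIC case is shown only as a placeholder for the revived type,
and this is an illustration of the idea, not the actual patch.

#include <linux/mm.h>
#include <linux/memremap.h>
#include <linux/memcontrol.h>

/*
 * Sketch only: route device-private and (hypothetically revived)
 * device-public pages through the extra state clearing and an optional
 * page_free() callback, without touching the devdax/xen paths.
 */
static void free_device_page_sketch(struct page *page)
{
	struct dev_pagemap *pgmap = page->pgmap;

	switch (pgmap->type) {
	case MEMORY_DEVICE_PRIVATE:
	/* case MEMORY_DEVICE_PUBLIC: the proposed new type would go here */
		__ClearPageWaiters(page);
		mem_cgroup_uncharge(page);
		page->mapping = NULL;
		/* existing drivers may not provide a callback */
		if (pgmap->ops && pgmap->ops->page_free)
			pgmap->ops->page_free(page);
		break;
	default:
		/* devdax, xen, etc. keep their current behaviour */
		break;
	}
}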

Jerome


Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount

2021-08-20 Thread Jerome Glisse
On Thu, Aug 19, 2021 at 11:00 AM Sierra Guiza, Alejandro (Alex)
 wrote:
>
>
> On 8/18/2021 2:28 PM, Ralph Campbell wrote:
> > On 8/17/21 5:35 PM, Felix Kuehling wrote:
> >> Am 2021-08-17 um 8:01 p.m. schrieb Ralph Campbell:
> >>> On 8/12/21 11:31 PM, Alex Sierra wrote:
>  From: Ralph Campbell 
> 
>  ZONE_DEVICE struct pages have an extra reference count that
>  complicates the
>  code for put_page() and several places in the kernel that need to
>  check the
>  reference count to see that a page is not being used (gup, compaction,
>  migration, etc.). Clean up the code so the reference count doesn't
>  need to
>  be treated specially for ZONE_DEVICE.
> 
>  v2:
>  AS: merged this patch in linux 5.11 version
> 
>  v5:
>  AS: add condition at try_grab_page to check for the zone device type,
>  while
>  page ref counter is checked less/equal to zero. In case of device
>  zone, pages
>  ref counter are initialized to zero.
> 
>  Signed-off-by: Ralph Campbell 
>  Signed-off-by: Alex Sierra 
>  ---
> arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
> drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
> fs/dax.c   |  4 +-
> include/linux/dax.h|  2 +-
> include/linux/memremap.h   |  7 +--
> include/linux/mm.h | 13 +
> lib/test_hmm.c |  2 +-
> mm/internal.h  |  8 +++
> mm/memremap.c  | 68
>  +++---
> mm/migrate.c   |  5 --
> mm/page_alloc.c|  3 ++
> mm/swap.c  | 45 ++---
> 12 files changed, 46 insertions(+), 115 deletions(-)
> 
> >>> I haven't seen a response to the issues I raised back at v3 of this
> >>> series.
> >>> https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc29bc@nvidia.com/
> >>>
> >>>
> >>>
> >>> Did I miss something?
> >> I think part of the response was that we did more testing. Alex added
> >> support for DEVICE_GENERIC pages to test_hmm and he ran DAX tests
> >> recommended by Theodore Tso. In that testing he ran into a WARN_ON_ONCE
> >> about a zero page refcount in try_get_page. The fix is in the latest
> >> version of patch 2. But it's already obsolete because John Hubbard is
> >> about to remove that function altogether.
> >>
> >> I think the issues you raised were more uncertainty than known bugs. It
> >> seems the fact that you can have DAX pages with 0 refcount is a feature
> >> more than a bug.
> >>
> >> Regards,
> >>Felix
> >
> > Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined?
> > In that case, mmap() of a DAX device will call insert_page() which calls
> > get_page() which would trigger VM_BUG_ON_PAGE().
> >
> > I can believe it is OK for PTE_SPECIAL page table entries to have no
> > struct page or that MEMORY_DEVICE_GENERIC struct pages be mapped with
> > a zero reference count using insert_pfn().
> Hi Ralph,
> We have tried the DAX tests with and without CONFIG_ARCH_HAS_PTE_SPECIAL
> defined.
> Apparently none of the tests touches that condition for a DAX device. Of
> course,
> that doesn't mean it could happen.
>
> Regards,
> Alex S.
>
> >
> >
> > I find it hard to believe that other MM developers don't see an issue
> > with a struct page with refcount == 0 and mapcount == 1.
> >
> > I don't see where init_page_count() is being called for the
> > MEMORY_DEVICE_GENERIC or MEMORY_DEVICE_PRIVATE struct pages the AMD
> > driver allocates and passes to migrate_vma_setup().
> > Looks like svm_migrate_get_vram_page() needs to call init_page_count()
> > instead of get_page(). (I'm looking at branch
> > origin/alexsierrag/device_generic
> > https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver.git)
> Yes, you're right. My bad. Thanks for catching this up. I didn't realize
> I was missing
> to define CONFIG_DEBUG_VM on my build. Therefore this BUG was never caught.
> It worked after I replaced get_pages by init_page_count at
> svm_migrate_get_vram_page. However, I don't think this is the best 

Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount

2021-08-20 Thread Jerome Glisse
Note that you do not want GUP to succeed on device pages; I do not see
where that is handled in the new code.

On Sun, Aug 15, 2021 at 1:40 PM John Hubbard  wrote:
>
> On 8/15/21 8:37 AM, Christoph Hellwig wrote:
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 8ae31622deef..d48a1f0889d1 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -1218,7 +1218,7 @@ __maybe_unused struct page 
> >> *try_grab_compound_head(struct page *page, int refs,
> >>   static inline __must_check bool try_get_page(struct page *page)
> >>   {
> >>  page = compound_head(page);
> >> -if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> >> +if (WARN_ON_ONCE(page_ref_count(page) < 
> >> (int)!is_zone_device_page(page)))
> >
> > Please avoid the overly long line.  In fact I'd be tempted to just not
> > bother here and keep the old, more lose check.  Especially given that
> > John has a patch ready that removes try_get_page entirely.
> >
>
> Yes. Andrew has accepted it into mmotm.
>
> Ralph's patch here was written well before my cleanup that removed
> try_grab_page() [1]. But now that we're here, if you drop this hunk then
> it will make merging easier, I think.
>
>
> [1] https://lore.kernel.org/r/20210813044133.1536842-4-jhubb...@nvidia.com
>
> thanks,
> --
> John Hubbard
> NVIDIA
>


Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)

2021-01-14 Thread Jerome Glisse
On Thu, Jan 14, 2021 at 02:37:36PM +0100, Christian König wrote:
> Am 14.01.21 um 12:52 schrieb Daniel Vetter:
> > [SNIP]
> > > > I had a new idea, i wanted to think more about it but have not yet,
> > > > anyway here it is. Adding a new callback to dma fence which ask the
> > > > question can it dead lock ? Any time a GPU driver has pending page
> > > > fault (ie something calling into the mm) it answer yes, otherwise
> > > > no. The GPU shrinker would ask the question before waiting on any
> > > > dma-fence and back of if it gets yes. Shrinker can still try many
> > > > dma buf object for which it does not get a yes on associated fence.
> > > > 
> > > > This does not solve the mmu notifier case, for this you would just
> > > > invalidate the gem userptr object (with a flag but not releasing the
> > > > page refcount) but you would not wait for the GPU (ie no dma fence
> > > > wait in that code path anymore). The userptr API never really made
> > > > the contract that it will always be in sync with the mm view of the
> > > > world so if different page get remapped to same virtual address
> > > > while GPU is still working with the old pages it should not be an
> > > > issue (it would not be in our usage of userptr for compositor and
> > > > what not).
> > > The current working idea in my mind goes into a similar direction.
> > > 
> > > But instead of a callback I'm adding a complete new class of HMM fences.
> > > 
> > > Waiting in the MMU notfier, scheduler, TTM etc etc is only allowed for
> > > the dma_fences and HMM fences are ignored in container objects.
> > > 
> > > When you handle an implicit or explicit synchronization request from
> > > userspace you need to block for HMM fences to complete before taking any
> > > resource locks.
> > Isnt' that what I call gang scheduling? I.e. you either run in HMM
> > mode, or in legacy fencing mode (whether implicit or explicit doesn't
> > really matter I think). By forcing that split we avoid the problem,
> > but it means occasionally full stalls on mixed workloads.
> > 
> > But that's not what Jerome wants (afaiui at least), I think his idea
> > is to track the reverse dependencies of all the fences floating
> > around, and then skip evicting an object if you have to wait for any
> > fence that is problematic for the current calling context. And I don't
> > think that's very feasible in practice.
> > 
> > So what kind of hmm fences do you have in mind here?
> 
> It's a bit more relaxed than your gang schedule.
> 
> See the requirements are as follow:
> 
> 1. dma_fences never depend on hmm_fences.
> 2. hmm_fences can never preempt dma_fences.
> 3. dma_fences must be able to preempt hmm_fences or we always reserve enough
> hardware resources (CUs) to guarantee forward progress of dma_fences.
> 
> Critical sections are MMU notifiers, page faults, GPU schedulers and
> dma_reservation object locks.
> 
> 4. It is valid to wait for a dma_fences in critical sections.
> 5. It is not valid to wait for hmm_fences in critical sections.
> 
> Fence creation either happens during command submission or by adding
> something like a barrier or signal command to your userspace queue.
> 
> 6. If we have an hmm_fence as implicit or explicit dependency for creating a
> dma_fence we must wait for that before taking any locks or reserving
> resources.
> 7. If we have a dma_fence as implicit or explicit dependency for creating an
> hmm_fence we can wait later on. So busy waiting or special WAIT hardware
> commands are valid.
> 
> This prevents hard cuts, e.g. can mix hmm_fences and dma_fences at the same
> time on the hardware.
> 
> In other words we can have a high priority gfx queue running jobs based on
> dma_fences and a low priority compute queue running jobs based on
> hmm_fences.
> 
> Only when we switch from hmm_fence to dma_fence we need to block the
> submission until all the necessary resources (both memory as well as CUs)
> are available.
> 
> This is somewhat an extension to your gang submit idea.

What is an hmm_fence? You should not have fences with HMM at all.
So I am kind of scared now.

Cheers,
Jérôme



Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD

2021-01-13 Thread Jerome Glisse
On Wed, Jan 13, 2021 at 09:31:11PM +0100, Daniel Vetter wrote:
> On Wed, Jan 13, 2021 at 5:56 PM Jerome Glisse  wrote:
> > On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
> > > On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> > > > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> > > > > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> > > > >> This is the first version of our HMM based shared virtual memory 
> > > > >> manager
> > > > >> for KFD. There are still a number of known issues that we're working 
> > > > >> through
> > > > >> (see below). This will likely lead to some pretty significant 
> > > > >> changes in
> > > > >> MMU notifier handling and locking on the migration code paths. So 
> > > > >> don't
> > > > >> get hung up on those details yet.
> > > > >>
> > > > >> But I think this is a good time to start getting feedback. We're 
> > > > >> pretty
> > > > >> confident about the ioctl API, which is both simple and extensible 
> > > > >> for the
> > > > >> future. (see patches 4,16) The user mode side of the API can be 
> > > > >> found here:
> > > > >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> > > > >>
> > > > >> I'd also like another pair of eyes on how we're interfacing with the 
> > > > >> GPU VM
> > > > >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling 
> > > > >> (24,25),
> > > > >> and some retry IRQ handling changes (32).
> > > > >>
> > > > >>
> > > > >> Known issues:
> > > > >> * won't work with IOMMU enabled, we need to dma_map all pages 
> > > > >> properly
> > > > >> * still working on some race conditions and random bugs
> > > > >> * performance is not great yet
> > > > > Still catching up, but I think there's another one for your list:
> > > > >
> > > > >  * hmm gpu context preempt vs page fault handling. I've had a short
> > > > >discussion about this one with Christian before the holidays, and 
> > > > > also
> > > > >some private chats with Jerome. It's nasty since no easy fix, much 
> > > > > less
> > > > >a good idea what's the best approach here.
> > > >
> > > > Do you have a pointer to that discussion or any more details?
> > >
> > > Essentially if you're handling an hmm page fault from the gpu, you can
> > > deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> > > submissions or compute contexts with dma_fence_wait. Which deadlocks if
> > > you can't preempt while you have that page fault pending. Two solutions:
> > >
> > > - your hw can (at least for compute ctx) preempt even when a page fault is
> > >   pending
> > >
> > > - lots of screaming in trying to come up with an alternate solution. They
> > >   all suck.
> > >
> > > Note that the dma_fence_wait is hard requirement, because we need that for
> > > mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> > > management. Which is the current "ttm is self-limited to 50% of system
> > > memory" limitation Christian is trying to lift. So that's really not
> > > a restriction we can lift, at least not in upstream where we need to also
> > > support old style hardware which doesn't have page fault support and
> > > really has no other option to handle memory management than
> > > dma_fence_wait.
> > >
> > > Thread was here:
> > >
> > > https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=e...@mail.gmail.com/
> > >
> > > There's a few ways to resolve this (without having preempt-capable
> > > hardware), but they're all supremely nasty.
> > > -Daniel
> > >
> >
> > I had a new idea, i wanted to think more about it but have not yet,
> > anyway here it is. Adding a new callback to dma fence which ask the
> > question can it dead lock ? Any time a GPU driver has pending page
> > fault (ie something calling into the mm) it answer yes, otherwise
> > no. The GPU shrinker would ask the question before waiting o

Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD

2021-01-13 Thread Jerome Glisse
On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> > > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> > >> This is the first version of our HMM based shared virtual memory manager
> > >> for KFD. There are still a number of known issues that we're working 
> > >> through
> > >> (see below). This will likely lead to some pretty significant changes in
> > >> MMU notifier handling and locking on the migration code paths. So don't
> > >> get hung up on those details yet.
> > >>
> > >> But I think this is a good time to start getting feedback. We're pretty
> > >> confident about the ioctl API, which is both simple and extensible for 
> > >> the
> > >> future. (see patches 4,16) The user mode side of the API can be found 
> > >> here:
> > >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> > >>
> > >> I'd also like another pair of eyes on how we're interfacing with the GPU 
> > >> VM
> > >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling 
> > >> (24,25),
> > >> and some retry IRQ handling changes (32).
> > >>
> > >>
> > >> Known issues:
> > >> * won't work with IOMMU enabled, we need to dma_map all pages properly
> > >> * still working on some race conditions and random bugs
> > >> * performance is not great yet
> > > Still catching up, but I think there's another one for your list:
> > >
> > >  * hmm gpu context preempt vs page fault handling. I've had a short
> > >discussion about this one with Christian before the holidays, and also
> > >some private chats with Jerome. It's nasty since no easy fix, much less
> > >a good idea what's the best approach here.
> > 
> > Do you have a pointer to that discussion or any more details?
> 
> Essentially if you're handling an hmm page fault from the gpu, you can
> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> submissions or compute contexts with dma_fence_wait. Which deadlocks if
> you can't preempt while you have that page fault pending. Two solutions:
> 
> - your hw can (at least for compute ctx) preempt even when a page fault is
>   pending
> 
> - lots of screaming in trying to come up with an alternate solution. They
>   all suck.
> 
> Note that the dma_fence_wait is hard requirement, because we need that for
> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> management. Which is the current "ttm is self-limited to 50% of system
> memory" limitation Christian is trying to lift. So that's really not
> a restriction we can lift, at least not in upstream where we need to also
> support old style hardware which doesn't have page fault support and
> really has no other option to handle memory management than
> dma_fence_wait.
> 
> Thread was here:
> 
> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=e...@mail.gmail.com/
> 
> There's a few ways to resolve this (without having preempt-capable
> hardware), but they're all supremely nasty.
> -Daniel
> 

I had a new idea; I wanted to think more about it but have not yet,
so anyway here it is. Add a new callback to dma fence that asks the
question: can it deadlock? Any time a GPU driver has a pending page
fault (ie something calling into the mm) it answers yes, otherwise
no. The GPU shrinker would ask the question before waiting on any
dma-fence and back off if it gets a yes. The shrinker can still try the
many dma-buf objects for which it does not get a yes on the associated
fence.

This does not solve the mmu notifier case. For that you would just
invalidate the gem userptr object (with a flag, but without releasing
the page refcount) and you would not wait for the GPU (ie no dma fence
wait in that code path anymore). The userptr API never really made
the contract that it will always be in sync with the mm view of the
world, so if different pages get remapped to the same virtual address
while the GPU is still working with the old pages it should not be an
issue (it would not be in our usage of userptr for compositors and
what not).

Maybe I am overlooking something there.
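
A hedged sketch of that callback idea, purely illustrative: struct dma_fence and
dma_fence_wait() are real, but the can_deadlock hook and the shrinker helper
below are hypothetical and do not exist upstream.

#include <linux/dma-fence.h>

/*
 * Hypothetical extension for illustration only: the driver answers
 * "could waiting on this fence deadlock?" (yes while it has a page
 * fault pending that calls back into the mm), and the shrinker backs
 * off instead of waiting when the answer is yes.
 */
struct deadlock_aware_fence {
	struct dma_fence base;
	/* hypothetical hook, not part of struct dma_fence_ops upstream */
	bool (*can_deadlock)(struct dma_fence *fence);
};

/* Shrinker-side helper: returns true if it actually waited. */
static bool shrinker_try_wait(struct deadlock_aware_fence *f)
{
	if (f->can_deadlock && f->can_deadlock(&f->base))
		return false;	/* back off, try another dma-buf object */

	dma_fence_wait(&f->base, false);
	return true;
}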

Cheers,
Jérôme



Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD

2021-01-13 Thread Jerome Glisse
On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> This is the first version of our HMM based shared virtual memory manager
> for KFD. There are still a number of known issues that we're working through
> (see below). This will likely lead to some pretty significant changes in
> MMU notifier handling and locking on the migration code paths. So don't
> get hung up on those details yet.

[...]

> Known issues:
> * won't work with IOMMU enabled, we need to dma_map all pages properly
> * still working on some race conditions and random bugs
> * performance is not great yet

What would those changes look like? Seeing the issues below, I do not
see how they interplay with the mmu notifier. Can you elaborate?

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-22 Thread Jerome Glisse
On Mon, Jun 22, 2020 at 08:46:17AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 04:31:47PM -0400, Jerome Glisse wrote:
> > Not doable as page refcount can change for things unrelated to GUP, with
> > John changes we can identify GUP and we could potentialy copy GUPed page
> > instead of COW but this can potentialy slow down fork() and i am not sure
> > how acceptable this would be. Also this does not solve GUP against page
> > that are already in fork tree ie page P0 is in process A which forks,
> > we now have page P0 in process A and B. Now we have process A which forks
> > again and we have page P0 in A, B, and C. Here B and C are two branches
> > with root in A. B and/or C can keep forking and grow the fork tree.
> 
> For a long time now RDMA has broken COW pages when creating user DMA
> regions.
> 
> The problem has been that fork re-COW's regions that had their COW
> broken.
> 
> So, if you break the COW upon mapping and prevent fork (and others)
> from copying DMA pinned then you'd cover the cases.

I am not sure we want to prevent COW for pinned GUP pages; this would
change the current semantics and potentially break or slow down existing apps.

Anyway, I think we focus too much on fork/COW. It is just an unfixable,
broken corner case that the mmu notifier allows you to avoid. Forcing a real
copy on fork would likely be seen as a regression by most people.


> > Semantic was change with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to some
> > what "fix" that but GUP fast is still succeptible to this.
> 
> Ah, so everyone breaks the COW now, not just RDMA..
> 
> What do you mean 'GUP fast is still succeptible to this' ?

Not all GUP fast paths are updated (intentionally); __get_user_pages_fast(),
for instance, still keeps COW intact. People using GUP should really know
what they are doing.

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 10:43:20PM +0200, Daniel Vetter wrote:
> On Fri, Jun 19, 2020 at 10:10 PM Jerome Glisse  wrote:
> >
> > On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > > > On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > > > >
> > > > > > The madness is only that device B's mmu notifier might need to wait
> > > > > > for fence_B so that the dma operation finishes. Which in turn has to
> > > > > > wait for device A to finish first.
> > > > >
> > > > > So, it sound, fundamentally you've got this graph of operations across
> > > > > an unknown set of drivers and the kernel cannot insert itself in
> > > > > dma_fence hand offs to re-validate any of the buffers involved?
> > > > > Buffers which by definition cannot be touched by the hardware yet.
> > > > >
> > > > > That really is a pretty horrible place to end up..
> > > > >
> > > > > Pinning really is right answer for this kind of work flow. I think
> > > > > converting pinning to notifers should not be done unless notifier
> > > > > invalidation is relatively bounded.
> > > > >
> > > > > I know people like notifiers because they give a bit nicer performance
> > > > > in some happy cases, but this cripples all the bad cases..
> > > > >
> > > > > If pinning doesn't work for some reason maybe we should address that?
> > > >
> > > > Note that the dma fence is only true for user ptr buffer which predate
> > > > any HMM work and thus were using mmu notifier already. You need the
> > > > mmu notifier there because of fork and other corner cases.
> > >
> > > I wonder if we should try to fix the fork case more directly - RDMA
> > > has this same problem and added MADV_DONTFORK a long time ago as a
> > > hacky way to deal with it.
> > >
> > > Some crazy page pin that resolved COW in a way that always kept the
> > > physical memory with the mm that initiated the pin?
> >
> > Just no way to deal with it easily, i thought about forcing the
> > anon_vma (page->mapping for anonymous page) to the anon_vma that
> > belongs to the vma against which the GUP was done but it would
> > break things if page is already in other branch of a fork tree.
> > Also this forbid fast GUP.
> >
> > Quite frankly the fork was not the main motivating factor. GPU
> > can pin potentialy GBytes of memory thus we wanted to be able
> > to release it but since Michal changes to reclaim code this is
> > no longer effective.
> 
> What where how? My patch to annote reclaim paths with mmu notifier
> possibility just landed in -mm, so if direct reclaim can't reclaim mmu
> notifier'ed stuff anymore we need to know.
> 
> Also this would resolve the entire pain we're discussing in this
> thread about dma_fence_wait deadlocking against anything that's not
> GFP_ATOMIC ...

Sorry, my bad; reclaim still works, only OOM skips. It was a couple of
years ago and I thought that some of the things discussed a while
back had made it upstream.

It is probably a good time to also point out that what I wanted
to do is have all the mmu notifier callbacks provide some kind
of fence (not a dma fence) so that we can split the notification
into steps (a rough sketch of this follows below):
A- schedule the notification on all devices/systems and get fences;
   this step should minimize lock dependencies and should
   not have to wait for anything; it is also best if you can avoid
   memory allocation, for instance by pre-allocating what
   you need for the notification.
B- the mm can do things like unmap but can not map new pages,
   so write special swap ptes to the cpu page table.
C- wait on each fence from A
... resume old code ie replace pte or finish unmap ...

The idea here is that at step C the core mm can decide to back
off if any fence returned from A has to wait. This means that
every device is invalidating for nothing, but if we get there
then it might still be a good thing, as next time around maybe
the kernel would be successful without a wait.

This would allow things like reclaim to make forward progress
and skip over waits, or limit the wait time to a given timeout.

Also, I thought of extending this even to multi-CPU TLB flushes so
that devices and CPUs follow the same pattern and we can make parallel
progress on each.


Getting to such a scheme is a lot of work. My plan was to first
get the fence as part of the notifier user API and hide it from
the mm inside the notifier common code, then update each core mm path to
the new model and see if there is any benefit from it. Reclaim would
be the first candidate.
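
A self-contained sketch of that A/B/C split, using made-up types (device_fence,
dev_ops, gpu_dev) rather than any existing mmu_notifier interface; it only
illustrates the ordering and the back-off decision at step C.

#include <stdbool.h>
#include <stddef.h>

struct device_fence { bool signaled; };

struct dev_ops {
	/* step A: post the invalidation and hand back a fence; must not
	 * block and ideally must not allocate memory. */
	struct device_fence *(*schedule_invalidate)(void *dev,
						    unsigned long start,
						    unsigned long end);
	/* step C: non-blocking completion check */
	bool (*fence_done)(struct device_fence *f);
};

struct gpu_dev { void *priv; const struct dev_ops *ops; };

/* Returns true if every device finished, false if the caller (for
 * example reclaim) should back off and retry the range later. */
static bool invalidate_range_two_phase(struct gpu_dev *devs, size_t n,
				       unsigned long start, unsigned long end,
				       bool may_wait)
{
	struct device_fence *fences[16];
	size_t i;

	if (n > 16)
		return false;

	/* A: schedule the notification on every device, collect fences */
	for (i = 0; i < n; i++)
		fences[i] = devs[i].ops->schedule_invalidate(devs[i].priv,
							     start, end);

	/* B: would happen here in the core mm -- unmap / write special
	 * swap ptes for the range, but do not map any new page yet. */

	/* C: wait on each fence, or back off if waiting is not allowed */
	for (i = 0; i < n; i++) {
		while (!devs[i].ops->fence_done(fences[i])) {
			if (!may_wait)
				return false;
		}
	}
	return true;
}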

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 04:55:38PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 03:48:49PM -0400, Felix Kuehling wrote:
> > Am 2020-06-19 um 2:18 p.m. schrieb Jason Gunthorpe:
> > > On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > >> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > >>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > >>>
> > >>>> The madness is only that device B's mmu notifier might need to wait
> > >>>> for fence_B so that the dma operation finishes. Which in turn has to
> > >>>> wait for device A to finish first.
> > >>> So, it sound, fundamentally you've got this graph of operations across
> > >>> an unknown set of drivers and the kernel cannot insert itself in
> > >>> dma_fence hand offs to re-validate any of the buffers involved?
> > >>> Buffers which by definition cannot be touched by the hardware yet.
> > >>>
> > >>> That really is a pretty horrible place to end up..
> > >>>
> > >>> Pinning really is right answer for this kind of work flow. I think
> > >>> converting pinning to notifers should not be done unless notifier
> > >>> invalidation is relatively bounded. 
> > >>>
> > >>> I know people like notifiers because they give a bit nicer performance
> > >>> in some happy cases, but this cripples all the bad cases..
> > >>>
> > >>> If pinning doesn't work for some reason maybe we should address that?
> > >> Note that the dma fence is only true for user ptr buffer which predate
> > >> any HMM work and thus were using mmu notifier already. You need the
> > >> mmu notifier there because of fork and other corner cases.
> > > I wonder if we should try to fix the fork case more directly - RDMA
> > > has this same problem and added MADV_DONTFORK a long time ago as a
> > > hacky way to deal with it.
> > >
> > > Some crazy page pin that resolved COW in a way that always kept the
> > > physical memory with the mm that initiated the pin?
> > >
> > > (isn't this broken for O_DIRECT as well anyhow?)
> > >
> > > How does mmu_notifiers help the fork case anyhow? Block fork from
> > > progressing?
> > 
> > How much the mmu_notifier blocks fork progress depends, on quickly we
> > can preempt GPU jobs accessing affected memory. If we don't have
> > fine-grained preemption capability (graphics), the best we can do is
> > wait for the GPU jobs to complete. We can also delay submission of new
> > GPU jobs to the same memory until the MMU notifier is done. Future jobs
> > would use the new page addresses.
> > 
> > With fine-grained preemption (ROCm compute), we can preempt GPU work on
> > the affected adders space to minimize the delay seen by fork.
> > 
> > With recoverable device page faults, we can invalidate GPU page table
> > entries, so device access to the affected pages stops immediately.
> > 
> > In all cases, the end result is, that the device page table gets updated
> > with the address of the copied pages before the GPU accesses the COW
> > memory again.Without the MMU notifier, we'd end up with the GPU
> > corrupting memory of the other process.
> 
> The model here in fork has been wrong for a long time, and I do wonder
> how O_DIRECT manages to not be broken too.. I guess the time windows
> there are too small to get unlucky.

This was discussed extensively in the GUP work John has been doing.
Yes, O_DIRECT can potentially break, but only if you are writing to
COW pages, you initiated the O_DIRECT right before the fork, and
GUP happened before fork was able to write-protect the pages.

If you O_DIRECT but use the memory as input, ie you are writing the
memory to the file, not reading from the file, then fork is harmless
as you are just reading memory. You can still face the COW uncertainty
(the process against which you did the O_DIRECT gets "new" pages but your
O_DIRECT goes on with the "old" pages), but doing O_DIRECT and fork
concurrently is asking for trouble.

> 
> If you have a write pin on a page then it should not be COW'd into the
> fork'd process but copied with the originating page remaining with the
> original mm.
> 
> I wonder if there is some easy way to achive that - if that is the
> main reason to use notifiers then it would be a better solution.

Not doable, as the page refcount can change for things unrelated to GUP. With
John's changes we can identify GUP and we could potentially copy GUPed 

Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > > 
> > > > The madness is only that device B's mmu notifier might need to wait
> > > > for fence_B so that the dma operation finishes. Which in turn has to
> > > > wait for device A to finish first.
> > > 
> > > So, it sound, fundamentally you've got this graph of operations across
> > > an unknown set of drivers and the kernel cannot insert itself in
> > > dma_fence hand offs to re-validate any of the buffers involved?
> > > Buffers which by definition cannot be touched by the hardware yet.
> > > 
> > > That really is a pretty horrible place to end up..
> > > 
> > > Pinning really is right answer for this kind of work flow. I think
> > > converting pinning to notifers should not be done unless notifier
> > > invalidation is relatively bounded. 
> > > 
> > > I know people like notifiers because they give a bit nicer performance
> > > in some happy cases, but this cripples all the bad cases..
> > > 
> > > If pinning doesn't work for some reason maybe we should address that?
> > 
> > Note that the dma fence is only true for user ptr buffer which predate
> > any HMM work and thus were using mmu notifier already. You need the
> > mmu notifier there because of fork and other corner cases.
> 
> I wonder if we should try to fix the fork case more directly - RDMA
> has this same problem and added MADV_DONTFORK a long time ago as a
> hacky way to deal with it.
>
> Some crazy page pin that resolved COW in a way that always kept the
> physical memory with the mm that initiated the pin?

There is just no way to deal with it easily. I thought about forcing the
anon_vma (page->mapping for an anonymous page) to the anon_vma that
belongs to the vma against which the GUP was done, but it would
break things if the page is already in another branch of a fork tree.
Also, this forbids fast GUP.

Quite frankly, fork was not the main motivating factor. A GPU
can pin potentially gigabytes of memory, so we wanted to be able
to release it, but since Michal's changes to the reclaim code this is
no longer effective.

User buffers should never end up in those weird corner cases; iirc
the first usage was for xorg EXA texture upload, then it was generalized
to texture upload in mesa and later on to more upload cases
(vertices, ...). At least this is what I remember today. So in
those cases we do not expect fork, splice, mremap, mprotect, ...

Maybe we can audit how user ptr buffers are used today and see if
we can define a usage pattern that would allow cutting corners in the
kernel. For instance we could use the mmu notifier just to block CPU
pte updates while we do GUP and thus never wait on a dma fence.

Then the GPU driver just keeps the GUP pin around until it is done
with the pages. It can also use the mmu notifier to keep a flag
so that the driver knows if it needs to redo a GUP, ie:

The notifier path:
    GPU_mmu_notifier_start_callback(range)
        gpu_lock_cpu_pagetable(range)
        // mark every buffer object overlapping the range as stale
        for_each_bo_in(bo, range) {
            bo->need_gup = true;
        }
        gpu_unlock_cpu_pagetable(range)

    GPU_validate_buffer_pages(bo)
        if (!bo->need_gup)
            return;
        // drop the stale pages, then re-GUP under the CPU page table lock
        put_pages(bo->pages);
        range = bo_vaddr_range(bo)
        gpu_lock_cpu_pagetable(range)
        GUP(bo->pages, range)
        bo->need_gup = false;   // no re-GUP needed until the next invalidation
        gpu_unlock_cpu_pagetable(range)


Depending on how user_ptr is used today, this could work.


> (isn't this broken for O_DIRECT as well anyhow?)

Yes, it can in theory, if you have an application that does O_DIRECT
and fork concurrently (ie O_DIRECT in one thread and fork in another).
Note that O_DIRECT after fork is fine; it is an issue only if GUP_fast
was able to look up a page with write permission before fork had the
chance to update it to read-only for COW.

But doing O_DIRECT (or anything that uses GUP fast) in one thread and
fork in another is inherently broken, ie there is no way to fix it.

See 17839856fd588f4ab6b789f482ed3ffd7c403e1f

> 
> How does mmu_notifiers help the fork case anyhow? Block fork from
> progressing?

It enforces ordering between fork and GUP: if fork is first it blocks
GUP, and if fork is last then fork waits on GUP and then the user buffer
gets invalidated.

> 
> > I probably need to warn AMD folks again that using HMM means that you
> > must be able to update the GPU page table asynchronously without
> > fence wait.
> 
> It is kind of unrelated to HMM, it just shouldn't be using mmu
> notifiers to replace page pinning..

Well, my POV is that if you abide by the rules HMM defined then y

Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 03:30:32PM -0400, Felix Kuehling wrote:
> 
> Am 2020-06-19 um 3:11 p.m. schrieb Alex Deucher:
> > On Fri, Jun 19, 2020 at 2:09 PM Jerome Glisse  wrote:
> >> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> >>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> >>>
> >>>> The madness is only that device B's mmu notifier might need to wait
> >>>> for fence_B so that the dma operation finishes. Which in turn has to
> >>>> wait for device A to finish first.
> >>> So, it sound, fundamentally you've got this graph of operations across
> >>> an unknown set of drivers and the kernel cannot insert itself in
> >>> dma_fence hand offs to re-validate any of the buffers involved?
> >>> Buffers which by definition cannot be touched by the hardware yet.
> >>>
> >>> That really is a pretty horrible place to end up..
> >>>
> >>> Pinning really is right answer for this kind of work flow. I think
> >>> converting pinning to notifers should not be done unless notifier
> >>> invalidation is relatively bounded.
> >>>
> >>> I know people like notifiers because they give a bit nicer performance
> >>> in some happy cases, but this cripples all the bad cases..
> >>>
> >>> If pinning doesn't work for some reason maybe we should address that?
> >> Note that the dma fence is only true for user ptr buffer which predate
> >> any HMM work and thus were using mmu notifier already. You need the
> >> mmu notifier there because of fork and other corner cases.
> >>
> >> For nouveau the notifier do not need to wait for anything it can update
> >> the GPU page table right away. Modulo needing to write to GPU memory
> >> using dma engine if the GPU page table is in GPU memory that is not
> >> accessible from the CPU but that's never the case for nouveau so far
> >> (but i expect it will be at one point).
> >>
> >>
> >> So i see this as 2 different cases, the user ptr case, which does pin
> >> pages by the way, where things are synchronous. Versus the HMM cases
> >> where everything is asynchronous.
> >>
> >>
> >> I probably need to warn AMD folks again that using HMM means that you
> >> must be able to update the GPU page table asynchronously without
> >> fence wait. The issue for AMD is that they already update their GPU
> >> page table using DMA engine. I believe this is still doable if they
> >> use a kernel only DMA engine context, where only kernel can queue up
> >> jobs so that you do not need to wait for unrelated things and you can
> >> prioritize GPU page table update which should translate in fast GPU
> >> page table update without DMA fence.
> > All devices which support recoverable page faults also have a
> > dedicated paging engine for the kernel driver which the driver already
> > makes use of.  We can also update the GPU page tables with the CPU.
> 
> We have a potential problem with CPU updating page tables while the GPU
> is retrying on page table entries because 64 bit CPU transactions don't
> arrive in device memory atomically.
> 
> We are using SDMA for page table updates. This currently goes through a
> the DRM GPU scheduler to a special SDMA queue that's used by kernel-mode
> only. But since it's based on the DRM GPU scheduler, we do use dma-fence
> to wait for completion.

Yeah, my worry is mostly that some cross dma fence dependency leaks into it,
but it should never happen really; maybe there is a way to catch it if it
does and print a warning.

So yes, you can use dma fences, as long as they do not have cross-dependencies.
Another expectation is that they complete quickly, and usually page
table updates do.

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Thu, Jun 11, 2020 at 07:35:35PM -0400, Felix Kuehling wrote:
> Am 2020-06-11 um 10:15 a.m. schrieb Jason Gunthorpe:
> > On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
> >>> I still have my doubts about allowing fence waiting from within shrinkers.
> >>> IMO ideally they should use a trywait approach, in order to allow memory
> >>> allocation during command submission for drivers that
> >>> publish fences before command submission. (Since early reservation object
> >>> release requires that).
> >> Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up
> >> with a mempool to make sure it can handle it's allocations.
> >>
> >>> But since drivers are already waiting from within shrinkers and I take 
> >>> your
> >>> word for HMM requiring this,
> >> Yeah the big trouble is HMM and mmu notifiers. That's the really awkward
> >> one, the shrinker one is a lot less established.
> > I really question if HW that needs something like DMA fence should
> > even be using mmu notifiers - the best use is HW that can fence the
> > DMA directly without having to get involved with some command stream
> > processing.
> >
> > Or at the very least it should not be a generic DMA fence but a
> > narrowed completion tied only into the same GPU driver's command
> > completion processing which should be able to progress without
> > blocking.
> >
> > The intent of notifiers was never to endlessly block while vast
> > amounts of SW does work.
> >
> > Going around and switching everything in a GPU to GFP_ATOMIC seems
> > like bad idea.
> >
> >> I've pinged a bunch of armsoc gpu driver people and ask them how much this
> >> hurts, so that we have a clear answer. On x86 I don't think we have much
> >> of a choice on this, with userptr in amd and i915 and hmm work in nouveau
> >> (but nouveau I think doesn't use dma_fence in there). 
> 
> Soon nouveau will get company. We're working on a recoverable page fault
> implementation for HMM in amdgpu where we'll need to update page tables
> using the GPUs SDMA engine and wait for corresponding fences in MMU
> notifiers.

Note that HMM mandates, and I stressed this several times in the past,
that all GPU page table updates are asynchronous and do not have to
wait on _anything_.

I understand that you use a DMA engine for GPU page table updates, but
if you want to do so with HMM then you need a page-table-update-only
DMA context that all GPU page table updates go through and to
which user space cannot queue up jobs.

It can be for HMM only, but if you want to mix HMM with non-HMM then
everything needs to be on that queue, and other command queues will have
to depend on it.
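
An illustrative sketch of that rule with entirely hypothetical names
(paging_queue, submitter); it is not amdgpu or nouveau API, it only shows the
dependency direction being described.

#include <stdbool.h>

struct paging_queue {
	bool kernel_only;	/* user space has no submission path here */
};

enum submitter { SUBMIT_KERNEL, SUBMIT_USER };

/* Page table update jobs may only be queued by the kernel driver. */
static bool may_submit_pt_update(const struct paging_queue *q, enum submitter who)
{
	return q->kernel_only && who == SUBMIT_KERNEL;
}

/* User queues may depend on the paging queue, never the other way
 * around, so page table updates always make forward progress. */
static bool may_add_dependency(bool dep_is_user_work, bool job_is_pt_update)
{
	return !(job_is_pt_update && dep_is_user_work);
}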

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> 
> > The madness is only that device B's mmu notifier might need to wait
> > for fence_B so that the dma operation finishes. Which in turn has to
> > wait for device A to finish first.
> 
> So, it sound, fundamentally you've got this graph of operations across
> an unknown set of drivers and the kernel cannot insert itself in
> dma_fence hand offs to re-validate any of the buffers involved?
> Buffers which by definition cannot be touched by the hardware yet.
> 
> That really is a pretty horrible place to end up..
> 
> Pinning really is right answer for this kind of work flow. I think
> converting pinning to notifers should not be done unless notifier
> invalidation is relatively bounded. 
> 
> I know people like notifiers because they give a bit nicer performance
> in some happy cases, but this cripples all the bad cases..
> 
> If pinning doesn't work for some reason maybe we should address that?

Note that the dma fence issue is only true for user ptr buffers, which predate
any HMM work and thus were using the mmu notifier already. You need the
mmu notifier there because of fork and other corner cases.

For nouveau the notifier does not need to wait for anything; it can update
the GPU page table right away. Modulo needing to write to GPU memory
using the dma engine if the GPU page table is in GPU memory that is not
accessible from the CPU, but that's never the case for nouveau so far
(though I expect it will be at one point).


So I see this as two different cases: the user ptr case, which does pin
pages by the way, where things are synchronous, versus the HMM case,
where everything is asynchronous.


I probably need to warn AMD folks again that using HMM means that you
must be able to update the GPU page table asynchronously without a
fence wait. The issue for AMD is that they already update their GPU
page table using a DMA engine. I believe this is still doable if they
use a kernel-only DMA engine context, where only the kernel can queue up
jobs, so that you do not need to wait for unrelated things and you can
prioritize GPU page table updates, which should translate into fast GPU
page table updates without a DMA fence.


> > Full disclosure: We are aware that we've designed ourselves into an
> > impressive corner here, and there's lots of talks going on about
> > untangling the dma synchronization from the memory management
> > completely. But
> 
> I think the documenting is really important: only GPU should be using
> this stuff and driving notifiers this way. Complete NO for any
> totally-not-a-GPU things in drivers/accel for sure.

Yes, users that expect HMM need to be asynchronous. But it is
hard to revert user ptr as it was done a long time ago.

Cheers,
Jérôme



Re: [GIT PULL] Please pull hmm changes

2019-12-05 Thread Jerome Glisse
On Tue, Dec 03, 2019 at 02:42:12AM +, Jason Gunthorpe wrote:
> On Sat, Nov 30, 2019 at 10:23:31AM -0800, Linus Torvalds wrote:
> > On Sat, Nov 30, 2019 at 10:03 AM Linus Torvalds
> >  wrote:
> > >
> > > I'll try to figure the code out, but my initial reaction was "yeah,
> > > not in my VM".
> > 
> > Why is it ok to sometimes do
> > 
> > WRITE_ONCE(mni->invalidate_seq, cur_seq);
> > 
> > (to pair with the unlocked READ_ONCE), and sometimes then do
> > 
> > mni->invalidate_seq = mmn_mm->invalidate_seq;
> > 
> > My initial guess was that latter is only done at initialization time,
> > but at least in one case it's done *after* the mni has been added to
> > the mmn_mm (oh, how I despise those names - I can only repeat: WTF?).
> 
> Yes, the only occurrences are in the notifier_insert, under the
> spinlock. The one case where it is out of the natural order was to
> make the manipulation of seq a bit saner, but in all cases since the
> spinlock is held there is no way for another thread to get the pointer
> to the 'mmu_interval_notifier *' to do the unlocked read.
> 
> Regarding the ugly names.. Naming has been really hard here because
> currently everything is a 'mmu notifier' and the natural abberviations
> from there are crummy. Here is the basic summary:
> 
> struct mmu_notifier_mm (ie the mm->mmu_notifier_mm)
>-> mmn_mm
> struct mm_struct 
>-> mm
> struct mmu_notifier (ie the user subscription to the mm_struct)
>-> mn
> struct mmu_interval_notifier (the other kind of user subscription)
>-> mni

What about "interval"? The context should already tell people
it is related to the mmu notifier and is thus a notifier. I would
just remove the notifier suffix; this would match "range" below.

> struct mmu_notifier_range (ie the args to invalidate_range)
>-> range

Yeah, "range" works; the context should tell you it is related to the mmu
notifier.

> 
> I can send a patch to switch mmn_mm to mmu_notifier_mm, which is the
> only pre-existing name for this value. But IIRC, it is a somewhat ugly
> with long line wrapping. 'mni' is a pain, I have to reflect on that.
> (honesly, I dislike mmu_notififer_mm quite a lot too)
> 
> I think it would be overall nicer with better names for the original
> structs. Perhaps:
> 
>  mmn_* - MMU notifier prefix
>  mmn_state <- struct mmu_notifier_mm
>  mmn_subscription (mmn_sub) <- struct mmu_notifier
>  mmn_range_subscription (mmn_range_sub) <- struct mmu_interval_notifier
>  mmn_invalidate_desc <- struct mmu_notifier_range

This looks good.

> 
> At least this is how I describe them in my mind..  This is a lot of
> churn, and spreads through many drivers. This is why I kept the names
> as-is and we ended up with the also quite bad 'mmu_interval_notifier'
> 
> Maybe just switch mmu_notifier_mm for mmn_state and leave the drivers
> alone?
> 
> Anyone on the CC list have advice?

Maybe we can do a semantic patch to do the conversion, and then Linus
can easily apply the patch by just re-running coccinelle.

Cheers,
Jérôme


Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-08 Thread Jerome Glisse
On Thu, Nov 07, 2019 at 10:33:02PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 07, 2019 at 08:06:08PM +, Jason Gunthorpe wrote:
> > > 
> > > enum mmu_range_notifier_event {
> > >   MMU_NOTIFY_RELEASE,
> > > };
> > > 
> > > ...assuming that we stay with "mmu_range_notifier" as a core name for 
> > > this 
> > > whole thing.
> > > 
> > > Also, it is best moved down to be next to the new MNR structs, so that 
> > > all the
> > > MNR stuff is in one group.
> > 
> > I agree with Jerome, this enum is part of the 'struct
> > mmu_notifier_range' (ie the description of the invalidation) and it
> > doesn't really matter that only these new notifiers can be called with
> > this type, it is still part of the mmu_notifier_range.
> > 
> > The comment already says it only applies to the mmu_range_notifier
> > scheme..
> 
> In fact the enum is entirely unused.  We might as well just kill it off
> entirely.

I had patches to use it; I need to re-post them. I posted them long ago
and I dropped the ball. I will re-spin them after this.

Cheers,
Jérôme


Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-07 Thread Jerome Glisse
On Fri, Nov 08, 2019 at 12:32:25AM +, Jason Gunthorpe wrote:
> On Thu, Nov 07, 2019 at 04:04:08PM -0500, Jerome Glisse wrote:
> > On Thu, Nov 07, 2019 at 08:11:06PM +, Jason Gunthorpe wrote:
> > > On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> > > 
> > > > > 
> > > > > Extra credit: IMHO, this clearly deserves to all be in a new 
> > > > > mmu_range_notifier.h
> > > > > header file, but I know that's extra work. Maybe later as a follow-up 
> > > > > patch,
> > > > > if anyone has the time.
> > > > 
> > > > The range notifier should get the event too, it would be a waste, i 
> > > > think it is
> > > > an oversight here. The release event is fine so NAK to you separate 
> > > > event. Event
> > > > is really an helper for notifier i had a set of patch for nouveau to 
> > > > leverage
> > > > this i need to resucite them. So no need to split thing, i would just 
> > > > forward
> > > > the event ie add event to mmu_range_notifier_ops.invalidate() i failed 
> > > > to catch
> > > > that in v1 sorry.
> > > 
> > > I think what you mean is already done?
> > > 
> > > struct mmu_range_notifier_ops {
> > >   bool (*invalidate)(struct mmu_range_notifier *mrn,
> > >  const struct mmu_notifier_range *range,
> > >  unsigned long cur_seq);
> > 
> > Yes it is sorry, i got confuse with mmu_range_notifier and 
> > mmu_notifier_range :)
> > It is almost a palyndrome structure ;)
> 
> Lets change the name then, this is clearly not working. I'll reflow
> everything tomorrow

Here is a semantic patch to do that. Run it from your linux kernel directory with
your patch applied (you can run it one patch after the other and then git commit
-a --fixup HEAD):

spatch --sp-file name-of-the-file-below --dir . --all-includes --in-place

%< --
@@
@@
struct
-mmu_range_notifier
+mmu_interval_notifier

@@
@@
struct
-mmu_range_notifier
+mmu_interval_notifier
{...};

// Change mrn name to mmu_in
@@
struct mmu_interval_notifier *mrn;
@@
-mrn
+mmu_in

@@
identifier fn;
@@
fn(..., 
-struct mmu_interval_notifier *mrn,
+struct mmu_interval_notifier *mmu_in,
...) {...}
-- >%

You need coccinelle (which provides spatch). It is untested but it should work.
Also, I could not come up with a nice name to replace mrn, as "min" is way too
confusing. If you have a better name, feel free to use it.

Oh, and coccinelle is pretty clever about code formatting, so it should do a good
job at keeping things nicely formatted and aligned.

Cheers,
Jérôme


Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-07 Thread Jerome Glisse
On Thu, Nov 07, 2019 at 08:11:06PM +, Jason Gunthorpe wrote:
> On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> 
> > > 
> > > Extra credit: IMHO, this clearly deserves to all be in a new 
> > > mmu_range_notifier.h
> > > header file, but I know that's extra work. Maybe later as a follow-up 
> > > patch,
> > > if anyone has the time.
> > 
> > The range notifier should get the event too, it would be a waste, i think 
> > it is
> > an oversight here. The release event is fine so NAK to you separate event. 
> > Event
> > is really an helper for notifier i had a set of patch for nouveau to 
> > leverage
> > this i need to resucite them. So no need to split thing, i would just 
> > forward
> > the event ie add event to mmu_range_notifier_ops.invalidate() i failed to 
> > catch
> > that in v1 sorry.
> 
> I think what you mean is already done?
> 
> struct mmu_range_notifier_ops {
>   bool (*invalidate)(struct mmu_range_notifier *mrn,
>  const struct mmu_notifier_range *range,
>  unsigned long cur_seq);

Yes it is, sorry, I got confused between mmu_range_notifier and mmu_notifier_range :)
It is almost a palindrome structure ;)

> 
> > No it is always odd, you must call mmu_range_set_seq() only from the
> > op->invalidate_range() callback at which point the seq is odd. As well
> > when mrn is added and its seq first set it is set to an odd value
> > always. Maybe the comment, should read:
> > 
> >  * mrn->invalidate_seq is always, yes always, set to an odd value. This 
> > ensures
> > 
> > To stress that it is not an error.
> 
> I went with this:
> 
>   /*
>* mrn->invalidate_seq must always be set to an odd value via
>* mmu_range_set_seq() using the provided cur_seq from
>* mn_itree_inv_start_range(). This ensures that if seq does wrap we
>* will always clear the below sleep in some reasonable time as
>* mmn_mm->invalidate_seq is even in the idle state.
>*/

Yes fine with me.

[...]

> > > > +   might_lock(&mm->mmap_sem);
> > > > +
> > > > +   mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
> > > 
> > > What does the above pair with? Should have a comment that specifies that.
> > 
> > It was discussed in v1 but maybe a comment of what was said back then would
> > be helpful. Something like:
> > 
> > /*
> >  * We need to insure that all writes to mm->mmu_notifier_mm are visible 
> > before
> >  * any checks we do on mmn_mm below as otherwise CPU might re-order write 
> > done
> >  * by another CPU core to mm->mmu_notifier_mm structure fields after the 
> > read
> >  * belows.
> >  */
> 
> This comment made it, just at the store side:
> 
>   /*
>* Serialize the update against mmu_notifier_unregister. A
>* side note: mmu_notifier_release can't run concurrently with
>* us because we hold the mm_users pin (either implicitly as
>* current->mm or explicitly with get_task_mm() or similar).
>* We can't race against any other mmu notifier method either
>* thanks to mm_take_all_locks().
>*
>* release semantics on the initialization of the mmu_notifier_mm's
>  * contents are provided for unlocked readers.  acquire can only be
>  * used while holding the mmgrab or mmget, and is safe because once
>  * created the mmu_notififer_mm is not freed until the mm is
>  * destroyed.  As above, users holding the mmap_sem or one of the
>  * mm_take_all_locks() do not need to use acquire semantics.
>*/
>   if (mmu_notifier_mm)
>   smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);
> 
> Which I think is really overly belaboring the typical smp
> store/release pattern, but people do seem unfamiliar with them...

Perfect with me. I also think that sometimes you forget what the memory model
is and what the store/release pattern does; I know I do, and I need to refresh
my mind.
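
To make the pairing concrete, here is a minimal sketch (illustration only, not
the exact patch code; the structure names follow the quoted code):

    /* Writer (registration path), under mm_take_all_locks(): publish the
     * pointer only once every field of mmu_notifier_mm is initialized. */
    smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);

    /* Unlocked reader: the acquire pairs with the release above, so a
     * non-NULL pointer guarantees the pointed-to contents are visible. */
    struct mmu_notifier_mm *mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
    if (mmn_mm)
            spin_lock(&mmn_mm->lock);   /* safe to use the contents */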

Cheers,
Jérôme


Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-06 Thread Jerome Glisse
On Wed, Nov 06, 2019 at 04:23:21PM -0800, John Hubbard wrote:
> On 10/28/19 1:10 PM, Jason Gunthorpe wrote:

[...]

> >  /**
> >   * enum mmu_notifier_event - reason for the mmu notifier callback
> > @@ -32,6 +34,9 @@ struct mmu_notifier_range;
> >   * access flags). User should soft dirty the page in the end callback to 
> > make
> >   * sure that anyone relying on soft dirtyness catch pages that might be 
> > written
> >   * through non CPU mappings.
> > + *
> > + * @MMU_NOTIFY_RELEASE: used during mmu_range_notifier invalidate to 
> > signal that
> > + * the mm refcount is zero and the range is no longer accessible.
> >   */
> >  enum mmu_notifier_event {
> > MMU_NOTIFY_UNMAP = 0,
> > @@ -39,6 +44,7 @@ enum mmu_notifier_event {
> > MMU_NOTIFY_PROTECTION_VMA,
> > MMU_NOTIFY_PROTECTION_PAGE,
> > MMU_NOTIFY_SOFT_DIRTY,
> > +   MMU_NOTIFY_RELEASE,
> >  };
> 
> 
> OK, let the naming debates begin! ha. Anyway, after careful study of the 
> overall
> patch, and some browsing of the larger patchset, it's clear that:
> 
> * The new "MMU range notifier" that you've created is, approximately, a new
> object. It uses classic mmu notifiers inside, as an implementation detail, and
> it does *similar* things (notifications) as mmn's. But it's certainly not the 
> same
> as mmn's, as shown later when you say the need to an entirely new ops struct, 
> and 
> data struct too.
> 
> Therefore, you need a separate events enum as well. This is important. MMN's
> won't be issuing MMN_NOTIFY_RELEASE events, nor will MNR's be issuing the 
> first
> four prexisting MMU_NOTIFY_* items. So it would be a design mistake to glom 
> them
> together, unless you ultimately decided to merge these MMN and MNR objects 
> (which
> I don't really see any intention of, and that's fine).
> 
> So this should read:
> 
> enum mmu_range_notifier_event {
>   MMU_NOTIFY_RELEASE,
> };
> 
> ...assuming that we stay with "mmu_range_notifier" as a core name for this 
> whole thing.
> 
> Also, it is best moved down to be next to the new MNR structs, so that all the
> MNR stuff is in one group.
> 
> Extra credit: IMHO, this clearly deserves to all be in a new 
> mmu_range_notifier.h
> header file, but I know that's extra work. Maybe later as a follow-up patch,
> if anyone has the time.

The range notifier should get the event too; not having it would be a waste, and
I think it is an oversight here. The release event is fine, so NAK to your
separate event enum. The event is really a helper for the notifier; I had a set
of patches for nouveau to leverage this and I need to resuscitate them. So no
need to split things, I would just forward the event, i.e. add the event to
mmu_range_notifier_ops.invalidate(). I failed to catch that in v1, sorry.
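
(As the follow-up in this thread establishes, the event already travels in the
mmu_notifier_range passed to the callback, so a driver's invalidate can simply
branch on it. A rough sketch, with made-up mydrv_* names:

    static bool mydrv_invalidate(struct mmu_range_notifier *mrn,
                                 const struct mmu_notifier_range *range)
    {
            /* MMU_NOTIFY_RELEASE: the mm is going away, tear down the mirror. */
            if (range->event == MMU_NOTIFY_RELEASE) {
                    mydrv_release_mirror(mrn);      /* hypothetical helper */
                    return true;
            }
            /* otherwise a normal invalidation of [range->start, range->end) */
            mydrv_zap_device_ptes(mrn, range->start, range->end);
            return true;
    }
)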


[...]

> > +struct mmu_range_notifier_ops {
> > +   bool (*invalidate)(struct mmu_range_notifier *mrn,
> > +  const struct mmu_notifier_range *range,
> > +  unsigned long cur_seq);
> > +};
> > +
> > +struct mmu_range_notifier {
> > +   struct interval_tree_node interval_tree;
> > +   const struct mmu_range_notifier_ops *ops;
> > +   struct hlist_node deferred_item;
> > +   unsigned long invalidate_seq;
> > +   struct mm_struct *mm;
> > +};
> > +
> 
> Again, now we have the new struct mmu_range_notifier, and the old 
> struct mmu_notifier_range, and it's not good.
> 
> Ideas:
> 
> a) Live with it.
> 
> b) (Discarded, too many callers): rename old one. Nope.
> 
> c) Rename new one. Ideas:
> 
> struct mmu_interval_notifier
> struct mmu_range_intersection
> ...other ideas?

I vote for interval_notifier; we do want "notifier" in the name, but I am also
fine with the current name.

[...]

> > + *
> > + * Note that the core mm creates nested invalidate_range_start()/end() 
> > regions
> > + * within the same thread, and runs invalidate_range_start()/end() in 
> > parallel
> > + * on multiple CPUs. This is designed to not reduce concurrency or block
> > + * progress on the mm side.
> > + *
> > + * As a secondary function, holding the full write side also serves to 
> > prevent
> > + * writers for the itree, this is an optimization to avoid extra locking
> > + * during invalidate_range_start/end notifiers.
> > + *
> > + * The write side has two states, fully excluded:
> > + *  - mm->active_invalidate_ranges != 0
> > + *  - mnn->invalidate_seq & 1 == True
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is not allowed to change
> > + *
> > + * And partially excluded:
> > + *  - mm->active_invalidate_ranges != 0
> 
> I assume this implies mnn->invalidate_seq & 1 == False in this case? If so,
> let's say so. I'm probably getting that wrong, too.

Yes (mnn->invalidate_seq & 1) == 0

> 
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is allowed to change
> > + *
> > + * The later state avoids some expensive work on inv_end in the common 
> > case of
> > + * no mrn monitoring the VA.
> > + */
> > +static bool mn_itree_is_invalidating(struct 

Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-23 Thread Jerome Glisse
On Mon, Oct 21, 2019 at 07:06:00PM +, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 02:40:41PM -0400, Jerome Glisse wrote:
> > On Tue, Oct 15, 2019 at 03:12:27PM -0300, Jason Gunthorpe wrote:
> > > From: Jason Gunthorpe 
> > > 
> > > 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> > > scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
> > > they only use invalidate_range_start/end and immediately check the
> > > invalidating range against some driver data structure to tell if the
> > > driver is interested. Half of them use an interval_tree, the others are
> > > simple linear search lists.
> > > 
> > > Of the ones I checked they largely seem to have various kinds of races,
> > > bugs and poor implementation. This is a result of the complexity in how
> > > the notifier interacts with get_user_pages(). It is extremely difficult to
> > > use it correctly.
> > > 
> > > Consolidate all of this code together into the core mmu_notifier and
> > > provide a locking scheme similar to hmm_mirror that allows the user to
> > > safely use get_user_pages() and reliably know if the page list still
> > > matches the mm.
> > > 
> > > This new arrangment plays nicely with the !blockable mode for
> > > OOM. Scanning the interval tree is done such that the intersection test
> > > will always succeed, and since there is no invalidate_range_end exposed to
> > > drivers the scheme safely allows multiple drivers to be subscribed.
> > > 
> > > Four places are converted as an example of how the new API is used.
> > > Four are left for future patches:
> > >  - i915_gem has complex locking around destruction of a registration,
> > >needs more study
> > >  - hfi1 (2nd user) needs access to the rbtree
> > >  - scif_dma has a complicated logic flow
> > >  - vhost's mmu notifiers are already being rewritten
> > > 
> > > This is still being tested, but I figured to send it to start getting help
> > > from the xen, amd and hfi drivers which I cannot test here.
> > 
> > It might be a good oportunity to also switch those users to
> > hmm_range_fault() instead of GUP as GUP is pointless for those
> > users. In fact the GUP is an impediment to normal mm operations.
> 
> I think vhost can use hmm_range_fault
> 
> hfi1 does actually need to have the page pin, it doesn't fence DMA
> during invalidate.
> 
> i915_gem feels alot like amdgpu, so probably it would benefit
> 
> No idea about scif_dma
> 
> > I will test on nouveau.
> 
> Thanks, hopefully it still works, I think Ralph was able to do some
> basic checks. But it is a pretty complicated series, I probably made
> some mistakes.

So it seems to work OK with nouveau; I will let the tests run in a loop, though
they are not very advanced tests.

> 
> FWIW, I know that nouveau gets a lockdep splat now from Daniel
> Vetter's recent changes, it tries to do GFP_KERENEL allocations under
> a lock also held by the invalidate_range_start path.

I have not seen any splat so far; is it through some new kernel config ?

Cheers,
Jérôme


Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-23 Thread Jerome Glisse
On Wed, Oct 23, 2019 at 11:32:16AM +0200, Christian König wrote:
> Am 23.10.19 um 11:08 schrieb Daniel Vetter:
> > On Tue, Oct 22, 2019 at 03:01:13PM +, Jason Gunthorpe wrote:
> > > On Tue, Oct 22, 2019 at 09:57:35AM +0200, Daniel Vetter wrote:
> > > 
> > > > > The unusual bit in all of this is using a lock's critical region to
> > > > > 'protect' data for read, but updating that same data before the lock's
> > > > > critical secion. ie relying on the unlock barrier to 'release' program
> > > > > ordered stores done before the lock's own critical region, and the
> > > > > lock side barrier to 'acquire' those stores.
> > > > I think this unusual use of locks as barriers for other unlocked 
> > > > accesses
> > > > deserves comments even more than just normal barriers. Can you pls add
> > > > them? I think the design seeems sound ...
> > > > 
> > > > Also the comment on the driver's lock hopefully prevents driver
> > > > maintainers from moving the driver_lock around in a way that would very
> > > > subtle break the scheme, so I think having the acquire barrier commented
> > > > in each place would be really good.
> > > There is already a lot of documentation, I think it would be helpful
> > > if you could suggest some specific places where you think an addition
> > > would help? I think the perspective of someone less familiar with this
> > > design would really improve the documentation
> > Hm I just meant the usual recommendation that "barriers must have comments
> > explaining what they order, and where the other side of the barrier is".
> > Using unlock/lock as a barrier imo just makes that an even better idea.
> > Usually what I do is something like "we need to order $this against $that
> > below, and the other side of this barrier is in function()." With maybe a
> > bit more if it's not obvious how things go wrong if the orderin is broken.
> > 
> > Ofc seqlock.h itself skimps on that rule and doesn't bother explaining its
> > barriers :-/
> > 
> > > I've been tempted to force the driver to store the seq number directly
> > > under the driver lock - this makes the scheme much clearer, ie
> > > something like this:
> > > 
> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
> > > b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > index 712c99918551bc..738fa670dcfb19 100644
> > > --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > @@ -488,7 +488,8 @@ struct svm_notifier {
> > >   };
> > >   static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
> > > -const struct mmu_notifier_range 
> > > *range)
> > > +const struct mmu_notifier_range 
> > > *range,
> > > +unsigned long seq)
> > >   {
> > >  struct svm_notifier *sn =
> > >  container_of(mrn, struct svm_notifier, notifier);
> > > @@ -504,6 +505,7 @@ static bool nouveau_svm_range_invalidate(struct 
> > > mmu_range_notifier *mrn,
> > >  mutex_lock(&sn->svmm->mutex);
> > >  else if (!mutex_trylock(&sn->svmm->mutex))
> > >  return false;
> > > +   mmu_range_notifier_update_seq(mrn, seq);
> > >  mutex_unlock(&sn->svmm->mutex);
> > >  return true;
> > >   }
> > > 
> > > 
> > > At the cost of making the driver a bit more complex, what do you
> > > think?
> > Hm, spinning this further ... could we initialize the mmu range notifier
> > with a pointer to the driver lock, so that we could put a
> > lockdep_assert_held into mmu_range_notifier_update_seq? I think that would
> > make this scheme substantially more driver-hacker proof :-)
> 
> Going another step further what hinders us to put the lock into the mmu
> range notifier itself and have _lock()/_unlock() helpers?
> 
> I mean having the lock in the driver only makes sense when the driver would
> be using the same lock for multiple things, e.g. multiple MMU range
> notifiers under the same lock. But I really don't see that use case here.

I actually do: nouveau uses one lock to protect the page table and that's the
lock that matters. You can have multiple ranges for a single page table, the
idea being that only a sub-set of the process address space is ever accessed by
the GPU, and thus it is better to focus on this sub-set and track invalidation
at a finer grain.
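
Concretely, something along these lines (a sketch only, with made-up mydrv_*
names, following the variant with the seq argument quoted above):

    /* One mirror, one lock protecting the device page table, many ranges. */
    struct mydrv_mirror {
            struct mutex mutex;             /* protects the device page table */
            /* ... device page table state ... */
    };

    struct mydrv_range {
            struct mmu_range_notifier notifier;
            struct mydrv_mirror *mirror;    /* shared by all ranges */
    };

    static bool mydrv_range_invalidate(struct mmu_range_notifier *mrn,
                                       const struct mmu_notifier_range *range,
                                       unsigned long cur_seq)
    {
            struct mydrv_range *r = container_of(mrn, struct mydrv_range, notifier);

            if (mmu_notifier_range_blockable(range))
                    mutex_lock(&r->mirror->mutex);
            else if (!mutex_trylock(&r->mirror->mutex))
                    return false;
            mmu_range_notifier_update_seq(mrn, cur_seq);
            /* zap only the device mappings covered by this range ... */
            mutex_unlock(&r->mirror->mutex);
            return true;
    }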

Cheers,
Jérôme


Re: [PATCH hmm 02/15] mm/mmu_notifier: add an interval tree notifier

2019-10-21 Thread Jerome Glisse
On Mon, Oct 21, 2019 at 07:24:53PM +, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 03:11:57PM -0400, Jerome Glisse wrote:
> > > Since that reader is not locked we need release semantics here to
> > > ensure the unlocked reader sees a fully initinalized mmu_notifier_mm
> > > structure when it observes the pointer.
> > 
> > I thought the mm_take_all_locks() would have had a barrier and thus
> > that you could not see mmu_notifier struct partialy initialized. 
> 
> Not sure, usually a lock acquire doesn't have a store barrier?

Yeah likely.

> 
> Even if it did, we would still need some pairing read barrier..
> 
> > having the acquire/release as safety net does not hurt. Maybe add a
> > comment about the struct initialization needing to be visible before
> > pointer is set.
> 
> Is this clear?
> 
>  * release semantics on the initialization of the
>  * mmu_notifier_mm's contents are provided for unlocked readers.
>* acquire can only be used while holding the
>  * mmgrab or mmget, and is safe because once created the
>  * mmu_notififer_mm is not freed until the mm is destroyed.
>  * Users holding the mmap_sem or one of the
>* mm_take_all_locks() do not need to use acquire semantics.
> 
> It also helps explain why there is no locking around the other
> readers, which has puzzled me in the past at least.

Perfect.

Jérôme


Re: [PATCH hmm 15/15] mm/hmm: remove hmm_mirror and related

2019-10-21 Thread Jerome Glisse
On Mon, Oct 21, 2019 at 06:57:42PM +, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 02:38:24PM -0400, Jerome Glisse wrote:
> > On Tue, Oct 15, 2019 at 03:12:42PM -0300, Jason Gunthorpe wrote:
> > > From: Jason Gunthorpe 
> > > 
> > > The only two users of this are now converted to use mmu_range_notifier,
> > > delete all the code and update hmm.rst.
> > 
> > I guess i should point out that the reasons for hmm_mirror and hmm
> > was for:
> > 1) Maybe define a common API for userspace to provide memory
> >placement hints (NUMA for GPU)
> 
> Do you think this needs special code in the notifiers?

I just need a place to hang userspace policy hints off of; the hmm_range
was the prime suspect. I need to revisit this once the nouveau user
space is in better shape.

> 
> > 2) multi-devices sharing same mirror page table
> 
> Oh neat, but I think this just means the GPU driver has to register a
> single notifier for multiple GPUs??

Yes, that was the idea: a single notifier with a shared page table. But
at this time this is non-existent code, so no need to hinder the change
just for the sake of it.

> 
> > But support for multi-GPU in nouveau is way behind and i guess such
> > optimization will have to re-materialize what is necessary once that
> > happens.
> 
> Sure, it will be easier to understand what is needed with a bit of
> code!
> 
> > Note this patch should also update kernel/fork.c and the mm_struct
> > definition AFAICT. With those changes you can add my:
> 
> Can you please elaborate what updates you mean? I'm not sure. 
> 
> Maybe I already got the things you are thinking of with the get/put
> changes?

Oh, I forgot this was already taken care of by this patch. So yes, all is
fine:

Reviewed-by: Jérôme Glisse 


Re: [PATCH hmm 02/15] mm/mmu_notifier: add an interval tree notifier

2019-10-21 Thread Jerome Glisse
On Mon, Oct 21, 2019 at 06:54:25PM +, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 02:30:56PM -0400, Jerome Glisse wrote:
> 
> > > +/**
> > > + * mmu_range_read_retry - End a read side critical section against a VA 
> > > range
> > > + * mrn: The range under lock
> > > + * seq: The return of the paired mmu_range_read_begin()
> > > + *
> > > + * This MUST be called under a user provided lock that is also held
> > > + * unconditionally by op->invalidate(). That lock provides the required 
> > > SMP
> > > + * barrier for handling invalidate_seq.
> > > + *
> > > + * Each call should be paired with a single mmu_range_read_begin() and
> > > + * should be used to conclude the read side.
> > > + *
> > > + * Returns true if an invalidation collided with this critical section, 
> > > and
> > > + * the caller should retry.
> > > + */
> > > +static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
> > > + unsigned long seq)
> > > +{
> > > + return READ_ONCE(mrn->invalidate_seq) != seq;
> > > +}
> > 
> > What about calling this mmu_range_read_end() instead ? To match
> > with the mmu_range_read_begin().
> 
> _end make some sense too, but I picked _retry for symmetry with the
> seqcount_* family of functions which used retry.
> 
> I think retry makes it clearer that it is expected to fail and retry
> is required.

Fair enough.
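
For context, the read side these helpers serve looks roughly like this (a
sketch; driver_lock is the same lock taken unconditionally in op->invalidate(),
and the mydrv_* helpers are made up):

    again:
            seq = mmu_range_read_begin(mrn);
            /* fault or snapshot the pages (hmm_range_fault(), GUP, ...);
             * mmap_sem handling elided here */
            ret = mydrv_fault_or_snapshot(&range);
            if (ret)
                    return ret;

            mutex_lock(&driver_lock);
            if (mmu_range_read_retry(mrn, seq)) {
                    /* an invalidation collided, the result is stale */
                    mutex_unlock(&driver_lock);
                    goto again;
            }
            mydrv_program_device_ptes(&range);
            mutex_unlock(&driver_lock);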

> 
> > > + /*
> > > +  * The inv_end incorporates a deferred mechanism like rtnl. Adds and
> > 
> > The rtnl reference is lost on people unfamiliar with the network :)
> > code maybe like rtnl_lock()/rtnl_unlock() so people have a chance to
> > grep the right function. Assuming i am myself getting the right
> > reference :)
> 
> Yep, you got it, I will update
> 
> > > + /*
> > > +  * mrn->invalidate_seq is always set to an odd value. This ensures
> > > +  * that if seq does wrap we will always clear the below sleep in some
> > > +  * reasonable time as mmn_mm->invalidate_seq is even in the idle
> > > +  * state.
> > 
> > I think this comment should be with the struct mmu_range_notifier
> > definition and you should just point to it from here as the same
> > comment would be useful down below.
> 
> I had it here because it is critical to understanding the wait_event
> and why it doesn't just block indefinitely, but yes this property
> comes up below too which refers back here.
> 
> Fundamentally this wait event is why this approach to keep an odd
> value in the mrn is used.

The comment is fine; it is just that I read the patch out of order, and in the
insert function I was pondering why it must be odd while the explanation was
here. It is more a matter of taste: I prefer comments about this to be part of
the struct definition comments, so that multiple places can refer to the same
struct definition. That is more resilient to code change, as the struct
definition is always easy to find, and thus references to it can be sprinkled
wherever they are necessary.


> 
> > > -int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range 
> > > *range)
> > > +static int mn_itree_invalidate(struct mmu_notifier_mm *mmn_mm,
> > > +  const struct mmu_notifier_range *range)
> > > +{
> > > + struct mmu_range_notifier *mrn;
> > > + unsigned long cur_seq;
> > > +
> > > + for (mrn = mn_itree_inv_start_range(mmn_mm, range, &cur_seq); mrn;
> > > +  mrn = mn_itree_inv_next(mrn, range)) {
> > > + bool ret;
> > > +
> > > + WRITE_ONCE(mrn->invalidate_seq, cur_seq);
> > > + ret = mrn->ops->invalidate(mrn, range);
> > > + if (!ret && !WARN_ON(mmu_notifier_range_blockable(range)))
> > 
> > Isn't the logic wrong here ? We want to warn if the range
> > was mark as blockable and invalidate returned false. Also
> > we went to backoff no matter what if the invalidate return
> > false ie:
> 
> If invalidate returned false and the caller is blockable then we do
> not want to return, we must continue processing other ranges - to try
> to cope with the defective driver.
> 
> Callers in blocking mode ignore the return value and go ahead to
> invalidate..
> 
> Would it be clearer as 
> 
> if (!ret) {
>if (WARN_ON(mmu_notifier_range_blockable(range)))
>continue;
>goto out_would_block;
> }
> 
> ?

Yes, looks clearer to me at least.

> 
> > > @@ -284,21 +589,22 @@ int __

Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:27PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
> they only use invalidate_range_start/end and immediately check the
> invalidating range against some driver data structure to tell if the
> driver is interested. Half of them use an interval_tree, the others are
> simple linear search lists.
> 
> Of the ones I checked they largely seem to have various kinds of races,
> bugs and poor implementation. This is a result of the complexity in how
> the notifier interacts with get_user_pages(). It is extremely difficult to
> use it correctly.
> 
> Consolidate all of this code together into the core mmu_notifier and
> provide a locking scheme similar to hmm_mirror that allows the user to
> safely use get_user_pages() and reliably know if the page list still
> matches the mm.
> 
> This new arrangment plays nicely with the !blockable mode for
> OOM. Scanning the interval tree is done such that the intersection test
> will always succeed, and since there is no invalidate_range_end exposed to
> drivers the scheme safely allows multiple drivers to be subscribed.
> 
> Four places are converted as an example of how the new API is used.
> Four are left for future patches:
>  - i915_gem has complex locking around destruction of a registration,
>needs more study
>  - hfi1 (2nd user) needs access to the rbtree
>  - scif_dma has a complicated logic flow
>  - vhost's mmu notifiers are already being rewritten
> 
> This is still being tested, but I figured to send it to start getting help
> from the xen, amd and hfi drivers which I cannot test here.

It might be a good opportunity to also switch those users to
hmm_range_fault() instead of GUP, as GUP is pointless for those
users. In fact GUP is an impediment to normal mm operations.

I will test on nouveau.

Cheers,
Jérôme


Re: [PATCH hmm 15/15] mm/hmm: remove hmm_mirror and related

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:42PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> The only two users of this are now converted to use mmu_range_notifier,
> delete all the code and update hmm.rst.

I guess I should point out that the reasons for hmm_mirror and hmm were:
1) Maybe define a common API for userspace to provide memory
   placement hints (NUMA for GPU)
2) Multiple devices sharing the same mirror page table

But support for multi-GPU in nouveau is way behind, and I guess such an
optimization will have to re-materialize what is necessary once that
happens.

Note this patch should also update kernel/fork.c and the mm_struct
definition AFAICT. With those changes you can add my:

Reviewed-by: Jérôme Glisse 

> 
> Signed-off-by: Jason Gunthorpe 
> ---
>  Documentation/vm/hmm.rst | 105 ---
>  include/linux/hmm.h  | 183 +
>  mm/Kconfig   |   1 -
>  mm/hmm.c | 284 +--
>  4 files changed, 33 insertions(+), 540 deletions(-)
> 
> diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
> index 0a5960beccf76d..a247643035c4e2 100644
> --- a/Documentation/vm/hmm.rst
> +++ b/Documentation/vm/hmm.rst
> @@ -147,49 +147,16 @@ Address space mirroring implementation and API
>  Address space mirroring's main objective is to allow duplication of a range 
> of
>  CPU page table into a device page table; HMM helps keep both synchronized. A
>  device driver that wants to mirror a process address space must start with 
> the
> -registration of an hmm_mirror struct::
> -
> - int hmm_mirror_register(struct hmm_mirror *mirror,
> - struct mm_struct *mm);
> -
> -The mirror struct has a set of callbacks that are used
> -to propagate CPU page tables::
> -
> - struct hmm_mirror_ops {
> - /* release() - release hmm_mirror
> -  *
> -  * @mirror: pointer to struct hmm_mirror
> -  *
> -  * This is called when the mm_struct is being released.  The callback
> -  * must ensure that all access to any pages obtained from this mirror
> -  * is halted before the callback returns. All future access should
> -  * fault.
> -  */
> - void (*release)(struct hmm_mirror *mirror);
> -
> - /* sync_cpu_device_pagetables() - synchronize page tables
> -  *
> -  * @mirror: pointer to struct hmm_mirror
> -  * @update: update information (see struct mmu_notifier_range)
> -  * Return: -EAGAIN if update.blockable false and callback need to
> -  * block, 0 otherwise.
> -  *
> -  * This callback ultimately originates from mmu_notifiers when the CPU
> -  * page table is updated. The device driver must update its page table
> -  * in response to this callback. The update argument tells what action
> -  * to perform.
> -  *
> -  * The device driver must not return from this callback until the device
> -  * page tables are completely updated (TLBs flushed, etc); this is a
> -  * synchronous call.
> -  */
> - int (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
> -   const struct hmm_update *update);
> - };
> -
> -The device driver must perform the update action to the range (mark range
> -read only, or fully unmap, etc.). The device must complete the update before
> -the driver callback returns.
> +registration of a mmu_range_notifier::
> +
> + mrn->ops = &driver_ops;
> + int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> +   unsigned long start, unsigned long length,
> +   struct mm_struct *mm);
> +
> +During the driver_ops->invalidate() callback the device driver must perform
> +the update action to the range (mark range read only, or fully unmap,
> +etc.). The device must complete the update before the driver callback 
> returns.
>  
>  When the device driver wants to populate a range of virtual addresses, it can
>  use::
> @@ -216,70 +183,46 @@ The usage pattern is::
>struct hmm_range range;
>...
>  
> +  range.notifier = &mrn;
>range.start = ...;
>range.end = ...;
>range.pfns = ...;
>range.flags = ...;
>range.values = ...;
>range.pfn_shift = ...;
> -  hmm_range_register(, mirror);
>  
> -  /*
> -   * Just wait for range to be valid, safe to ignore return value as we
> -   * will use the return value of hmm_range_fault() below under the
> -   * mmap_sem to ascertain the validity of the range.
> -   */
> -  hmm_range_wait_until_valid(, TIMEOUT_IN_MSEC);
> +  if (!mmget_not_zero(mrn->notifier.mm))
> +  return -EFAULT;
>  
>   again:
> +  range.notifier_seq = mmu_range_read_begin(&mrn);
>down_read(&mm->mmap_sem);
>ret = hmm_range_fault(&range, HMM_RANGE_SNAPSHOT);
>if (ret) {
>up_read(&mm->mmap_sem);
> -  if (ret == -EBUSY) {
> -/*
> - * No need to 

Re: [PATCH hmm 03/15] mm/hmm: allow hmm_range to be used with a mmu_range_notifier or hmm_mirror

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:30PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> hmm_mirror's handling of ranges does not use a sequence count which
> results in this bug:
> 
>  CPU0   CPU1
>  hmm_range_wait_until_valid(range)
>  valid == true
>  hmm_range_fault(range)
> hmm_invalidate_range_start()
>range->valid = false
> hmm_invalidate_range_end()
>range->valid = true
>  hmm_range_valid(range)
>   valid == true
> 
> Where the hmm_range_valid should not have succeeded.
> 
> Adding the required sequence count would make it nearly identical to the
> new mmu_range_notifier. Instead replace the hmm_mirror stuff with
> mmu_range_notifier.
> 
> Co-existence of the two APIs is the first step.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/hmm.h |  5 +
>  mm/hmm.c| 25 +++--
>  2 files changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 3fec513b9c00f1..8ac1fd6a81af8f 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -145,6 +145,9 @@ enum hmm_pfn_value_e {
>  /*
>   * struct hmm_range - track invalidation lock on virtual address range
>   *
> + * @notifier: an optional mmu_range_notifier
> + * @notifier_seq: when notifier is used this is the result of
> + *mmu_range_read_begin()
>   * @hmm: the core HMM structure this range is active against
>   * @vma: the vm area struct for the range
>   * @list: all range lock are on a list
> @@ -159,6 +162,8 @@ enum hmm_pfn_value_e {
>   * @valid: pfns array did not change since it has been fill by an HMM 
> function
>   */
>  struct hmm_range {
> + struct mmu_range_notifier *notifier;
> + unsigned long   notifier_seq;
>   struct hmm  *hmm;
>   struct list_headlist;
>   unsigned long   start;
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 902f5fa6bf93ad..22ac3595771feb 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -852,6 +852,14 @@ void hmm_range_unregister(struct hmm_range *range)
>  }
>  EXPORT_SYMBOL(hmm_range_unregister);
>  
> +static bool needs_retry(struct hmm_range *range)
> +{
> + if (range->notifier)
> + return mmu_range_check_retry(range->notifier,
> +  range->notifier_seq);
> + return !range->valid;
> +}
> +
>  static const struct mm_walk_ops hmm_walk_ops = {
>   .pud_entry  = hmm_vma_walk_pud,
>   .pmd_entry  = hmm_vma_walk_pmd,
> @@ -892,18 +900,23 @@ long hmm_range_fault(struct hmm_range *range, unsigned 
> int flags)
>   const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
>   unsigned long start = range->start, end;
>   struct hmm_vma_walk hmm_vma_walk;
> - struct hmm *hmm = range->hmm;
> + struct mm_struct *mm;
>   struct vm_area_struct *vma;
>   int ret;
>  
> - lockdep_assert_held(&hmm->mmu_notifier.mm->mmap_sem);
> + if (range->notifier)
> + mm = range->notifier->mm;
> + else
> + mm = range->hmm->mmu_notifier.mm;
> +
> + lockdep_assert_held(&mm->mmap_sem);
>  
>   do {
>   /* If range is no longer valid force retry. */
> - if (!range->valid)
> + if (needs_retry(range))
>   return -EBUSY;
>  
> - vma = find_vma(hmm->mmu_notifier.mm, start);
> + vma = find_vma(mm, start);
>   if (vma == NULL || (vma->vm_flags & device_vma))
>   return -EFAULT;
>  
> @@ -933,7 +946,7 @@ long hmm_range_fault(struct hmm_range *range, unsigned 
> int flags)
>   start = hmm_vma_walk.last;
>  
>   /* Keep trying while the range is valid. */
> - } while (ret == -EBUSY && range->valid);
> + } while (ret == -EBUSY && !needs_retry(range));
>  
>   if (ret) {
>   unsigned long i;
> @@ -991,7 +1004,7 @@ long hmm_range_dma_map(struct hmm_range *range, struct 
> device *device,
>   continue;
>  
>   /* Check if range is being invalidated */
> - if (!range->valid) {
> + if (needs_retry(range)) {
>   ret = -EBUSY;
>   goto unmap;
>   }
> -- 
> 2.23.0
> 


Re: [PATCH hmm 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:28PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> Now that we have KERNEL_HEADER_TEST all headers are generally compile
> tested, so relying on makefile tricks to avoid compiling code that depends
> on CONFIG_MMU_NOTIFIER is more annoying.
> 
> Instead follow the usual pattern and provide most of the header with only
> the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
> ensures code compiles no matter what the config setting is.
> 
> While here, struct mmu_notifier_mm is private to mmu_notifier.c, move it.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/mmu_notifier.h | 46 +---
>  mm/mmu_notifier.c| 13 ++
>  2 files changed, 30 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 1bd8e6a09a3c27..12bd603d318ce7 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -7,8 +7,9 @@
>  #include 
>  #include 
>  
> +struct mmu_notifier_mm;
>  struct mmu_notifier;
> -struct mmu_notifier_ops;
> +struct mmu_notifier_range;
>  
>  /**
>   * enum mmu_notifier_event - reason for the mmu notifier callback
> @@ -40,36 +41,8 @@ enum mmu_notifier_event {
>   MMU_NOTIFY_SOFT_DIRTY,
>  };
>  
> -#ifdef CONFIG_MMU_NOTIFIER
> -
> -#ifdef CONFIG_LOCKDEP
> -extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
> -#endif
> -
> -/*
> - * The mmu notifier_mm structure is allocated and installed in
> - * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
> - * critical section and it's released only when mm_count reaches zero
> - * in mmdrop().
> - */
> -struct mmu_notifier_mm {
> - /* all mmu notifiers registerd in this mm are queued in this list */
> - struct hlist_head list;
> - /* to serialize the list modifications and hlist_unhashed */
> - spinlock_t lock;
> -};
> -
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>  
> -struct mmu_notifier_range {
> - struct vm_area_struct *vma;
> - struct mm_struct *mm;
> - unsigned long start;
> - unsigned long end;
> - unsigned flags;
> - enum mmu_notifier_event event;
> -};
> -
>  struct mmu_notifier_ops {
>   /*
>* Called either by mmu_notifier_unregister or when the mm is
> @@ -249,6 +222,21 @@ struct mmu_notifier {
>   unsigned int users;
>  };
>  
> +#ifdef CONFIG_MMU_NOTIFIER
> +
> +#ifdef CONFIG_LOCKDEP
> +extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
> +#endif
> +
> +struct mmu_notifier_range {
> + struct vm_area_struct *vma;
> + struct mm_struct *mm;
> + unsigned long start;
> + unsigned long end;
> + unsigned flags;
> + enum mmu_notifier_event event;
> +};
> +
>  static inline int mm_has_notifiers(struct mm_struct *mm)
>  {
>   return unlikely(mm->mmu_notifier_mm);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 7fde88695f35d6..367670cfd02b7b 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -27,6 +27,19 @@ struct lockdep_map 
> __mmu_notifier_invalidate_range_start_map = {
>  };
>  #endif
>  
> +/*
> + * The mmu notifier_mm structure is allocated and installed in
> + * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
> + * critical section and it's released only when mm_count reaches zero
> + * in mmdrop().
> + */
> +struct mmu_notifier_mm {
> + /* all mmu notifiers registered in this mm are queued in this list */
> + struct hlist_head list;
> + /* to serialize the list modifications and hlist_unhashed */
> + spinlock_t lock;
> +};
> +
>  /*
>   * This function can't run concurrently against mmu_notifier_register
>   * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> -- 
> 2.23.0
> 


Re: [PATCH hmm 02/15] mm/mmu_notifier: add an interval tree notifier

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:29PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> Of the 13 users of mmu_notifiers, 8 of them use only
> invalidate_range_start/end() and immediately intersect the
> mmu_notifier_range with some kind of internal list of VAs.  4 use an
> interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
> of some kind (scif_dma, vhost, gntdev, hmm)
> 
> And the remaining 5 either don't use invalidate_range_start() or do some
> special thing with it.
> 
> It turns out that building a correct scheme with an interval tree is
> pretty complicated, particularly if the use case is synchronizing against
> another thread doing get_user_pages().  Many of these implementations have
> various subtle and difficult to fix races.
> 
> This approach puts the interval tree as common code at the top of the mmu
> notifier call tree and implements a shareable locking scheme.
> 
> It includes:
>  - An interval tree tracking VA ranges, with per-range callbacks
>  - A read/write locking scheme for the interval tree that avoids
>sleeping in the notifier path (for OOM killer)
>  - A sequence counter based collision-retry locking scheme to tell
>device page fault that a VA range is being concurrently invalidated.
> 
> This is based on various ideas:
> - hmm accumulates invalidated VA ranges and releases them when all
>   invalidates are done, via active_invalidate_ranges count.
>   This approach avoids having to intersect the interval tree twice (as
>   umem_odp does) at the potential cost of a longer device page fault.
> 
> - kvm/umem_odp use a sequence counter to drive the collision retry,
>   via invalidate_seq
> 
> - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
>   This makes adding/removing interval tree members more deterministic
> 
> - seqlock, except this version makes the seqlock idea multi-holder on the
>   write side by protecting it with active_invalidate_ranges and a spinlock
> 
> To minimize MM overhead when only the interval tree is being used, the
> entire SRCU and hlist overheads are dropped using some simple
> branches. Similarly the interval tree overhead is dropped when in hlist
> mode.
> 
> The overhead from the mandatory spinlock is broadly the same as most of
> existing users which already had a lock (or two) of some sort on the
> invalidation path.
> 
> Cc: Andrea Arcangeli 
> Cc: Michal Hocko 
> Signed-off-by: Jason Gunthorpe 
> ---
>  include/linux/mmu_notifier.h |  78 ++
>  mm/Kconfig   |   1 +
>  mm/mmu_notifier.c| 529 +--
>  3 files changed, 583 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 12bd603d318ce7..bc2b12483de127 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -6,10 +6,12 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct mmu_notifier_mm;
>  struct mmu_notifier;
>  struct mmu_notifier_range;
> +struct mmu_range_notifier;
>  
>  /**
>   * enum mmu_notifier_event - reason for the mmu notifier callback
> @@ -32,6 +34,9 @@ struct mmu_notifier_range;
>   * access flags). User should soft dirty the page in the end callback to make
>   * sure that anyone relying on soft dirtyness catch pages that might be 
> written
>   * through non CPU mappings.
> + *
> + * @MMU_NOTIFY_RELEASE: used during mmu_range_notifier invalidate to signal 
> that
> + * the mm refcount is zero and the range is no longer accessible.
>   */
>  enum mmu_notifier_event {
>   MMU_NOTIFY_UNMAP = 0,
> @@ -39,6 +44,7 @@ enum mmu_notifier_event {
>   MMU_NOTIFY_PROTECTION_VMA,
>   MMU_NOTIFY_PROTECTION_PAGE,
>   MMU_NOTIFY_SOFT_DIRTY,
> + MMU_NOTIFY_RELEASE,
>  };
>  
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> @@ -222,6 +228,25 @@ struct mmu_notifier {
>   unsigned int users;
>  };
>  
> +/**
> + * struct mmu_range_notifier_ops
> + * @invalidate: Upon return the caller must stop using any SPTEs within this
> + *  range, this function can sleep. Return false if blocking was
> + *  required but range is non-blocking
> + */
> +struct mmu_range_notifier_ops {
> + bool (*invalidate)(struct mmu_range_notifier *mrn,
> +const struct mmu_notifier_range *range);
> +};
> +
> +struct mmu_range_notifier {
> + struct interval_tree_node interval_tree;
> + const struct mmu_range_notifier_ops *ops;
> + struct hlist_node deferred_item;
> + unsigned long invalidate_seq;
> + struct mm_struct *mm;
> +};
> +
>  #ifdef CONFIG_MMU_NOTIFIER
>  
>  #ifdef CONFIG_LOCKDEP
> @@ -263,6 +288,59 @@ extern int __mmu_notifier_register(struct mmu_notifier 
> *mn,
>  struct mm_struct *mm);
>  extern void mmu_notifier_unregister(struct mmu_notifier *mn,
>   struct mm_struct *mm);
> +
> +unsigned long mmu_range_read_begin(struct 

Re: [PATCH hmm 04/15] mm/hmm: define the pre-processor related parts of hmm.h even if disabled

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:31PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> Only the function calls are stubbed out with static inlines that always
> fail. This is the standard way to write a header for an optional component
> and makes it easier for drivers that only optionally need HMM_MIRROR.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/hmm.h | 59 -
>  kernel/fork.c   |  1 -
>  2 files changed, 47 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 8ac1fd6a81af8f..2666eb08a40615 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -62,8 +62,6 @@
>  #include 
>  #include 
>  
> -#ifdef CONFIG_HMM_MIRROR
> -
>  #include 
>  #include 
>  #include 
> @@ -374,6 +372,15 @@ struct hmm_mirror {
>   struct list_headlist;
>  };
>  
> +/*
> + * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that 
> case.
> + */
> +#define HMM_FAULT_ALLOW_RETRY(1 << 0)
> +
> +/* Don't fault in missing PTEs, just snapshot the current state. */
> +#define HMM_FAULT_SNAPSHOT   (1 << 1)
> +
> +#ifdef CONFIG_HMM_MIRROR
>  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
>  void hmm_mirror_unregister(struct hmm_mirror *mirror);
>  
> @@ -383,14 +390,6 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
>  int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror);
>  void hmm_range_unregister(struct hmm_range *range);
>  
> -/*
> - * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that 
> case.
> - */
> -#define HMM_FAULT_ALLOW_RETRY(1 << 0)
> -
> -/* Don't fault in missing PTEs, just snapshot the current state. */
> -#define HMM_FAULT_SNAPSHOT   (1 << 1)
> -
>  long hmm_range_fault(struct hmm_range *range, unsigned int flags);
>  
>  long hmm_range_dma_map(struct hmm_range *range,
> @@ -401,6 +400,44 @@ long hmm_range_dma_unmap(struct hmm_range *range,
>struct device *device,
>dma_addr_t *daddrs,
>bool dirty);
> +#else
> +int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +void hmm_mirror_unregister(struct hmm_mirror *mirror)
> +{
> +}
> +
> +int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +void hmm_range_unregister(struct hmm_range *range)
> +{
> +}
> +
> +static inline long hmm_range_fault(struct hmm_range *range, unsigned int 
> flags)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static inline long hmm_range_dma_map(struct hmm_range *range,
> +  struct device *device, dma_addr_t *daddrs,
> +  unsigned int flags)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static inline long hmm_range_dma_unmap(struct hmm_range *range,
> +struct device *device,
> +dma_addr_t *daddrs, bool dirty)
> +{
> + return -EOPNOTSUPP;
> +}
> +#endif
>  
>  /*
>   * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
> @@ -411,6 +448,4 @@ long hmm_range_dma_unmap(struct hmm_range *range,
>   */
>  #define HMM_RANGE_DEFAULT_TIMEOUT 1000
>  
> -#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
> -
>  #endif /* LINUX_HMM_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f9572f41612628..4561a65d19db88 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -40,7 +40,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> -- 
> 2.23.0
> 


Re: [PATCH] mm/hmm: hmm_range_fault handle pages swapped out

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 08:52:56PM +, Yang, Philip wrote:
> hmm_range_fault may return NULL pages because some of pfns are equal to
> HMM_PFN_NONE. This happens randomly under memory pressure. The reason is
> for swapped out page pte path, hmm_vma_handle_pte doesn't update fault
> variable from cpu_flags, so it failed to call hmm_vam_do_fault to swap
> the page in.
> 
> The fix is to call hmm_pte_need_fault to update fault variable.
> 
> Change-Id: I2e8611485563d11d938881c18b7935fa1e7c91ee
> Signed-off-by: Philip Yang 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/hmm.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 9f22562e2c43..7ca4fb39d3d8 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -544,6 +544,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, 
> unsigned long addr,
>   swp_entry_t entry = pte_to_swp_entry(pte);
>  
>   if (!non_swap_entry(entry)) {
> + cpu_flags = pte_to_hmm_pfn_flags(range, pte);
> + hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
> +, _fault);
>   if (fault || write_fault)
>   goto fault;
>   return 0;
> -- 
> 2.17.1
> 

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 08:41:33PM +, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 04:33:06PM -0400, Jerome Glisse wrote:
> 
> > So nor HMM nor driver should dereference the struct page (i do not
> > think any iommu driver would either),
> 
> Er, they do technically deref the struct page:
> 
> nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
>struct hmm_range *range)
>   struct page *page;
>   page = hmm_pfn_to_page(range, range->pfns[i]);
>   if (!nouveau_dmem_page(drm, page)) {
> 
> 
> nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
> {
>   return is_device_private_page(page) && drm->dmem == page_to_dmem(page)
> 
> 
> Which does touch 'page->pgmap'
> 
> Is this OK without having a get_dev_pagemap() ?
>
> Noting that the collision-retry scheme doesn't protect anything here
> as we can have a concurrent invalidation while doing the above deref.

Uh? How so? We are not reading the same code, I think.

My read is that this function is called while holding the device
lock, which excludes any racing mmu notifier from making forward
progress, and it is also protected by the range, so at the time
this happens it is safe to dereference the struct page. In any
case, we can update nouveau_dmem_page() to check that
page->pgmap == the expected pgmap.
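
Roughly (a sketch; the exact name of the pagemap member inside nouveau_dmem is
an assumption here):

    static inline bool
    nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
    {
            /* Also check the page's pgmap really is ours before touching any
             * driver private state hanging off it. */
            return is_device_private_page(page) &&
                   page->pgmap == &drm->dmem->pagemap;
    }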

Cheers,
Jérôme

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 01:12:22PM -0700, Dan Williams wrote:
> On Thu, Aug 15, 2019 at 12:44 PM Jerome Glisse  wrote:
> >
> > On Thu, Aug 15, 2019 at 12:36:58PM -0700, Dan Williams wrote:
> > > On Thu, Aug 15, 2019 at 11:07 AM Jerome Glisse  wrote:
> > > >
> > > > On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> > > > > On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > > > > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > > > > > Section alignment constraints somewhat save us here. The only 
> > > > > > > > example
> > > > > > > > I can think of a PMD not containing a uniform pgmap association 
> > > > > > > > for
> > > > > > > > each pte is the case when the pgmap overlaps normal dram, i.e. 
> > > > > > > > shares
> > > > > > > > the same 'struct memory_section' for a given span. Otherwise, 
> > > > > > > > distinct
> > > > > > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > > > > > subsections as of v5.3). Otherwise the implementation could not
> > > > > > > > guarantee different mapping lifetimes.
> > > > > > > >
> > > > > > > > That said, this seems to want a better mechanism to determine 
> > > > > > > > "pfn is
> > > > > > > > ZONE_DEVICE".
> > > > > > >
> > > > > > > So I guess this patch is fine for now, and once you provide a 
> > > > > > > better
> > > > > > > mechanism we can switch over to it?
> > > > > >
> > > > > > What about the version I sent to just get rid of all the strange
> > > > > > put_dev_pagemaps while scanning? Odds are good we will work with 
> > > > > > only
> > > > > > a single pagemap, so it makes some sense to cache it once we find 
> > > > > > it?
> > > > >
> > > > > Yes, if the scan is over a single pmd then caching it makes sense.
> > > >
> > > > Quite frankly an easier an better solution is to remove the pagemap
> > > > lookup as HMM user abide by mmu notifier it means we will not make
> > > > use or dereference the struct page so that we are safe from any
> > > > racing hotunplug of dax memory (as long as device driver using hmm
> > > > do not have a bug).
> > >
> > > Yes, as long as the driver remove is synchronized against HMM
> > > operations via another mechanism then there is no need to take pagemap
> > > references. Can you briefly describe what that other mechanism is?
> >
> > So if you hotunplug some dax memory i assume that this can only
> > happens once all the pages are unmapped (as it must have the
> > zero refcount, well 1 because of the bias) and any unmap will
> > trigger a mmu notifier callback. User of hmm mirror abiding by
> > the API will never make use of information they get through the
> > fault or snapshot function until checking for racing notifier
> > under lock.
> 
> Hmm that first assumption is not guaranteed by the dev_pagemap core.
> The dev_pagemap end of life model is "disable, invalidate, drain" so
> it's possible to call devm_munmap_pages() while pages are still mapped
> it just won't complete the teardown of the pagemap until the last
> reference is dropped. New references are blocked during this teardown.
> 
> However, if the driver is validating the liveness of the mapping in
> the mmu-notifier path and blocking new references it sounds like it
> should be ok. Might there be GPU driver unit tests that cover this
> racing teardown case?

So neither HMM nor the driver should dereference the struct page (I do not
think any iommu driver would either); they only care about the pfn.
So even if we race with a teardown, as soon as we get the mmu notifier
callback to invalidate the mapping we will do so. The pattern is:

mydriver_populate_vaddr_range(start, end) {
    hmm_range_register(range, start, end)
again:
    ret = hmm_range_fault(start, end)
    if (ret < 0)
        return ret;

    take_driver_page_table_lock();
    if (range.valid) {
        populate_device_page_table();
        release_driver_page_table

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 12:36:58PM -0700, Dan Williams wrote:
> On Thu, Aug 15, 2019 at 11:07 AM Jerome Glisse  wrote:
> >
> > On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> > > On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  wrote:
> > > >
> > > > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > > > Section alignment constraints somewhat save us here. The only 
> > > > > > example
> > > > > > I can think of a PMD not containing a uniform pgmap association for
> > > > > > each pte is the case when the pgmap overlaps normal dram, i.e. 
> > > > > > shares
> > > > > > the same 'struct memory_section' for a given span. Otherwise, 
> > > > > > distinct
> > > > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > > > subsections as of v5.3). Otherwise the implementation could not
> > > > > > guarantee different mapping lifetimes.
> > > > > >
> > > > > > That said, this seems to want a better mechanism to determine "pfn 
> > > > > > is
> > > > > > ZONE_DEVICE".
> > > > >
> > > > > So I guess this patch is fine for now, and once you provide a better
> > > > > mechanism we can switch over to it?
> > > >
> > > > What about the version I sent to just get rid of all the strange
> > > > put_dev_pagemaps while scanning? Odds are good we will work with only
> > > > a single pagemap, so it makes some sense to cache it once we find it?
> > >
> > > Yes, if the scan is over a single pmd then caching it makes sense.
> >
> > Quite frankly an easier an better solution is to remove the pagemap
> > lookup as HMM user abide by mmu notifier it means we will not make
> > use or dereference the struct page so that we are safe from any
> > racing hotunplug of dax memory (as long as device driver using hmm
> > do not have a bug).
> 
> Yes, as long as the driver remove is synchronized against HMM
> operations via another mechanism then there is no need to take pagemap
> references. Can you briefly describe what that other mechanism is?

So if you hotunplug some dax memory, I assume that this can only
happen once all the pages are unmapped (as they must have a zero
refcount, well 1 because of the bias), and any unmap will trigger
a mmu notifier callback. A user of hmm mirror abiding by the API
will never make use of information it gets through the fault or
snapshot functions until checking for a racing notifier under
lock.

Cheers,
Jérôme

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  wrote:
> >
> > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > Section alignment constraints somewhat save us here. The only example
> > > > I can think of a PMD not containing a uniform pgmap association for
> > > > each pte is the case when the pgmap overlaps normal dram, i.e. shares
> > > > the same 'struct memory_section' for a given span. Otherwise, distinct
> > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > subsections as of v5.3). Otherwise the implementation could not
> > > > guarantee different mapping lifetimes.
> > > >
> > > > That said, this seems to want a better mechanism to determine "pfn is
> > > > ZONE_DEVICE".
> > >
> > > So I guess this patch is fine for now, and once you provide a better
> > > mechanism we can switch over to it?
> >
> > What about the version I sent to just get rid of all the strange
> > put_dev_pagemaps while scanning? Odds are good we will work with only
> > a single pagemap, so it makes some sense to cache it once we find it?
> 
> Yes, if the scan is over a single pmd then caching it makes sense.

Quite frankly, an easier and better solution is to remove the pagemap
lookup. HMM users abide by mmu notifiers, which means we will not use
or dereference the struct page, so we are safe from any racing
hot-unplug of DAX memory (as long as the device driver using HMM does
not have a bug).

Cheers,
Jérôme

Re: [Nouveau] [PATCH] nouveau/hmm: map pages after migration

2019-08-15 Thread Jerome Glisse
On Tue, Aug 13, 2019 at 05:58:52PM -0400, Jerome Glisse wrote:
> On Wed, Aug 07, 2019 at 08:02:14AM -0700, Ralph Campbell wrote:
> > When memory is migrated to the GPU it is likely to be accessed by GPU
> > code soon afterwards. Instead of waiting for a GPU fault, map the
> > migrated memory into the GPU page tables with the same access permissions
> > as the source CPU page table entries. This preserves copy on write
> > semantics.
> > 
> > Signed-off-by: Ralph Campbell 
> > Cc: Christoph Hellwig 
> > Cc: Jason Gunthorpe 
> > Cc: "Jérôme Glisse" 
> > Cc: Ben Skeggs 
> 
> Sorry for delay i am swamp, couple issues:
> - nouveau_pfns_map() is never call, it should be call after
>   the dma copy is done (iirc it is lacking proper fencing
>   so that would need to be implemented first)
> 
> - the migrate ioctl is disconnected from the svm part and
>   thus we would need first to implement svm reference counting
>   and take a reference at begining of migration process and
>   release it at the end ie struct nouveau_svmm needs refcounting
>   of some sort. I let Ben decides what he likes best for that.

Thinking more about it, the SVM lifetime is bound to the file
descriptor of the device driver file held by the process. So when you
call the migrate ioctl, the SVM should not go away because you are in
an ioctl against that file descriptor. I need to double-check all of
that with respect to processes that open the device file multiple
times with different file descriptors (or fork and so on).
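
If explicit refcounting ends up being the way to go, the plain kref
pattern would do; a minimal sketch (drv_svmm is a hypothetical
stand-in for struct nouveau_svmm, this is not the actual driver code):

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct drv_svmm {                       /* hypothetical stand-in */
        struct kref ref;
        /* ... mirror state ... */
};

static struct drv_svmm *drv_svmm_alloc(void)
{
        struct drv_svmm *svmm = kzalloc(sizeof(*svmm), GFP_KERNEL);

        if (svmm)
                kref_init(&svmm->ref);
        return svmm;
}

static void drv_svmm_release(struct kref *kref)
{
        struct drv_svmm *svmm = container_of(kref, struct drv_svmm, ref);

        kfree(svmm);
}

static inline struct drv_svmm *drv_svmm_get(struct drv_svmm *svmm)
{
        kref_get(&svmm->ref);
        return svmm;
}

static inline void drv_svmm_put(struct drv_svmm *svmm)
{
        kref_put(&svmm->ref, drv_svmm_release);
}

The migrate path would then take a reference before queueing the copy
and drop it from whatever (fence or work item) finishes the migration.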


> - i rather not have an extra pfns array, i am pretty sure we
>   can directly feed what we get from the dma array to the svm
>   code to update the GPU page table
> 
> Observation that can be delayed to latter patches:
> - i do not think we want to map directly if the dma engine
>   is queue up and thus if the fence will take time to signal
> 
>   This is why i did not implement this in the first place.
>   Maybe using a workqueue to schedule a pre-fill of the GPU
>   page table and wakeup the workqueue with the fence notify
>   event.
> 
> 
> > ---
> > 
> > This patch is based on top of Christoph Hellwig's 9 patch series
> > https://lore.kernel.org/linux-mm/20190729234611.gc7...@redhat.com/T/#u
> > "turn the hmm migrate_vma upside down" but without patch 9
> > "mm: remove the unused MIGRATE_PFN_WRITE" and adds a use for the flag.
> > 
> > 
> >  drivers/gpu/drm/nouveau/nouveau_dmem.c | 45 +-
> >  drivers/gpu/drm/nouveau/nouveau_svm.c  | 86 ++
> >  drivers/gpu/drm/nouveau/nouveau_svm.h  | 19 ++
> >  3 files changed, 133 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
> > b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > index ef9de82b0744..c83e6f118817 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > @@ -25,11 +25,13 @@
> >  #include "nouveau_dma.h"
> >  #include "nouveau_mem.h"
> >  #include "nouveau_bo.h"
> > +#include "nouveau_svm.h"
> >  
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -560,11 +562,12 @@ nouveau_dmem_init(struct nouveau_drm *drm)
> >  }
> >  
> >  static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
> > -   struct vm_area_struct *vma, unsigned long addr,
> > -   unsigned long src, dma_addr_t *dma_addr)
> > +   struct vm_area_struct *vma, unsigned long src,
> > +   dma_addr_t *dma_addr, u64 *pfn)
> >  {
> > struct device *dev = drm->dev->dev;
> > struct page *dpage, *spage;
> > +   unsigned long paddr;
> >  
> > spage = migrate_pfn_to_page(src);
> > if (!spage || !(src & MIGRATE_PFN_MIGRATE))
> > @@ -572,17 +575,21 @@ static unsigned long 
> > nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
> >  
> > dpage = nouveau_dmem_page_alloc_locked(drm);
> > if (!dpage)
> > -   return 0;
> > +   goto out;
> >  
> > *dma_addr = dma_map_page(dev, spage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
> > if (dma_mapping_error(dev, *dma_addr))
> > goto out_free_page;
> >  
> > +   paddr = nouveau_dmem_page_addr(dpage);
> > if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_VRAM,
> > -   nouveau_dmem_page_addr(dpage)

Re: [PATCH] nouveau/hmm: map pages after migration

2019-08-13 Thread Jerome Glisse
On Wed, Aug 07, 2019 at 08:02:14AM -0700, Ralph Campbell wrote:
> When memory is migrated to the GPU it is likely to be accessed by GPU
> code soon afterwards. Instead of waiting for a GPU fault, map the
> migrated memory into the GPU page tables with the same access permissions
> as the source CPU page table entries. This preserves copy on write
> semantics.
> 
> Signed-off-by: Ralph Campbell 
> Cc: Christoph Hellwig 
> Cc: Jason Gunthorpe 
> Cc: "Jérôme Glisse" 
> Cc: Ben Skeggs 

Sorry for the delay, I am swamped. A couple of issues:
- nouveau_pfns_map() is never called; it should be called after
  the DMA copy is done (IIRC it is lacking proper fencing,
  so that would need to be implemented first)

- the migrate ioctl is disconnected from the SVM part, so we
  would first need to implement SVM reference counting, taking
  a reference at the beginning of the migration process and
  releasing it at the end, i.e. struct nouveau_svmm needs
  refcounting of some sort. I will let Ben decide what he likes
  best for that.

- I would rather not have an extra pfns array; I am pretty sure we
  can directly feed what we get from the dma array to the SVM
  code to update the GPU page table

Observation that can be delayed to later patches:
- I do not think we want to map directly if the dma engine is
  queued up and the fence will therefore take time to signal.

  This is why I did not implement this in the first place.
  Maybe use a workqueue to schedule a pre-fill of the GPU
  page table and wake the workqueue up with the fence notify
  event.
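
Something along these lines, as a rough sketch only (the drv_* and
prefill_* names are made up, drv_map_pfns_to_gpu() stands in for
whatever helper ends up doing the page-table update, and error
handling is omitted):

#include <linux/dma-fence.h>
#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/types.h>

struct nouveau_svmm;                    /* from the driver */

/* hypothetical stand-in for the real page-table update helper */
void drv_map_pfns_to_gpu(struct nouveau_svmm *svmm, unsigned long start,
                         unsigned long end, u64 *pfns);

struct prefill_work {                   /* hypothetical deferred-map context */
        struct work_struct work;
        struct dma_fence_cb cb;
        struct nouveau_svmm *svmm;      /* assumed refcounted by the caller */
        unsigned long start, end;
        u64 *pfns;
};

static void prefill_worker(struct work_struct *work)
{
        struct prefill_work *p = container_of(work, struct prefill_work, work);

        /* the copy fence has signaled, safe to fill the GPU page table now */
        drv_map_pfns_to_gpu(p->svmm, p->start, p->end, p->pfns);
        kfree(p);
}

static void prefill_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
        struct prefill_work *p = container_of(cb, struct prefill_work, cb);

        /* fence callbacks may run in atomic context, so defer to a worker */
        schedule_work(&p->work);
}

/* call with the fence of the queued DMA copy */
static void prefill_schedule(struct prefill_work *p, struct dma_fence *fence)
{
        INIT_WORK(&p->work, prefill_worker);
        if (dma_fence_add_callback(fence, &p->cb, prefill_fence_cb))
                schedule_work(&p->work);        /* fence already signaled */
}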


> ---
> 
> This patch is based on top of Christoph Hellwig's 9 patch series
> https://lore.kernel.org/linux-mm/20190729234611.gc7...@redhat.com/T/#u
> "turn the hmm migrate_vma upside down" but without patch 9
> "mm: remove the unused MIGRATE_PFN_WRITE" and adds a use for the flag.
> 
> 
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 45 +-
>  drivers/gpu/drm/nouveau/nouveau_svm.c  | 86 ++
>  drivers/gpu/drm/nouveau/nouveau_svm.h  | 19 ++
>  3 files changed, 133 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
> b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index ef9de82b0744..c83e6f118817 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -25,11 +25,13 @@
>  #include "nouveau_dma.h"
>  #include "nouveau_mem.h"
>  #include "nouveau_bo.h"
> +#include "nouveau_svm.h"
>  
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -560,11 +562,12 @@ nouveau_dmem_init(struct nouveau_drm *drm)
>  }
>  
>  static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
> - struct vm_area_struct *vma, unsigned long addr,
> - unsigned long src, dma_addr_t *dma_addr)
> + struct vm_area_struct *vma, unsigned long src,
> + dma_addr_t *dma_addr, u64 *pfn)
>  {
>   struct device *dev = drm->dev->dev;
>   struct page *dpage, *spage;
> + unsigned long paddr;
>  
>   spage = migrate_pfn_to_page(src);
>   if (!spage || !(src & MIGRATE_PFN_MIGRATE))
> @@ -572,17 +575,21 @@ static unsigned long 
> nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>  
>   dpage = nouveau_dmem_page_alloc_locked(drm);
>   if (!dpage)
> - return 0;
> + goto out;
>  
>   *dma_addr = dma_map_page(dev, spage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
>   if (dma_mapping_error(dev, *dma_addr))
>   goto out_free_page;
>  
> + paddr = nouveau_dmem_page_addr(dpage);
>   if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_VRAM,
> - nouveau_dmem_page_addr(dpage), NOUVEAU_APER_HOST,
> - *dma_addr))
> + paddr, NOUVEAU_APER_HOST, *dma_addr))
>   goto out_dma_unmap;
>  
> + *pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
> + ((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
> + if (src & MIGRATE_PFN_WRITE)
> + *pfn |= NVIF_VMM_PFNMAP_V0_W;
>   return migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
>  
>  out_dma_unmap:
> @@ -590,18 +597,19 @@ static unsigned long 
> nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>  out_free_page:
>   nouveau_dmem_page_free_locked(drm, dpage);
>  out:
> + *pfn = NVIF_VMM_PFNMAP_V0_NONE;
>   return 0;
>  }
>  
>  static void nouveau_dmem_migrate_chunk(struct migrate_vma *args,
> - struct nouveau_drm *drm, dma_addr_t *dma_addrs)
> + struct nouveau_drm *drm, dma_addr_t *dma_addrs, u64 *pfns)
>  {
>   struct nouveau_fence *fence;
>   unsigned long addr = args->start, nr_dma = 0, i;
>  
>   for (i = 0; addr < args->end; i++) {
>   args->dst[i] = nouveau_dmem_migrate_copy_one(drm, args->vma,
> > - addr, args->src[i], &dma_addrs[nr_dma]);
> + 

Re: [PATCH 1/2] mm/hmm: support automatic NUMA balancing

2019-05-13 Thread Jerome Glisse
On Mon, May 13, 2019 at 02:27:20PM -0700, Andrew Morton wrote:
> On Fri, 10 May 2019 19:53:23 + "Kuehling, Felix"  
> wrote:
> 
> > From: Philip Yang 
> > 
> > While the page is migrating by NUMA balancing, HMM failed to detect this
> > condition and still return the old page. Application will use the new
> > page migrated, but driver pass the old page physical address to GPU,
> > this crash the application later.
> > 
> > Use pte_protnone(pte) to return this condition and then hmm_vma_do_fault
> > will allocate new page.
> > 
> > Signed-off-by: Philip Yang 
> 
> This should have included your signed-off-by:, since you were on the
> patch delivery path.  I'll make that change to my copy of the patch,
> OK?

Yes it should have included that.

Re: [PATCH 2/2] mm/hmm: Only set FAULT_FLAG_ALLOW_RETRY for non-blocking

2019-05-13 Thread Jerome Glisse
Andrew, can we get these 2 fixes lined up for 5.2?

On Mon, May 13, 2019 at 07:36:44PM +, Kuehling, Felix wrote:
> Hi Jerome,
> 
> Do you want me to push the patches to your branch? Or are you going to 
> apply them yourself?
> 
> Is your hmm-5.2-v3 branch going to make it into Linux 5.2? If so, do you 
> know when? I'd like to coordinate with Dave Airlie so that we can also 
> get that update into a drm-next branch soon.
> 
> I see that Linus merged Dave's pull request for Linux 5.2, which 
> includes the first changes in amdgpu using HMM. They're currently broken 
> without these two patches.

HMM patches do not go through any git branch; they go through the
mmotm collection. So it is not something you can easily coordinate
with a drm branch.

By broken I expect you mean that it breaks if NUMA balancing happens?
Or that it might sleep when you are not expecting it to?

Cheers,
Jérôme

> 
> Thanks,
>    Felix
> 
> On 2019-05-10 4:14 p.m., Jerome Glisse wrote:
> > [CAUTION: External Email]
> >
> > On Fri, May 10, 2019 at 07:53:24PM +, Kuehling, Felix wrote:
> >> Don't set this flag by default in hmm_vma_do_fault. It is set
> >> conditionally just a few lines below. Setting it unconditionally
> >> can lead to handle_mm_fault doing a non-blocking fault, returning
> >> -EBUSY and unlocking mmap_sem unexpectedly.
> >>
> >> Signed-off-by: Felix Kuehling 
> > Reviewed-by: Jérôme Glisse 
> >
> >> ---
> >>   mm/hmm.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/hmm.c b/mm/hmm.c
> >> index b65c27d5c119..3c4f1d62202f 100644
> >> --- a/mm/hmm.c
> >> +++ b/mm/hmm.c
> >> @@ -339,7 +339,7 @@ struct hmm_vma_walk {
> >>   static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
> >>bool write_fault, uint64_t *pfn)
> >>   {
> >> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
> >> + unsigned int flags = FAULT_FLAG_REMOTE;
> >>struct hmm_vma_walk *hmm_vma_walk = walk->private;
> >>struct hmm_range *range = hmm_vma_walk->range;
> >>struct vm_area_struct *vma = walk->vma;
> >> --
> >> 2.17.1
> >>

Re: [PATCH 2/2] mm/hmm: Only set FAULT_FLAG_ALLOW_RETRY for non-blocking

2019-05-10 Thread Jerome Glisse
On Fri, May 10, 2019 at 07:53:24PM +, Kuehling, Felix wrote:
> Don't set this flag by default in hmm_vma_do_fault. It is set
> conditionally just a few lines below. Setting it unconditionally
> can lead to handle_mm_fault doing a non-blocking fault, returning
> -EBUSY and unlocking mmap_sem unexpectedly.
> 
> Signed-off-by: Felix Kuehling 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/hmm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index b65c27d5c119..3c4f1d62202f 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -339,7 +339,7 @@ struct hmm_vma_walk {
>  static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
>   bool write_fault, uint64_t *pfn)
>  {
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
> + unsigned int flags = FAULT_FLAG_REMOTE;
>   struct hmm_vma_walk *hmm_vma_walk = walk->private;
>   struct hmm_range *range = hmm_vma_walk->range;
>   struct vm_area_struct *vma = walk->vma;
> -- 
> 2.17.1
> 

Re: [PATCH 1/2] mm/hmm: support automatic NUMA balancing

2019-05-10 Thread Jerome Glisse
On Fri, May 10, 2019 at 07:53:23PM +, Kuehling, Felix wrote:
> From: Philip Yang 
> 
> While the page is migrating by NUMA balancing, HMM failed to detect this
> condition and still return the old page. Application will use the new
> page migrated, but driver pass the old page physical address to GPU,
> this crash the application later.
> 
> Use pte_protnone(pte) to return this condition and then hmm_vma_do_fault
> will allocate new page.
> 
> Signed-off-by: Philip Yang 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/hmm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 75d2ea906efb..b65c27d5c119 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -554,7 +554,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
>  
>  static inline uint64_t pte_to_hmm_pfn_flags(struct hmm_range *range, pte_t 
> pte)
>  {
> - if (pte_none(pte) || !pte_present(pte))
> + if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
>   return 0;
>   return pte_write(pte) ? range->flags[HMM_PFN_VALID] |
>   range->flags[HMM_PFN_WRITE] :
> -- 
> 2.17.1
> 

Re: KASAN caught amdgpu / HMM use-after-free

2019-02-27 Thread Jerome Glisse
On Wed, Feb 27, 2019 at 06:02:49PM +0100, Michel Dänzer wrote:
> 
> See the attached dmesg excerpt. I've hit this a few times running piglit
> with amd-staging-drm-next, first on February 22nd.
> 
> The memory was freed after calling hmm_mirror_unregister in
> amdgpu_mn_destroy.

So that branch is not using the HMM changes queued up for 5.1, and
thus what you are doing is somewhat illegal. With the 5.1 changes
everything is refcounted and this bug should not be able to happen.
So if you rebase your work on top of

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-for-5.1

or linux-next (I believe I saw those bits in linux-next) then this
error will vanish. Sorry if there was confusion between what is legal
now and what is legal tomorrow :)

Cheers,
Jérôme

Re: [PATCH] drm/amdgpu: select ARCH_HAS_HMM and ZONE_DEVICE option

2019-02-20 Thread Jerome Glisse
On Thu, Feb 21, 2019 at 12:17:33AM +, Kuehling, Felix wrote:
> On 2019-02-20 6:34 p.m., Jerome Glisse wrote:
> > On Wed, Feb 20, 2019 at 10:39:49PM +, Kuehling, Felix wrote:
> >> On 2019-02-20 5:12 p.m., Jerome Glisse wrote:
> >>> On Wed, Feb 20, 2019 at 07:18:17PM +, Kuehling, Felix wrote:
> >>>> [+Jerome]
> >>>>
> >>>> Why to we need ZONE_DEVICE. I didn't think this was needed for mirroring
> >>>> CPU page tables to device page tables.
> >>>>
> >>>> ARCH_HAS_HMM depends on (X86_64 || PPC64). Do we have some alternative
> >>>> for ARM support?
> >>>>
> >>>> Also, the name ARCH_HAS_HMM looks like it's meant to be selected by the
> >>>> CPU architecture rather than any driver. Jerome, do you have any advice?
> >>> This patch is wrong you need to depend on ARCH_HAS_HMM and
> >> Who selects ARCH_HAS_HMM? Currently I don't see this selected anywhere.
> >> So any config option that depends on it will be invisible in menuconfig.
> >> Do we need ARCH_HAS_HMM somewhere in the arch/x86/Kconfig and
> >> arch/powerpc/Kconfig?
> >>
> >> Also, ARCH_HAS_HMM does not currently support ARM. Does that mean we
> >> can't have ARM support in AMDGPU if we start using HMM?
> > ARCH_HAS_HMM is defined by architecture that support HMM. So par x86
> > and PPC. It should not be hard to add it to ARM (i can not remember if
> > ARM has DAX yet or not, if ARM does not have DAX then you need to add
> > that first).
> 
> Not having ARM support is a bummer. I just enabled KFD on ARM a few 
> weeks ago. Now depending on HMM makes KFD unusable on ARM. [+Mark FYI] I 
> hope this is only a temporary setback.

It should not be hard to add; in fact, all it might need is a Kconfig
patch. I have no easy access to ARM with PCIe, so I have not tackled
this yet.

> 
> 
> >> Finally, ARCH_HAS_HMM has a bunch of dependencies. If they are not met,
> >> I guess it can't be enabled. Should those be "select"s instead?
> > No they should not be selected, people configuring their system need
> > to have the freedom of doing so. All those option are selected in all
> > the big distribution.
> As far as I can tell, the arch/x86/Kconfig doesn't select ARCH_HAS_HMM. 
> Its default is "y", so it should be enabled on anything that meets the 
> dependencies. But ZONE_DEVICE was not enabled by default. I think that's 
> what broke our kernel configs.
> 
> We'll fix our own kernel configs to enable ZONE_DEVICE and ARCH_HAS_HMM 
> to get our internal builds to work again.

You seem to be doing weird things with your kconfig ...

> 
> I suspect other users with their own kernel configs will stumble over 
> this and wonder why KFD and userptr support are disabled in their builds.

Patches to improve the Kconfig are welcome, but they should not
force-select things. Configuration is there to give users the freedom
to choose which features they want and which they give up.

Maybe the following would help:

ARCH_HAS_HMM
-	bool
-	default y
+	def_bool y

Cheers,
Jérôme

Re: [PATCH] drm/amdgpu: select ARCH_HAS_HMM and ZONE_DEVICE option

2019-02-20 Thread Jerome Glisse
On Wed, Feb 20, 2019 at 10:39:49PM +, Kuehling, Felix wrote:
> On 2019-02-20 5:12 p.m., Jerome Glisse wrote:
> > On Wed, Feb 20, 2019 at 07:18:17PM +, Kuehling, Felix wrote:
> >> [+Jerome]
> >>
> >> Why to we need ZONE_DEVICE. I didn't think this was needed for mirroring
> >> CPU page tables to device page tables.
> >>
> >> ARCH_HAS_HMM depends on (X86_64 || PPC64). Do we have some alternative
> >> for ARM support?
> >>
> >> Also, the name ARCH_HAS_HMM looks like it's meant to be selected by the
> >> CPU architecture rather than any driver. Jerome, do you have any advice?
> > This patch is wrong you need to depend on ARCH_HAS_HMM and
> 
> Who selects ARCH_HAS_HMM? Currently I don't see this selected anywhere. 
> So any config option that depends on it will be invisible in menuconfig. 
> Do we need ARCH_HAS_HMM somewhere in the arch/x86/Kconfig and 
> arch/powerpc/Kconfig?
> 
> Also, ARCH_HAS_HMM does not currently support ARM. Does that mean we 
> can't have ARM support in AMDGPU if we start using HMM?

ARCH_HAS_HMM is defined by architectures that support HMM, so far x86
and PPC. It should not be hard to add it to ARM (I cannot remember
whether ARM has DAX yet or not; if ARM does not have DAX then you need
to add that first).

> 
> Finally, ARCH_HAS_HMM has a bunch of dependencies. If they are not met, 
> I guess it can't be enabled. Should those be "select"s instead?

No, they should not be selected; people configuring their system need
to have the freedom of doing so. All those options are selected in all
the big distributions.

> config ARCH_HAS_HMM
>  bool
>  default y
>  depends on (X86_64 || PPC64)
>  depends on ZONE_DEVICE
>  depends on MMU && 64BIT
>  depends on MEMORY_HOTPLUG
>  depends on MEMORY_HOTREMOVE
>  depends on SPARSEMEM_VMEMMAP
> 

Cheers,
Jérôme

Re: [PATCH] drm/amdgpu: select ARCH_HAS_HMM and ZONE_DEVICE option

2019-02-20 Thread Jerome Glisse
On Wed, Feb 20, 2019 at 07:18:17PM +, Kuehling, Felix wrote:
> [+Jerome]
> 
> Why to we need ZONE_DEVICE. I didn't think this was needed for mirroring 
> CPU page tables to device page tables.
> 
> ARCH_HAS_HMM depends on (X86_64 || PPC64). Do we have some alternative 
> for ARM support?
> 
> Also, the name ARCH_HAS_HMM looks like it's meant to be selected by the 
> CPU architecture rather than any driver. Jerome, do you have any advice?

This patch is wrong: you need to depend on ARCH_HAS_HMM and
select HMM_MIRROR; you do not need to select ZONE_DEVICE.

So it should look like:

config DRM_AMDGPU_USERPTR
	bool "Always enable userptr write support"
	depends on DRM_AMDGPU
	depends on ARCH_HAS_HMM
	select HMM_MIRROR
	help
	  This option selects CONFIG_HMM and CONFIG_HMM_MIRROR if it
	  isn't already selected to enabled full userptr support.

I have not gotten around to working on amdgpu in that respect yet,
but it is on my todo list unless someone else beats me to it :)

Cheers,
Jérôme

> 
> Thanks,
>    Felix
> 
> On 2019-02-20 1:56 p.m., Yang, Philip wrote:
> > Those options are needed to support HMM
> >
> > Change-Id: Ieb7bb3bcec07245d79a02793e6728228decc400a
> > Signed-off-by: Philip Yang 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/Kconfig | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/Kconfig 
> > b/drivers/gpu/drm/amd/amdgpu/Kconfig
> > index 960a63355705..63f0542bc34b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/Kconfig
> > +++ b/drivers/gpu/drm/amd/amdgpu/Kconfig
> > @@ -26,7 +26,9 @@ config DRM_AMDGPU_CIK
> >   config DRM_AMDGPU_USERPTR
> > bool "Always enable userptr write support"
> > depends on DRM_AMDGPU
> > +   select ARCH_HAS_HMM
> > select HMM_MIRROR
> > +   select ZONE_DEVICE
> > help
> >   This option selects CONFIG_HMM and CONFIG_HMM_MIRROR if it
> >   isn't already selected to enabled full userptr support.

Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 06:40:03PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
> [...]
> > I am fine with Michal patch, i already said so couple month ago first time
> > this discussion did pop up, Michal you can add:
> > 
> > Reviewed-by: Jérôme Glisse 
> 
> So I guess the below is the patch you were talking about?
> 
> From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Fri, 24 Aug 2018 15:32:24 +0200
> Subject: [PATCH] mm, mmu_notifier: be explicit about range invalition
>  non-blocking mode
> 
> If invalidate_range_start is called for !blocking mode then all
> callbacks have to guarantee they will no block/sleep. The same obviously
> applies to invalidate_range_end because this operation pairs with the
> former and they are called from the same context. Make sure this is
> appropriately documented.

In my branch I already updated HMM to be like the other existing
users, i.e. all blocking operations happen in the start callback. But
yes, it would be wise to add such comments.
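
For reference, a minimal sketch of what honouring the new blockable
flag in a mirror-style notifier could look like (the drv_* names and
the rw_semaphore layout are hypothetical; this is not the actual
mm/hmm.c code):

#include <linux/mmu_notifier.h>
#include <linux/rwsem.h>
#include <linux/errno.h>

struct drv_hmm {                        /* hypothetical mirror bookkeeping */
        struct mmu_notifier mn;
        struct rw_semaphore mirrors_sem;
};

/* hypothetical: tell mirrors to stop using [start, end) */
void drv_sync_device_pagetables(struct drv_hmm *h, unsigned long start,
                                unsigned long end);

static int drv_invalidate_range_start(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start,
                                      unsigned long end,
                                      bool blockable)
{
        struct drv_hmm *h = container_of(mn, struct drv_hmm, mn);

        if (blockable)
                down_read(&h->mirrors_sem);
        else if (!down_read_trylock(&h->mirrors_sem))
                return -EAGAIN;         /* cannot sleep in !blockable mode */

        drv_sync_device_pagetables(h, start, end);
        up_read(&h->mirrors_sem);
        return 0;
}

static void drv_invalidate_range_end(struct mmu_notifier *mn,
                                     struct mm_struct *mm,
                                     unsigned long start,
                                     unsigned long end)
{
        /* nothing left here that can block */
}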


> 
> Signed-off-by: Michal Hocko 
> ---
>  include/linux/mmu_notifier.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 133ba78820ee..698e371aafe3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
>*
>* If blockable argument is set to false then the callback cannot
>* sleep and has to return with -EAGAIN. 0 should be returned
> -  * otherwise.
> +  * otherwise. Please note that if invalidate_range_start approves
> +  * a non-blocking behavior then the same applies to
> +  * invalidate_range_end.
>*
>*/
>   int (*invalidate_range_start)(struct mmu_notifier *mn,
> -- 
> 2.18.0
> 
> -- 
> Michal Hocko
> SUSE Labs


Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 11:52:25PM +0900, Tetsuo Handa wrote:
> On 2018/08/24 22:32, Michal Hocko wrote:
> > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> >> I worry that (currently
> >> out-of-tree) users of this API are involving work / recursion.
> > 
> > I do not give a slightest about out-of-tree modules. They will have to
> > accomodate to the new API. I have no problems to extend the
> > documentation and be explicit about this expectation.
> 
> You don't need to care about out-of-tree modules. But you need to hear from
> mm/hmm.c authors/maintainers when making changes for mmu-notifiers.
> 
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 133ba78820ee..698e371aafe3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
> >  *
> >  * If blockable argument is set to false then the callback cannot
> >  * sleep and has to return with -EAGAIN. 0 should be returned
> > -* otherwise.
> > +* otherwise. Please note that if invalidate_range_start approves
> > +* a non-blocking behavior then the same applies to
> > +* invalidate_range_end.
> 
> Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
> notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
> mmu-notifiers users.
> 
>   -* If both of these callbacks cannot block, and invalidate_range
>   -* cannot block, mmu_notifier_ops.flags should have
>   -* MMU_INVALIDATE_DOES_NOT_BLOCK set.
>   +* If blockable argument is set to false then the callback 
> cannot
>   +* sleep and has to return with -EAGAIN. 0 should be returned
>   +* otherwise.
> 
> Even out-of-tree mmu-notifiers users had rights not to accommodate (i.e.
> make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.
> 
> Now we are in a merge window. And we noticed a possibility that out-of-tree
> mmu-notifiers users might have trouble with making changes immediately in 
> order
> to follow 93065ac753e44438 if expectation for mm/hmm.c changes immediately.
> And you are trying to ignore such possibility by just updating expected 
> behavior
> description instead of giving out-of-tree users a grace period to check and 
> update
> their code.

The intention is that 99% of HMM users will be upstream; as long as
they are not, people shouldn't worry about them. We have been working
on nouveau to use it for the last year or so. Many bits were added in
4.16, 4.17 and 4.18, and I hope it will all be there in the 4.20/4.21
timeframe.

See my other mail for the list of other users.

> 
> >> and keeps "all operations protected by hmm->mirrors_sem held for write are
> >> atomic". This suggests that "some operations protected by hmm->mirrors_sem 
> >> held
> >> for read will sleep (and in the worst case involves memory allocation
> >> dependency)".
> > 
> > Yes and so what? The clear expectation is that neither of the range
> > notifiers do not sleep in !blocking mode. I really fail to see what you
> > are trying to say.
> 
> I'm saying "Get ACK from Jérôme about mm/hmm.c changes".

I am fine with Michal's patch; I already said so a couple of months
ago, the first time this discussion popped up. Michal, you can add:

Reviewed-by: Jérôme Glisse 


Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 02:33:41PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 14:18:44, Christian König wrote:
> > Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> > > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > > [...]
> > > > > > That won't work like this there might be multiple
> > > > > > invalidate_range_start()/invalidate_range_end() pairs open at the 
> > > > > > same time.
> > > > > > E.g. the lock might be taken recursively and that is illegal for a
> > > > > > rw_semaphore.
> > > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > > trigger another one from the same path?
> > > > No, but what can happen is:
> > > > 
> > > > invalidate_range_start(A,B);
> > > > invalidate_range_start(C,D);
> > > > ...
> > > > invalidate_range_end(C,D);
> > > > invalidate_range_end(A,B);
> > > > 
> > > > Grabbing the read lock twice would be illegal in this case.
> > > I am sorry but I still do not follow. What is the context the two are
> > > called from?
> > 
> > I don't have the slightest idea.
> > 
> > > Can you give me an example. I simply do not see it in the
> > > code, mostly because I am not familiar with it.
> > 
> > I'm neither.
> > 
> > We stumbled over that by pure observation and after discussing the problem
> > with Jerome came up with this solution.
> > 
> > No idea where exactly that case comes from, but I can confirm that it indeed
> > happens.
> 
> Thiking about it some more, I can imagine that a notifier callback which
> performs an allocation might trigger a memory reclaim and that in turn
> might trigger a notifier to be invoked and recurse. But notifier
> shouldn't really allocate memory. They are called from deep MM code
> paths and this would be extremely deadlock prone. Maybe Jerome can come
> up some more realistic scenario. If not then I would propose to simplify
> the locking here. We have lockdep to catch self deadlocks and it is
> always better to handle a specific issue rather than having a code
> without a clear indication how it can recurse.

Multiple concurrent mmu notifiers, for overlapping ranges or not, are
common (each concurrent thread can trigger some). So you might have
multiple invalidate_range_start() calls in flight for the same mm, and
they might thus complete (invalidate_range_end()) in a different
order. IIRC this is what this lock was trying to protect against.

I can't think of a reason for a recursive mmu notifier call right now.
I will ponder it and see if I remember something about it.
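
As a purely illustrative sketch (not the amdgpu code), one way to cope
with out-of-order start/end pairs without holding a lock across the
two callbacks is a counter plus a sequence number, roughly:

#include <linux/atomic.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

struct drv_mn {                         /* hypothetical per-mm state */
        atomic_t invalidations;         /* start() raises it, end() lowers it */
        unsigned long seq;              /* bumped on every start() */
        spinlock_t lock;
        wait_queue_head_t wq;
};

static void drv_mn_init(struct drv_mn *n)
{
        atomic_set(&n->invalidations, 0);
        n->seq = 0;
        spin_lock_init(&n->lock);
        init_waitqueue_head(&n->wq);
}

static void drv_range_start(struct drv_mn *n)
{
        spin_lock(&n->lock);
        n->seq++;
        atomic_inc(&n->invalidations);
        spin_unlock(&n->lock);
}

static void drv_range_end(struct drv_mn *n)
{
        /* pairs may complete in any order; only the count matters here */
        if (atomic_dec_and_test(&n->invalidations))
                wake_up_all(&n->wq);
}

Users of the mirrored range would wait for the count to drop to zero
and then check that seq did not move while they were setting things up.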

Cheers,
Jérôme


Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 07:54:19PM +0900, Tetsuo Handa wrote:
> Two more worries for this patch.

[...]

> 
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, 
> > struct mm_struct *mm)
> > up_write(&hmm->mirrors_sem);
> >  }
> > 
> > -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> > +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >struct mm_struct *mm,
> >unsigned long start,
> > -  unsigned long end)
> > +  unsigned long end,
> > +  bool blockable)
> >  {
> > struct hmm *hmm = mm->hmm;
> > 
> > VM_BUG_ON(!hmm);
> > 
> > atomic_inc(>sequence);
> > +
> > +   return 0;
> >  }
> > 
> >  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> 
> This assumes that hmm_invalidate_range_end() does not have memory
> allocation dependency. But hmm_invalidate_range() from
> hmm_invalidate_range_end() involves
> 
> > down_read(&hmm->mirrors_sem);
> > list_for_each_entry(mirror, &hmm->mirrors, list)
> > mirror->ops->sync_cpu_device_pagetables(mirror, action,
> > start, end);
> > up_read(&hmm->mirrors_sem);
> 
> sequence. What is surprising is that there is no in-tree user who assigns
> sync_cpu_device_pagetables field.
> 
>   $ grep -Fr sync_cpu_device_pagetables *
>   Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize 
> page tables
>   include/linux/hmm.h: * will get callbacks through 
> sync_cpu_device_pagetables() operation (see
>   include/linux/hmm.h:/* sync_cpu_device_pagetables() - synchronize page 
> tables
>   include/linux/hmm.h:void (*sync_cpu_device_pagetables)(struct 
> hmm_mirror *mirror,
>   include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() 
> callback, so that CPU page
>   mm/hmm.c:   mirror->ops->sync_cpu_device_pagetables(mirror, 
> action,
> 
> That is, this API seems to be currently used by only out-of-tree users. Since
> we can't check that nobody has memory allocation dependency, I think that
> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for 
> now.

So you can see the updates and users of this here:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-intel-v00
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau-v01
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-radeon-v00

I am still working on the Mellanox and AMD GPU patchsets.

I will post the HMM changes that adapt to Michal's patch shortly, as
they have been sufficiently tested by now anyway:

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20=78785dcb5ba0924c2c5e7be027793f99ebbc39f3
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20=4fc25571dc893f2b278e90cda9e71e139e01de70

Cheers,
Jérôme


Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-06-22 Thread Jerome Glisse
On Fri, Jun 22, 2018 at 06:42:43PM +0200, Michal Hocko wrote:
> [Resnding with the CC list fixed]
> 
> On Fri 22-06-18 18:40:26, Michal Hocko wrote:
> > On Fri 22-06-18 12:18:46, Jerome Glisse wrote:
> > > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> > > > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > > > Hi,
> > > > > > this is an RFC and not tested at all. I am not very familiar with 
> > > > > > the
> > > > > > mmu notifiers semantics very much so this is a crude attempt to 
> > > > > > achieve
> > > > > > what I need basically. It might be completely wrong but I would like
> > > > > > to discuss what would be a better way if that is the case.
> > > > > > 
> > > > > > get_maintainers gave me quite large list of people to CC so I had 
> > > > > > to trim
> > > > > > it down. If you think I have forgot somebody, please let me know
> > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c 
> > > > > > b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > index 854bd51b9478..5285df9331fa 100644
> > > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object 
> > > > > > *mo)
> > > > > > mo->attached = false;
> > > > > >  }
> > > > > >  
> > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct 
> > > > > > mmu_notifier *_mn,
> > > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct 
> > > > > > mmu_notifier *_mn,
> > > > > >struct 
> > > > > > mm_struct *mm,
> > > > > >unsigned 
> > > > > > long start,
> > > > > > -  unsigned 
> > > > > > long end)
> > > > > > +  unsigned 
> > > > > > long end,
> > > > > > +  bool 
> > > > > > blockable)
> > > > > >  {
> > > > > > struct i915_mmu_notifier *mn =
> > > > > > container_of(_mn, struct i915_mmu_notifier, mn);
> > > > > > @@ -124,7 +125,7 @@ static void 
> > > > > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > > LIST_HEAD(cancelled);
> > > > > >  
> > > > > > if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > > > -   return;
> > > > > > +   return 0;
> > > > > 
> > > > > The principle wait here is for the HW (even after fixing all the locks
> > > > > to be not so coarse, we still have to wait for the HW to finish its
> > > > > access).
> > > > 
> > > > Is this wait bound or it can take basically arbitrary amount of time?
> > > 
> > > Arbitrary amount of time but in desktop use case you can assume that
> > > it should never go above 16ms for a 60frame per second rendering of
> > > your desktop (in GPU compute case this kind of assumption does not
> > > hold). Is the process exit_state already updated by the time this mmu
> > > notifier callbacks happen ?
> > 
> > What do you mean? The process is killed (by SIGKILL) at the time but we
> > do not know much more than that. The task might be stuck anywhere in the
> > kernel before handling that signal.

I was wondering if another thread might still be dereferencing any of
the structures concurrently with the OOM mmu notifier callback. Sadly
yes; it would be simpler if we could make such an assumption.

> > 
> > > > > The first pass would be then to not do anything here if
> > > > > !blockable.
> > > > 
> > > > something like this? (incremental diff)
> > > 
> > > What i wanted to do with HMM and mmu notifier is split the invalidation
> > > in 2 pass. First pass tell the drivers t

Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-06-22 Thread Jerome Glisse
On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > Hi,
> > > this is an RFC and not tested at all. I am not very familiar with the
> > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > what I need basically. It might be completely wrong but I would like
> > > to discuss what would be a better way if that is the case.
> > > 
> > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > it down. If you think I have forgot somebody, please let me know
> > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c 
> > > b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > index 854bd51b9478..5285df9331fa 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > mo->attached = false;
> > >  }
> > >  
> > > -static void i915_gem_userptr_mn_invalidate_range_start(struct 
> > > mmu_notifier *_mn,
> > > +static int i915_gem_userptr_mn_invalidate_range_start(struct 
> > > mmu_notifier *_mn,
> > >struct mm_struct 
> > > *mm,
> > >unsigned long 
> > > start,
> > > -  unsigned long end)
> > > +  unsigned long end,
> > > +  bool blockable)
> > >  {
> > > struct i915_mmu_notifier *mn =
> > > container_of(_mn, struct i915_mmu_notifier, mn);
> > > @@ -124,7 +125,7 @@ static void 
> > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > LIST_HEAD(cancelled);
> > >  
> > > if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > -   return;
> > > +   return 0;
> > 
> > The principle wait here is for the HW (even after fixing all the locks
> > to be not so coarse, we still have to wait for the HW to finish its
> > access).
> 
> Is this wait bound or it can take basically arbitrary amount of time?

An arbitrary amount of time, but in the desktop use case you can
assume it should never go above 16ms for a 60 frames-per-second
rendering of your desktop (in the GPU compute case this kind of
assumption does not hold). Is the process exit_state already updated
by the time these mmu notifier callbacks happen?

> 
> > The first pass would be then to not do anything here if
> > !blockable.
> 
> something like this? (incremental diff)

What I wanted to do with HMM and mmu notifiers is split the
invalidation into two passes. The first pass tells the drivers to
stop/cancel pending jobs that depend on the range and to invalidate
internal driver state (like clearing the buffer object pages array in
the GPU case, but not the GPU page table), while the second callback
does the actual wait for the GPU to be done and updates the GPU page
table.

Now, in this scheme, if the task is already in some exit state and all
CPU threads are frozen/killed, then we can probably find a way to do
the first pass mostly locklessly. AFAICR neither AMD nor Intel allow
sharing a userptr BO, hence a userptr BO should only ever be accessed
through ioctls submitted by the process.

The second call can then be delayed and pinged from time to time to
see if the GPU jobs are done.
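
A rough sketch of that split, using the callback signatures from the
RFC in this thread (the drv_* and gpu_mirror_* helpers are
hypothetical stand-ins, not any driver's actual code):

#include <linux/mmu_notifier.h>
#include <linux/errno.h>

struct drv_gpu_mirror {                 /* hypothetical driver state */
        struct mmu_notifier mn;
};

/* hypothetical helpers */
bool drv_cancel_jobs(struct drv_gpu_mirror *m, unsigned long start,
                     unsigned long end, bool blockable);
void drv_wait_range_idle(struct drv_gpu_mirror *m, unsigned long start,
                         unsigned long end);
void drv_update_gpu_page_table(struct drv_gpu_mirror *m, unsigned long start,
                               unsigned long end);

static int gpu_mirror_invalidate_start(struct mmu_notifier *mn,
                                       struct mm_struct *mm,
                                       unsigned long start,
                                       unsigned long end,
                                       bool blockable)
{
        struct drv_gpu_mirror *m = container_of(mn, struct drv_gpu_mirror, mn);

        /* pass 1: stop/cancel pending jobs touching [start, end) and drop
         * the CPU-side state (e.g. the buffer object pages array) without
         * touching the GPU page table */
        if (!drv_cancel_jobs(m, start, end, blockable))
                return -EAGAIN;         /* can only happen when !blockable */
        return 0;
}

static void gpu_mirror_invalidate_end(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start,
                                      unsigned long end)
{
        struct drv_gpu_mirror *m = container_of(mn, struct drv_gpu_mirror, mn);

        /* pass 2: wait (or poll from a worker) for the GPU to be done with
         * the range, then update the GPU page table */
        drv_wait_range_idle(m, start, end);
        drv_update_gpu_page_table(m, start, end);
}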


Note that what you propose might still be useful: in case there is no
buffer object for a range, the OOM killer can make progress in freeing
that range of memory. It is very likely that a significant part of a
process's virtual address range, and its backing memory, can be
reclaimed that way. This assumes the OOM killer reclaims vma by vma,
or at some granularity like 1GB at a time. Or we could also update the
blocking callback to return the ranges that are blocking, so that the
OOM killer can reclaim around them.

Cheers,
Jérôme


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-03 Thread Jerome Glisse
On Tue, Apr 03, 2018 at 11:09:09AM +0200, Daniel Vetter wrote:
> On Thu, Mar 29, 2018 at 01:34:24PM +0200, Christian König wrote:
> > Am 29.03.2018 um 08:57 schrieb Daniel Vetter:
> > > On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> > > > Add a peer2peer flag noting that the importer can deal with device
> > > > resources which are not backed by pages.
> > > > 
> > > > Signed-off-by: Christian König 
> > > Um strictly speaking they all should, but ttm never bothered to use the
> > > real interfaces but just hacked around the provided sg list, grabbing the
> > > underlying struct pages, then rebuilding the sg list again.
> > 
> > Actually that isn't correct. TTM converts them to a dma address array
> > because drivers need it like this (at least nouveau, radeon and amdgpu).
> > 
> > I've fixed radeon and amdgpu to be able to deal without it and mailed with
> > Ben about nouveau, but the outcome is they don't really know.
> > 
> > TTM itself doesn't have any need for the pages on imported BOs (you can't
> > mmap them anyway), the real underlying problem is that sg tables doesn't
> > provide what drivers need.
> > 
> > I think we could rather easily fix sg tables, but that is a totally separate
> > task.
> 
> Looking at patch 8, the sg table seems perfectly sufficient to convey the
> right dma addresses to the importer. Ofcourse the exporter has to set up
> the right kind of iommu mappings to make this work.
> 
> > > The entire point of using sg lists was exactly to allow this use case of
> > > peer2peer dma (or well in general have special exporters which managed
> > > memory/IO ranges not backed by struct page). So essentially you're having
> > > a "I'm totally not broken flag" here.
> > 
> > No, independent of needed struct page pointers we need to note if the
> > exporter can handle peer2peer stuff from the hardware side in general.
> > 
> > So what I've did is just to set peer2peer allowed on the importer because of
> > the driver needs and clear it in the exporter if the hardware can't handle
> > that.
> 
> The only thing the importer seems to do is call the
> pci_peer_traffic_supported, which the exporter could call too. What am I
> missing (since the sturct_page stuff sounds like it's fixed already by
> you)?
> -Daniel

AFAIK Logan's patchset requires registering and initializing struct
page for the device memory you want to map (export, from the
exporter's point of view).

With a GPU this isn't something we want; struct page is >= 2^6 bytes,
so for a
 4GB GPU: 2^6*2^32/2^12 = 2^26 =  64MB of RAM
 8GB GPU: 2^6*2^33/2^12 = 2^27 = 128MB of RAM
16GB GPU: 2^6*2^34/2^12 = 2^28 = 256MB of RAM
32GB GPU: 2^6*2^35/2^12 = 2^29 = 512MB of RAM

All of this is mostly wasted, as only a small subset (which cannot be
constrained to a specific range) will ever be exported at any point in
time. For GPU workloads this is hardly justifiable; even for HMM I do
not plan to register all those pages.

Hence why I argue that dma_map_resource(), as used by Christian, is
good enough for us. People that care about SG can fix that, but I
would rather not have to depend on it and waste system memory.
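
For what it is worth, a trivial back-of-the-envelope check of the
numbers above (assuming a 64-byte struct page and 4KiB pages; the
helper name is made up):

/* bytes of RAM spent on struct page for vram_bytes of device memory */
static inline unsigned long long page_array_overhead(unsigned long long vram_bytes)
{
        return (vram_bytes / 4096) * 64;
}

/* page_array_overhead(4ULL << 30)  ==  64 MiB
 * page_array_overhead(32ULL << 30) == 512 MiB */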

Cheers,
Jérôme


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-04-02 Thread Jerome Glisse
On Mon, Apr 02, 2018 at 01:32:37PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 02/04/18 01:16 PM, Jerome Glisse wrote:
> > There isn't good API at the moment AFAIK, closest thing would either be
> > lookup_resource() or region_intersects(), but a more appropriate one can
> > easily be added, code to walk down the tree is readily available. More-
> > over this can be optimize like vma lookup are, even more as resource are
> > seldomly added so read side (finding a resource) can be heavily favor
> > over write side (adding|registering a new resource).
> 
> So someone needs to create a highly optimized tree that registers all
> physical address on the system and maps them to devices? That seems a
> long way from being realized. I'd hardly characterize that as "easily".
> If we can pass both devices to the API I'd suspect it would be preferred
> over the complicated tree. This, of course, depends on what users of the
> API need.

This tree already exists; it is all there upstream, see
kernel/resource.c. What is missing is something that takes a single
address and returns the device struct. There is a function that takes
a range, region_intersects(), and one that takes the start address,
lookup_resource(). It isn't hard to see that, using roughly the same
code as region_intersects(), a helper that returns the device for a
resource could be added.

And yes, currently this does not have a pointer back to the device
that owns the resource, but that can be added. It wasn't needed until
now.

It can later be optimized if device lookup shows up as a bottleneck in
a perf profile.


> 
> > cache coherency protocol (bit further than PCIE snoop). But also the
> > other direction the CPU access to device memory can also be cache coherent,
> > which is not the case in PCIE.
> 
> I was not aware that CAPI allows PCI device memory to be cache coherent.
> That sounds like it would be very tricky...

And yet CAPI, CCIX, Gen-Z, NVLink, ... are all interconnects that aim
at achieving this cache coherency between multiple devices and CPUs.

Jérôme


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-04-02 Thread Jerome Glisse
On Mon, Apr 02, 2018 at 11:37:07AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 02/04/18 11:20 AM, Jerome Glisse wrote:
> > The point i have been trying to get accross is that you do have this
> > information with dma_map_resource() you know the device to which you
> > are trying to map (dev argument to dma_map_resource()) and you can
> > easily get the device to which the memory belongs because you have the
> > CPU physical address of the memory hence you can lookup the resource
> > and get the device from that.
> 
> How do you go from a physical address to a struct device generally and
> in a performant manner?

There isn't a good API at the moment AFAIK; the closest things would
be lookup_resource() or region_intersects(), but a more appropriate
one can easily be added, as the code to walk down the tree is readily
available. Moreover, this can be optimized the way vma lookups are,
even more so as resources are seldom added, so the read side (finding
a resource) can be heavily favored over the write side
(adding/registering a new resource).

> 
> > IIRC CAPI make P2P mandatory but maybe this is with NVLink. We can ask
> > the PowerPC folks to confirm. Note CAPI is Power8 and newer AFAICT.
> 
> PowerPC folks recently told us specifically that Power9 does not support
> P2P between PCI root ports. I've said this many times. CAPI has nothing
> to do with it.

I need to check CAPI; I must have confused it with NVLink, which is
also on some PowerPC archs.

> 
> > Mapping to userspace have nothing to do here. I am talking at hardware
> > level. How thing are expose to userspace is a completely different
> > problems that do not have one solution fit all. For GPU you want this
> > to be under total control of GPU drivers. For storage like persistent
> > memory, you might want to expose it userspace more directly ...
> 
> My understanding (and I worked on this a while ago) is that CAPI
> hardware manages memory maps typically for userspace memory. When a
> userspace program changes it's mapping, the CAPI hardware is updated so
> that hardware is coherent with the user address space and it is safe to
> DMA to any address without having to pin memory. (This is very similar
> to ODP in RNICs.) This is *really* nice but doesn't solve *any* of the
> problems we've been discussing. Moreover, many developers want to keep
> P2P in-kernel, for the time being, where the problem of pinning memory
> does not exist.

What you describe is the ATS (Address Translation Service) / PASID
(Process Address Space IDentifier) part of CAPI, which has also been
available for years on the AMD x86 platform (AMD IOMMU-v2), though it
is barely ever used. The interesting aspect of CAPI is its cache
coherency protocol between devices and CPUs. This goes in both
directions: the usual device access to system memory can be cache
coherent with CPU access and participate in the cache coherency
protocol (a bit further than PCIe snooping), but in the other
direction CPU access to device memory can also be cache coherent,
which is not the case with PCIe.

This cache coherency between CPU and device is what made me assume
that CAPI must have peer-to-peer support, as peers must be able to
talk to each other for cache coherency purposes. But maybe all cache
coherency arbitration goes through a central directory, alleviating
the peer-to-peer requirement.

Anyway, like you said, this does not matter for the discussion.
dma_map_resource() can just be stubbed out on platforms that do not
support this, and they would not allow it. If it gets used on other
platforms and shows enough advantages that users start asking for it,
then maybe those platforms will pay attention to the hardware
requirement.

Note that with mmu_notifier there isn't any need to pin anything (even
without any special hardware capabilities), as long as you can preempt
what is happening on your hardware in order to update its page table.

Cheers,
Jérôme


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-04-02 Thread Jerome Glisse
On Mon, Apr 02, 2018 at 11:02:10AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 30/03/18 01:45 PM, Jerome Glisse wrote:
> > Looking at upstream code it seems that the x86 bits never made it upstream
> > and thus what is now upstream is only for ARM. See [1] for x86 code. Dunno
> > what happen, i was convince it got merge. So yes current code is broken on
> > x86. ccing Joerg maybe he remembers what happened there.
> > 
> > [1] https://lwn.net/Articles/646605/
> 
> That looks like a significant improvement over what's upstream. But it's
> three years old and looks to have been abandoned. I think I agree with
> Bjorn's comments in that if it's going to be a general P2P API it will
> need the device the resource belongs to in addition to the device that's
> initiating the DMA request.

The point I have been trying to get across is that you do have this
information with dma_map_resource(): you know the device to which you
are trying to map (the dev argument to dma_map_resource()), and you
can easily get the device to which the memory belongs because you have
the CPU physical address of the memory, hence you can look up the
resource and get the device from that.


> > Given it is currently only used by ARM folks it appear to at least work
> > for them (tm) :) Note that Christian is doing this in PCIE only context
> > and again dma_map_resource can easily figure that out if the address is
> > a PCIE or something else. Note that the exporter export the CPU bus
> > address. So again dma_map_resource has all the informations it will ever
> > need, if the peer to peer is fundamentaly un-doable it can return dma
> > error and it is up to the caller to handle this, just like GPU code do.
> > 
> > Do you claim that dma_map_resource is missing any information ?
> 
> Yes, that's what I said. All the existing ARM code appears to use it for
> platform devices only. I suspect platform P2P is relatively trivial to
> support on ARM. I think it's a big difference from using it for PCI P2P
> or general P2P on any bus.
> 

It does have all we need for PCIe when using the DMA API and not the
SG one. The SG issue IIRC is that it assumes struct page ... See above
for the device lookup.

> > I agree and understand that but for platform where such feature make sense
> > this will work. For me it is PowerPC and x86 and given PowerPC has CAPI
> > which has far more advance feature when it comes to peer to peer, i don't
> > see something more basic not working. On x86, Intel is a bit of lone wolf,
> > dunno if they gonna support this usecase pro-actively. AMD definitly will.
> 
> Well PowerPC doesn't even support P2P between root ports. And I fail to
> see how CAPI applies unless/until we get this memory mapped into
> userspace and the mappings need to be dynamically managed. Seems like
> that's a long way away.

IIRC CAPI makes P2P mandatory, but maybe that is NVLink. We can ask
the PowerPC folks to confirm. Note that CAPI is Power8 and newer,
AFAICT.

Mapping to userspace has nothing to do with this; I am talking at the
hardware level. How things are exposed to userspace is a completely
different problem that does not have a one-size-fits-all solution. For
a GPU you want this to be under the total control of the GPU driver.
For storage like persistent memory, you might want to expose it to
userspace more directly ...

Cheers,
Jérôme


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-30 Thread Jerome Glisse
On Thu, Mar 29, 2018 at 11:33:34PM -0700, Christoph Hellwig wrote:
> On Thu, Mar 29, 2018 at 09:58:54PM -0400, Jerome Glisse wrote:
> > dma_map_resource() is the right API (thought its current implementation
> > is fill with x86 assumptions). So i would argue that arch can decide to
> > implement it or simply return dma error address which trigger fallback
> > path into the caller (at least for GPU drivers). SG variant can be added
> > on top.
> 
> It isn't in general.  It doesn't integrate with scatterlists (see my
> comment to page one), and it doesn't integrate with all the subsystems
> that also need a kernel virtual address.

IIRC an SG variant was proposed at the time dma_map_resource() was
added; I don't know why it did not make it (again from memory, I think
it was because it grows the scatterlist struct size). My point is more
that people who need the SG variant should work on adding it. People
who do not should not be stopped by the lack of it. There is something
there upstream; it does not make sense not to use it. The dma-buf
infrastructure is useful to many drivers.

If you do not want to share the underlying logic of dma_map_resource(),
fine (i.e. not sharing the code inside drivers/iommu/*). I thought it
would be a good thing to share code ...

Cheers,
Jérôme


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-30 Thread Jerome Glisse
On Thu, Mar 29, 2018 at 10:25:52AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 29/03/18 10:10 AM, Christian König wrote:
> > Why not? I mean the dma_map_resource() function is for P2P while other 
> > dma_map_* functions are only for system memory.
> 
> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
> P2P. Though it's a bit odd seeing we've been working under the
> assumption that PCI P2P is different as it has to translate the PCI bus
> address. Where as P2P for devices on other buses is a big unknown.

dma_map_resource() is the right API (though its current implementation
is full of x86 assumptions). So I would argue that each arch can decide
to implement it, or simply return a DMA error address, which triggers
the fallback path in the caller (at least for GPU drivers). An SG
variant can be added on top.

dma_map_resource() is the right API because it has all the necessary
information. It uses the CPU physical address as the common address
"language": starting from the CPU physical address of a PCIe BAR to map
to another device, you can find the corresponding bus address in the
IOMMU code (a NOP on x86). So dma_map_resource() knows both the source
device, which exports its PCIe BAR, and the destination device.
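To make that concrete, here is a minimal sketch of how a GPU driver
could call the existing API (the helper and its parameters are made up
for illustration; only dma_map_resource(), dma_mapping_error() and
dma_unmap_resource() are real kernel calls):

#include <linux/dma-mapping.h>
#include <linux/errno.h>

/*
 * Hypothetical helper: make the exporter's PCIe BAR (bar_phys,
 * bar_size) addressable by the importing device. Returns 0 on success
 * and fills *dma_addr, or a negative value when the arch cannot map
 * it, in which case the caller takes its fallback path (e.g. bouncing
 * through system memory).
 */
static int map_peer_bar(struct device *importer, phys_addr_t bar_phys,
			size_t bar_size, dma_addr_t *dma_addr)
{
	*dma_addr = dma_map_resource(importer, bar_phys, bar_size,
				     DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(importer, *dma_addr))
		return -EIO;
	return 0;
}

/* Tear-down is symmetric:
 * dma_unmap_resource(importer, *dma_addr, bar_size,
 *                    DMA_BIDIRECTIONAL, 0);
 */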


So I don't think Christian needs to base his patchset on yours; this is
orthogonal. Christian is using an existing upstream API; if it is broken
on some arch, it is up to that arch to fix it, or to return an error if
it is not fixable. In fact I would assume that you would need to base
your patchset on top of dma_map_resource() too ...

Moreover, I doubt Christian wants to have struct page for this. For
nouveau there will be HMM struct pages, and that would conflict.

AFAICT you need struct page because the APIs you are trying to expose
to user space rely on the "regular" filesystem/RDMA umem APIs, which all
assume struct page. GPU drivers do not have this requirement, and it
should not be imposed on them.


So from my point of view Christian's patchset is good as it is, modulo
a better commit message.


Bottom line: I think we need a common PCIe helper for P2P, and the
current dma_map_resource() is the right kind from my POV. What you are
doing with struct page is specific to your subsystem and should not be
imposed on others.

Cheers,
Jérôme


Re: New KFD ioctls: taking the skeletons out of the closet

2018-03-06 Thread Jerome Glisse
On Tue, Mar 06, 2018 at 05:44:41PM -0500, Felix Kuehling wrote:
> Hi all,
> 
> Christian raised two potential issues in a recent KFD upstreaming code
> review that are related to the KFD ioctl APIs:
> 
>  1. behaviour of -ERESTARTSYS
>  2. transactional nature of KFD ioctl definitions, or lack thereof
> 
> I appreciate constructive feedback, but I also want to encourage an
> open-minded rather than a dogmatic approach to API definitions. So let
> me take all the skeletons out of my closet and get these APIs reviewed
> in the appropriate forum before we commit to them upstream. See the end
> of this email for reference.
> 
> The controversial part at this point is kfd_ioctl_map_memory_to_gpu. If
> any of the other APIs raise concerns or questions, please ask.
> 
> Because of the HSA programming model, KFD memory management APIs are
> synchronous. There is no pipelining. Command submission to GPUs through
> user mode queues does not involve KFD. This means KFD doesn't know what
> memory is used by the GPUs and when it's used. That means, when the
> map_memory_to_gpu ioctl returns to user mode, all memory mapping
> operations are complete and the memory can be used by the CPUs or GPUs
> immediately.
> 
> HSA also uses a shared virtual memory model, so typically memory gets
> mapped on multiple GPUs and CPUs at the same virtual address.

Does this mean that GPU memory gets pinned? Or system memory, for that
matter? This was discussed previously, but it really goes against the
kernel mantra: the kernel no longer manages the resource, and userspace
can hog GPU memory or even system memory. This is bad!

Cheers,
Jérôme


Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback

2017-08-31 Thread Jerome Glisse
On Thu, Aug 31, 2017 at 07:19:19PM -0400, Felix Kuehling wrote:
> On 2017-08-31 03:00 PM, Jerome Glisse wrote:
> > I was not saying you should not use mmu_notifier. For achieving B you need
> > mmu_notifier. Note that if you do like ODP/KVM then you do not need to
> > pin page.
> I would like that. I've thought about it before. The one problem I
> couldn't figure out is, where to set the accessed and dirty bits for the
> pages. Now we do it when we unpin. If we don't pin the pages in the
> first place, we don't have a good place for this.
> 
> Our hardware doesn't give us notifications or accessed/dirty bits, so we
> have to assume the worst case that the pages are continuously
> accessed/dirty.
> 
> I'd appreciate any advice how to handle that. (Sorry, I realize this is
> going a bit off topic.) A pointer to a document or source code would be
> great. :)

In invalidate_range_start(), i.e. the same place where you unpin today,
but instead of unpinning you would just call set_page_dirty().
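A minimal sketch of that idea, using the mmu_notifier callback
signature of that era (the tracking structure and all names are
hypothetical; the point is just calling set_page_dirty() on the pages
the GPU has mapped, instead of unpinning them):

#include <linux/mmu_notifier.h>
#include <linux/mm.h>

/* Hypothetical per-userptr tracking structure, for illustration. */
struct my_userptr {
	struct mmu_notifier mn;
	unsigned long start, end;	/* VA range mapped to the GPU */
	struct page **pages;
	unsigned long npages;
};

static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end)
{
	struct my_userptr *u = container_of(mn, struct my_userptr, mn);
	unsigned long i;

	if (end <= u->start || start >= u->end)
		return;		/* no overlap with the GPU mapping */

	/* The GPU may have written through its mapping; record that
	 * on the CPU side here rather than at unpin time. */
	for (i = 0; i < u->npages; i++)
		set_page_dirty(u->pages[i]);

	/* ...then invalidate the GPU page tables and wait for the GPU
	 * before the CPU page table update proceeds. */
}

static const struct mmu_notifier_ops my_mn_ops = {
	.invalidate_range_start = my_invalidate_range_start,
};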

Jérôme


Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback

2017-08-31 Thread Jerome Glisse
On Thu, Aug 31, 2017 at 02:39:17PM -0400, Felix Kuehling wrote:
> On 2017-08-31 09:59 AM, Jerome Glisse wrote:
> > [Adding Intel folks as they might be interested in this discussion]
> >
> > On Wed, Aug 30, 2017 at 05:51:52PM -0400, Felix Kuehling wrote:
> >> Hi Jérôme,
> >>
> >> I have some questions about the potential range-start-end race you
> >> mentioned.
> >>
> >> On 2017-08-29 07:54 PM, Jérôme Glisse wrote:
> >>> Note that a lot of existing user feels broken in respect to range_start/
> >>> range_end. Many user only have range_start() callback but there is nothing
> >>> preventing them to undo what was invalidated in their range_start() 
> >>> callback
> >>> after it returns but before any CPU page table update take place.
> >>>
> >>> The code pattern use in kvm or umem odp is an example on how to properly
> >>> avoid such race. In a nutshell use some kind of sequence number and active
> >>> range invalidation counter to block anything that might undo what the
> >>> range_start() callback did.
> >> What happens when we start monitoring an address range after
> >> invaligate_range_start was called? Sounds like we have to keep track of
> >> all active invalidations for just such a case, even in address ranges
> >> that we don't currently care about.
> >>
> >> What are the things we cannot do between invalidate_range_start and
> >> invalidate_range_end? amdgpu calls get_user_pages to re-validate our
> >> userptr mappings after the invalidate_range_start notifier invalidated
> >> it. Do we have to wait for invalidate_range_end before we can call
> >> get_user_pages safely?
> > Well the whole userptr bo object is somewhat broken from the start.
> > You never defined the semantic of it ie what is expected. I can
> > think of 2 differents semantics:
> >   A) a uptr buffer object is a snapshot of a memory at the time of
> >  uptr buffer object creation
> >   B) a uptr buffer object allow GPU to access a range of virtual
> >  address of a process an share coherent view of that range
> >  between CPU and GPU
> >
> > As it was implemented it is more inline with B but it is not defined
> > anywhere AFAICT.
> 
> Yes, we're trying to achieve B, that's why we have an MMU notifier in
> the first place. But it's been a struggle getting it to work properly,
> and we're still dealing with some locking issues and now this one.
> 
> >
> > Anyway getting back to your questions, it kind of doesn't matter as
> > you are using GUP ie you are pinning pages except for one scenario
> > (at least i can only think of one).
> >
> > Problematic case is race between CPU write to zero page or COW and
> > GPU driver doing read only GUP:
> [...]
> 
> Thanks, I was aware of COW but not of the zero-page case. I believe in
> most practical cases our userptr mappings are read-write, so this is
> probably not causing us any real trouble at the moment.
> 
> > So i would first define the semantic of uptr bo and then i would fix
> > accordingly the code. Semantic A is easier to implement and you could
> > just drop the whole mmu_notifier. Maybe it is better to create uptr
> > buffer object everytime you want to snapshot a range of address. I
> > don't think the overhead of buffer creation would matter.
> 
> That doesn't work for KFD and our compute memory model where CPU and GPU
> expect to share the same address space.
> 
> >
> > If you want to close the race for COW and zero page in case of read
> > only GUP there is no other way than what KVM or ODP is doing. I had
> > patchset to simplify all this but i need to bring it back to life.
> 
> OK. I'll look at these to get an idea.
> 
> > Note that other thing might race but as you pin the pages they do
> > not matter. It just mean that if you GUP after range_start() but
> > before range_end() and before CPU page table update then you pinned
> > the same old page again and nothing will happen (migrate will fail,
> > MADV_FREE will nop, ...). So you just did the range_start() callback
> > for nothing in those cases.
> 
> We pin the memory because the GPU wants to access it. So this is
> probably OK if the migration fails. However, your statement that we
> "just did the range_start() callback for nothing" implies that we could
> as well have ignored the range_start callback. But I don't think that's
> true. That way we'd keep a page pinned that is no longer mapped in the
> CPU address space. So the GPU mapping would be out

Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback

2017-08-29 Thread Jerome Glisse
On Tue, Aug 29, 2017 at 05:11:24PM -0700, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 4:54 PM, Jérôme Glisse  wrote:
> >
> > Note this is barely tested. I intend to do more testing of next few days
> > but i do not have access to all hardware that make use of the mmu_notifier
> > API.
> 
> Thanks for doing this.
> 
> > First 2 patches convert existing call of mmu_notifier_invalidate_page()
> > to mmu_notifier_invalidate_range() and bracket those call with call to
> > mmu_notifier_invalidate_range_start()/end().
> 
> Ok, those two patches are a bit more complex than I was hoping for,
> but not *too* bad.
> 
> And the final end result certainly looks nice:
> 
> >  16 files changed, 74 insertions(+), 214 deletions(-)
> 
> Yeah, removing all those invalidate_page() notifiers certainly makes
> for a nice patch.
> 
> And I actually think you missed some more lines that can now be
> removed: kvm_arch_mmu_notifier_invalidate_page() should no longer be
> needed either, so you can remove all of those too (most of them are
> empty inline functions, but x86 has one that actually does something.
> 
> So there's an added 30 or so dead lines that should be removed in the
> kvm patch, I think.

Yes, I missed that. I will wait for people to test, and for the results
of my own testing, before reposting if need be; otherwise I will post it
as a separate patch.

> 
> But from a _very_ quick read-through this looks fine. But it obviously
> needs testing.
> 
> People - *especially* the people who saw issues under KVM - can you
> try out Jérôme's patch-series? I aded some people to the cc, the full
> series is on lkml. Jérôme - do you have a git branch for people to
> test that they could easily pull and try out?

https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
git://people.freedesktop.org/~glisse/linux

(Sorry if that tree is a bit big; it has a lot of dead things in it. I
 need to push a clean and slim one.)

Jérôme


Re: [PATCH 22/24] drm/amdkfd: Adding new IOCTL for scratch memory v2

2017-08-21 Thread Jerome Glisse
On Mon, Aug 21, 2017 at 10:21:48PM +0300, Oded Gabbay wrote:
> On Mon, Aug 21, 2017 at 8:39 PM, Jerome Glisse <j.gli...@gmail.com> wrote:
> > On Tue, Aug 15, 2017 at 11:00:20PM -0400, Felix Kuehling wrote:
> >> From: Moses Reuben <moses.reu...@amd.com>
> >>
> >> v2:
> >> * Renamed ALLOC_MEMORY_OF_SCRATCH to SET_SCRATCH_BACKING_VA
> >> * Removed size parameter from the ioctl, it was unused
> >> * Removed hole in ioctl number space
> >> * No more call to write_config_static_mem
> >> * Return correct error code from ioctl
> >
> > What kind of memory is suppose to back this virtual address
> > range ? How big is the range suppose to be ? Can it be any
> > valid virtual address ?
> >
> > My worry here is to ascertain that one can not abuse this
> > ioctl say to set the virtual address to some mmaped shared
> > library code/data section and write something malicious
> > there.
> >
> > I am assuming that if it has to go through ATS/PASID of the
> > IOMMUv2 then the write protection will be asserted and we
> > will see proper COW (copy on write) due to mmap PRIVATE flags.
> >
> >
> > Idealy this area should be a special vma and the driver
> > should track its lifetime and cancel GPU jobs if it is
> > unmap. But i am unsure on how dynamic is that scratch
> > memory suppose to be (ie do you allocate new scratch memory
> > with every GPU job or is it allocated once and reuse for
> > every jobs).
> >
> > Bigger commit message would be nice too. Like i had tons
> > of i believe valid questions.
> >
> > Cheers,
> > Jérôme
> 
> Hi Jerome,
> Great points.
> 
> This is the explanation Felix gave a few days ago on this ioctl in a
> different email thread:
> 
> "Scratch memory is private memory per work-item. It's used when a shader
> program has too few registers available. With HSA we use flat scratch
> addressing, where shaders can access private memory in a special scratch
> aperture using normal memory instructions. Using the same virtual
> address, each work item gets its own private piece of memory. The
> hardware does the address translation from the VA in the private
> aperture to a scratch-backing VA. The application is responsible for
> allocating the memory to back that scratch area, and to map it somewhere
> in its virtual address space.
> 
> This ioctl tells the hardware (or HWS firmware) the VA of the scratch
> backing memory."
> 
> From this explanation, I think that the user's supplied VA should be
> tested that its a valid writable area of the user.
> How do you test for that ? could you point me to such a code in the kernel ?
> From looking at other drivers that pin host memory, like mellanox nic,
> they don't do any special testing for the address they receive from
> the user.

GUP (get_user_pages()) will perform the proper checks on your behalf.
I am assuming those drivers use GUP.
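A rough sketch of that pattern with the GUP API of that era (the helper
and its surrounding driver code are hypothetical; asking for FOLL_WRITE
is what makes GUP refuse read-only VMAs and break COW on private
mappings, which is the check discussed above):

#include <linux/mm.h>
#include <linux/sched.h>

/*
 * Hypothetical helper: pin npages of user memory starting at 'start'
 * for write access. get_user_pages() itself checks that the range is
 * backed by writable VMAs, so the driver needs no separate test.
 */
static long pin_scratch_backing(unsigned long start, unsigned long npages,
				struct page **pages)
{
	long pinned;

	down_read(&current->mm->mmap_sem);
	pinned = get_user_pages(start, npages, FOLL_WRITE, pages, NULL);
	up_read(&current->mm->mmap_sem);

	/* < npages (or negative) means part of the range was not a
	 * valid writable mapping. Caller releases pages with put_page(). */
	return pinned;
}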

Jérôme


Re: [PATCH 22/24] drm/amdkfd: Adding new IOCTL for scratch memory v2

2017-08-21 Thread Jerome Glisse
On Tue, Aug 15, 2017 at 11:00:20PM -0400, Felix Kuehling wrote:
> From: Moses Reuben 
> 
> v2:
> * Renamed ALLOC_MEMORY_OF_SCRATCH to SET_SCRATCH_BACKING_VA
> * Removed size parameter from the ioctl, it was unused
> * Removed hole in ioctl number space
> * No more call to write_config_static_mem
> * Return correct error code from ioctl

What kind of memory is supposed to back this virtual address range?
How big is the range supposed to be? Can it be any valid virtual
address?

My worry here is to ascertain that one cannot abuse this ioctl to,
say, set the virtual address to some mmapped shared library code/data
section and write something malicious there.

I am assuming that if it has to go through ATS/PASID of the IOMMUv2,
then write protection will be asserted and we will see proper COW
(copy on write) due to the MAP_PRIVATE mmap flag.

Ideally this area should be a special VMA, and the driver should track
its lifetime and cancel GPU jobs if it is unmapped. But I am unsure how
dynamic that scratch memory is supposed to be (i.e. do you allocate new
scratch memory with every GPU job, or is it allocated once and reused
for every job?).

A bigger commit message would be nice too; I had tons of, I believe,
valid questions.

Cheers,
Jérôme


Re: [PATCH 02/23] drm/amdgpu: change wptr to 64 bits

2017-06-16 Thread Jerome Glisse
On Thu, Mar 09, 2017 at 10:57:27AM +0100, Christian König wrote:
> Am 09.03.2017 um 04:44 schrieb Alex Deucher:
> > From: Ken Wang 
> > 
> > Newer asics use 64 bit wptrs
> 
> We need a better patch description. Newer asics doesn't use 64bit wptrs, but
> rather need them!
> 
> E.g. if the wptr is now smaller than the rptr that doesn't indicate a wrap
> around any more.

What does that mean? It doesn't make sense; are you allocating a 1 TB
command buffer and know you will never run out of space? ;)

Either we are talking about a ring buffer, in which case wptr < rptr
means a wrap-around, or we are talking about something new that is not
a ring buffer.

64 bits or 32 bits doesn't matter from the ring buffer point of view,
so I am puzzled about what this new thing is. Is there anywhere I can
read about this new command buffer thingy? Because from where I stand,
it looks like it would be better to be something other than a ring
buffer if there is no wrap-around.
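For reference, the usual scheme with wide, monotonically increasing
pointers looks something like the generic sketch below (not the amdgpu
code; all names are made up): the counters themselves are never
expected to wrap, the ring index is obtained by masking, and occupancy
comes straight from wptr - rptr.

#include <stdint.h>

/* Generic ring driven by monotonically increasing 64-bit counters. */
struct ring {
	uint64_t wptr;		/* total units ever written */
	uint64_t rptr;		/* total units ever consumed */
	uint32_t size;		/* ring size, power of two */
	uint32_t *buf;
};

static inline uint32_t ring_used(const struct ring *r)
{
	return (uint32_t)(r->wptr - r->rptr);	/* no wrap test needed */
}

static inline uint32_t ring_free(const struct ring *r)
{
	return r->size - ring_used(r);
}

static inline void ring_write(struct ring *r, uint32_t v)
{
	r->buf[r->wptr & (r->size - 1)] = v;	/* mask gives the index */
	r->wptr++;
}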

Cheers,
Jérôme 