Re: Deadlock on PTEs update for HMM

2019-12-05 Thread Christian König

Hi guys,

I was wondering for over an hour why the heck Felix wants to add 
another lock, before I realized that I accidentally forgot to send out 
patches #4 and #5.


And yes, what you describe is exactly what my patch #4 is doing; patch 
#5 is then a proof of concept that uses the new interface for evictions 
of shared BOs.


Regards,
Christian.

On 2019-12-05 at 02:27, Felix Kuehling wrote:

[adding the mailing list back]

Christian sent a 3-patch series that I just reviewed and commented on. 
That removes the need to lock the page directory reservation for 
updating page tables. That solves a part of the problem, but not all 
of it.


We still need to implement that driver lock for HMM, and add a 
vm_resident flag in the vm structure that could be checked in the new 
amdgpu_vm_evictable function Christian added. That function still runs 
in atomic context (under the ttm_bo_glob.lru_lock spinlock). So we 
can't lock a mutex there. But we can use trylock and return "not 
evictable" if we fail to take the mutex.


If we take the mutex successfully, we can update the vm_resident flag 
to false and allow the eviction.


The rest should work just the way I proposed earlier.

Regards,
  Felix

On 2019-12-04 8:01 p.m., Sierra Guiza, Alejandro (Alex) wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi Christian,
I wonder if you have had time to check on this implementation?
Please let me know if there's something I could help you with.

Regards,
Alex

-Original Message-
From: Christian König 
Sent: Friday, November 29, 2019 1:41 AM
To: Kuehling, Felix ; Koenig, Christian ; Sierra Guiza, Alejandro (Alex) 
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: Deadlock on PTEs update for HMM

[CAUTION: External Email]

Hi Felix,

yes that is exactly my thinking as well. Problem is that getting this 
to work was much harder than I thought.


We can't use a mutex because TTM calls the eviction callback in 
atomic context. A spinlock doesn't look like a good idea either 
because we potentially need to wait for the hardware with a fixed IB 
pool.


Because of this I've started to rewrite the TTM handling to not call 
the driver in an atomic context any more, but that took me way longer 
than expected as well.


I'm currently experimenting with a trylock on a driver mutex; at 
least that should work for now until we have something better.


Regards,
Christian.

On 2019-11-28 at 21:30, Felix Kuehling wrote:

Hi Christian,

I'm thinking about this problem, trying to come up with a solution.
The fundamental problem is that we need low-overhead access to the
page table in the MMU notifier, without much memory management or
locking.

There is one "driver lock" that we're supposed to take in the MMU
notifier as well as when we update page tables that is prescribed by
the HMM documentation (Documentation/vm/hmm.rst). I don't currently
see such a lock in amdgpu. We'll probably need to add that anyway,
with all the usual precautions about lock dependencies around MMU
notifiers. Then we could use that lock to protect page table residency
state, in addition to the reservation of the top-level page directory.

We don't want to block eviction of page tables unconditionally, so the
MMU notifier must be able to deal with the situation that page tables
are not resident at the moment. But the lock can delay page tables
from being evicted while an MMU notifier is in progress and protect us
from race conditions between MMU notifiers invalidating PTEs, and page
tables getting evicted.

amdgpu_vm_bo_invalidate could detect when a page table is being
evicted, and update a new "vm_resident" flag inside the amdgpu_vm
while holding the "HMM driver lock". If an MMU notifier is in
progress, trying to take the "HMM driver lock" will delay the eviction
long enough for any pending PTE invalidation to complete.

In the case that page tables are not resident (vm_resident flag is
false), it means the GPU is currently not accessing any memory in that
amdgpu_vm address space. So we don't need to invalidate the PTEs right
away. I think we could implement a deferred invalidation mechanism for
this case, that delays the invalidation until the next time the page
tables are made resident. amdgpu_amdkfd_gpuvm_restore_process_bos
would replay any deferred PTE invalidations after validating the page
tables but before restarting the user mode queues for the process. If
graphics ever implements page-fault-based memory management, you'd
need to do something similar in amdgpu_cs.

Once all that is in place, we should be able to update PTEs in MMU
notifiers without reserving the page tables.

If we use SDMA for updating page tables, we may need a pre-allocated
IB for use in MMU notifiers. And there are probably other details to 
be worked out about exactly how we implement the PTE invalidation in
MMU notifiers and reflecting that in the state of the amdgpu_vm and
amdgpu_bo_va_mapping.

Does this idea sound reasonable to you? Can you think of a simpler solution?

Re: Deadlock on PTEs update for HMM

2019-12-04 Thread Felix Kuehling

[adding the mailing list back]

Christian sent a 3-patch series that I just reviewed and commented on. 
That removes the need to lock the page directory reservation for 
updating page tables. That solves a part of the problem, but not all of it.


We still need to implement that driver lock for HMM, and add a 
vm_resident flag in the vm structure that could be checked in the new 
amdgpu_vm_evictable function Christian added. That function still runs 
in atomic context (under the ttm_bo_glob.lru_lock spinlock). So we can't 
lock a mutex there. But we can use trylock and return "not evictable" if 
we fail to take the mutex.


If we take the mutex successfully, we can update the vm_resident flag to 
false and allow the eviction.
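The trylock-under-spinlock idea above can be sketched in user-space C. This is illustrative only: vm_evictable and struct vm_state here are hypothetical stand-ins built on pthreads, not the actual amdgpu interfaces.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the amdgpu_vm state under discussion. */
struct vm_state {
    pthread_mutex_t hmm_lock;   /* the "HMM driver lock" */
    bool vm_resident;           /* page tables currently resident? */
};

/* In the real driver this would run under the ttm_bo_glob.lru_lock
 * spinlock, so it must not sleep: trylock instead of lock, and
 * treat contention as "not evictable". */
static bool vm_evictable(struct vm_state *vm)
{
    if (pthread_mutex_trylock(&vm->hmm_lock) != 0)
        return false;           /* an MMU notifier holds the lock */
    vm->vm_resident = false;    /* mark the VM evicted, allow eviction */
    pthread_mutex_unlock(&vm->hmm_lock);
    return true;
}
```

While an MMU notifier is in flight it holds the mutex, so the trylock fails and TTM simply retries the eviction later.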


The rest should work just the way I proposed earlier.

Regards,
  Felix

On 2019-12-04 8:01 p.m., Sierra Guiza, Alejandro (Alex) wrote:


Hi Christian,
I wonder if you have had time to check on this implementation?
Please let me know if there's something I could help you with.

Regards,
Alex

-Original Message-
From: Christian König 
Sent: Friday, November 29, 2019 1:41 AM
To: Kuehling, Felix ; Koenig, Christian ; Sierra Guiza, Alejandro (Alex) 
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: Deadlock on PTEs update for HMM


Hi Felix,

yes that is exactly my thinking as well. Problem is that getting this to work 
was much harder than I thought.

We can't use a mutex because TTM calls the eviction callback in atomic 
context. A spinlock doesn't look like a good idea either because we 
potentially need to wait for the hardware with a fixed IB pool.

Because of this I've started to rewrite the TTM handling to not call the driver 
in an atomic context any more, but that took me way longer than expected as 
well.

I'm currently experimenting with a trylock on a driver mutex; at least 
that should work for now until we have something better.

Regards,
Christian.

On 2019-11-28 at 21:30, Felix Kuehling wrote:

Hi Christian,

I'm thinking about this problem, trying to come up with a solution.
The fundamental problem is that we need low-overhead access to the
page table in the MMU notifier, without much memory management or
locking.

There is one "driver lock" that we're supposed to take in the MMU
notifier as well as when we update page tables that is prescribed by
the HMM documentation (Documentation/vm/hmm.rst). I don't currently
see such a lock in amdgpu. We'll probably need to add that anyway,
with all the usual precautions about lock dependencies around MMU
notifiers. Then we could use that lock to protect page table residency
state, in addition to the reservation of the top-level page directory.

We don't want to block eviction of page tables unconditionally, so the
MMU notifier must be able to deal with the situation that page tables
are not resident at the moment. But the lock can delay page tables
from being evicted while an MMU notifier is in progress and protect us
from race conditions between MMU notifiers invalidating PTEs, and page
tables getting evicted.

amdgpu_vm_bo_invalidate could detect when a page table is being
evicted, and update a new "vm_resident" flag inside the amdgpu_vm
while holding the "HMM driver lock". If an MMU notifier is in
progress, trying to take the "HMM driver lock" will delay the eviction
long enough for any pending PTE invalidation to complete.

In the case that page tables are not resident (vm_resident flag is
false), it means the GPU is currently not accessing any memory in that
amdgpu_vm address space. So we don't need to invalidate the PTEs right
away. I think we could implement a deferred invalidation mechanism for
this case, that delays the invalidation until the next time the page
tables are made resident. amdgpu_amdkfd_gpuvm_restore_process_bos
would replay any deferred PTE invalidations after validating the page
tables but before restarting the user mode queues for the process. If
graphics ever implements page-fault-based memory management, you'd
need to do something similar in amdgpu_cs.

Once all that is in place, we should be able to update PTEs in MMU
notifiers without reserving the page tables.

If we use SDMA for updating page tables, we may need a pre-allocated
IB for use in MMU notifiers. And there are probably other details to 
be worked out about exactly how we implement the PTE invalidation in
MMU notifiers and reflecting that in the state of the amdgpu_vm and
amdgpu_bo_va_mapping.

Does this idea sound reasonable to you? Can you think of a simpler
solution?

Thanks,
   Felix

On 2019-11-27 10:02 a.m., Christian König wrote:

Hi Alejandro,

yes I'm very aware of this issue, but unfortunately can't give an
easy solution either.

I'm working for over a year now on getting this fixed, but
unfortunately it turned out that this problem is much bigger than
initially thought.

Setting the appropriate GFP flags for the job allocation is actually the trivial part.

Re: Deadlock on PTEs update for HMM

2019-11-29 Thread Philip Yang
Yes, this can work the same way as dqm_lock. This is the trivial part; 
Felix and Christian are discussing the solution to the locking problem.


Regards,
Philip

On 2019-11-28 7:35 p.m., Zeng, Oak wrote:


Is kmalloc with GFP_NOWAIT an option here?

Regards,

Oak

*From:* amd-gfx  *On Behalf Of * 
Sierra Guiza, Alejandro (Alex)

*Sent:* Wednesday, November 27, 2019 9:55 AM
*To:* Koenig, Christian ; Kuehling, Felix 


*Cc:* amd-gfx@lists.freedesktop.org
*Subject:* Deadlock on PTEs update for HMM

Hi Christian,

As you know, we’re working on the HMM enablement. I’m working on the dGPU 
page table entry invalidation for the userptr mapping case. Currently, 
the MMU notifier handler stops all user mode queues, schedules a delayed 
worker to re-validate userptr mappings, and restarts the queues.


As part of the HMM functionality, we need to invalidate the page table 
entries instead of stopping the queues. At the same time we need to move 
the revalidation of the userptr mappings into the page fault handler.


We’re seeing a deadlock warning after we try to invalidate the PTEs 
inside the MMU notifier handler. More specifically, when we try to update 
the BOs to invalidate PTEs using amdgpu_vm_bo_update. This uses kmalloc 
in amdgpu_job_alloc, which seems to be causing this problem.


Based on @Kuehling, Felix <mailto:felix.kuehl...@amd.com> comments, 
kmalloc without any special flags can cause memory reclaim. Doing that 
inside an MMU notifier is problematic, because an MMU notifier may be 
called inside a memory-reclaim operation itself. That would result in 
recursion. Also, reclaim shouldn't be done while holding a lock that can 
be taken in an MMU notifier for the same reason. If you cause a reclaim 
while holding that lock, then an MMU notifier called by the reclaim 
can deadlock trying to take the same lock.


Please let us know if you have any advice to enable this the right way

Thanks in advance,

Alejandro


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



Re: Deadlock on PTEs update for HMM

2019-11-28 Thread Christian König

Hi Felix,

yes that is exactly my thinking as well. Problem is that getting this to 
work was much harder than I thought.


We can't use a mutex because TTM calls the eviction callback in 
atomic context. A spinlock doesn't look like a good idea either 
because we potentially need to wait for the hardware with a fixed IB pool.


Because of this I've started to rewrite the TTM handling to not call the 
driver in an atomic context any more, but that took me way longer than 
expected as well.


I'm currently experimenting with a trylock on a driver mutex; at 
least that should work for now until we have something better.


Regards,
Christian.

On 2019-11-28 at 21:30, Felix Kuehling wrote:

Hi Christian,

I'm thinking about this problem, trying to come up with a solution. 
The fundamental problem is that we need low-overhead access to the 
page table in the MMU notifier, without much memory management or 
locking.


There is one "driver lock" that we're supposed to take in the MMU 
notifier as well as when we update page tables that is prescribed by 
the HMM documentation (Documentation/vm/hmm.rst). I don't currently 
see such a lock in amdgpu. We'll probably need to add that anyway, 
with all the usual precautions about lock dependencies around MMU 
notifiers. Then we could use that lock to protect page table residency 
state, in addition to the reservation of the top-level page directory.


We don't want to block eviction of page tables unconditionally, so the 
MMU notifier must be able to deal with the situation that page tables 
are not resident at the moment. But the lock can delay page tables 
from being evicted while an MMU notifier is in progress and protect us 
from race conditions between MMU notifiers invalidating PTEs, and page 
tables getting evicted.


amdgpu_vm_bo_invalidate could detect when a page table is being 
evicted, and update a new "vm_resident" flag inside the amdgpu_vm 
while holding the "HMM driver lock". If an MMU notifier is in 
progress, trying to take the "HMM driver lock" will delay the eviction 
long enough for any pending PTE invalidation to complete.


In the case that page tables are not resident (vm_resident flag is 
false), it means the GPU is currently not accessing any memory in that 
amdgpu_vm address space. So we don't need to invalidate the PTEs right 
away. I think we could implement a deferred invalidation mechanism for 
this case, that delays the invalidation until the next time the page 
tables are made resident. amdgpu_amdkfd_gpuvm_restore_process_bos 
would replay any deferred PTE invalidations after validating the page 
tables but before restarting the user mode queues for the process. If 
graphics ever implements page-fault-based memory management, you'd 
need to do something similar in amdgpu_cs.


Once all that is in place, we should be able to update PTEs in MMU 
notifiers without reserving the page tables.


If we use SDMA for updating page tables, we may need a pre-allocated 
IB for use in MMU notifiers. And there are probably other details to 
be worked out about exactly how we implement the PTE invalidation in 
MMU notifiers and reflecting that in the state of the amdgpu_vm and 
amdgpu_bo_va_mapping.


Does this idea sound reasonable to you? Can you think of a simpler 
solution?


Thanks,
  Felix

On 2019-11-27 10:02 a.m., Christian König wrote:

Hi Alejandro,

yes I'm very aware of this issue, but unfortunately can't give an 
easy solution either.


I've been working on getting this fixed for over a year now, but 
unfortunately it turned out that this problem is much bigger than 
initially thought.


Setting the appropriate GFP flags for the job allocation is actually 
the trivial part.


The really really hard thing is that we somehow need to add a lock to 
prevent the page tables from being evicted. And as you also figured 
out that lock can't be taken easily anywhere else.


I've already written a prototype for this, but haven't had time to 
hammer it into shape for upstreaming yet.


Regards,
Christian.

On 2019-11-27 at 15:55, Sierra Guiza, Alejandro (Alex) wrote:


Hi Christian,

As you know, we’re working on the HMM enablement. I’m working on the 
dGPU page table entry invalidation for the userptr mapping case. 
Currently, the MMU notifier handler stops all user mode queues, 
schedules a delayed worker to re-validate userptr mappings, and 
restarts the queues.


As part of the HMM functionality, we need to invalidate the page table 
entries instead of stopping the queues. At the same time we need to 
move the revalidation of the userptr mappings into the page fault 
handler.


We’re seeing a deadlock warning after we try to invalidate the PTEs 
inside the MMU notifier handler. More specifically, when we try to 
update the BOs to invalidate PTEs using amdgpu_vm_bo_update. This uses 
kmalloc in amdgpu_job_alloc, which seems to be causing this problem.


Based on @Kuehling, Felix comments, kmalloc without any special flags 
can cause memory reclaim.

RE: Deadlock on PTEs update for HMM

2019-11-28 Thread Zeng, Oak

Is kmalloc with GFP_NOWAIT an option here?

Regards,
Oak

From: amd-gfx  On Behalf Of Sierra 
Guiza, Alejandro (Alex)
Sent: Wednesday, November 27, 2019 9:55 AM
To: Koenig, Christian ; Kuehling, Felix 

Cc: amd-gfx@lists.freedesktop.org
Subject: Deadlock on PTEs update for HMM

Hi Christian,
As you know, we're working on the HMM enablement. I'm working on the dGPU 
page table entry invalidation for the userptr mapping case. Currently, the 
MMU notifier handler stops all user mode queues, schedules a delayed 
worker to re-validate userptr mappings, and restarts the queues.
As part of the HMM functionality, we need to invalidate the page table 
entries instead of stopping the queues. At the same time we need to move 
the revalidation of the userptr mappings into the page fault handler.
We're seeing a deadlock warning after we try to invalidate the PTEs inside 
the MMU notifier handler. More specifically, when we try to update the BOs 
to invalidate PTEs using amdgpu_vm_bo_update. This uses kmalloc in 
amdgpu_job_alloc, which seems to be causing this problem.
Based on @Kuehling, Felix<mailto:felix.kuehl...@amd.com> comments, kmalloc 
without any special flags can cause memory reclaim. Doing that inside an MMU 
notifier is problematic, because an MMU notifier may be called inside a 
memory-reclaim operation itself. That would result in recursion. Also, reclaim 
shouldn't be done while holding a lock that can be taken in an MMU notifier for 
the same reason. If you cause a reclaim while holding that lock, then an MMU 
notifier called by the reclaim can deadlock trying to take the same lock.
Please let us know if you have any advice to enable this the right way

Thanks in advance,
Alejandro


Re: Deadlock on PTEs update for HMM

2019-11-28 Thread Felix Kuehling

Hi Christian,

I'm thinking about this problem, trying to come up with a solution. The 
fundamental problem is that we need low-overhead access to the page 
table in the MMU notifier, without much memory management or locking.


There is one "driver lock" that we're supposed to take in the MMU 
notifier as well as when we update page tables that is prescribed by the 
HMM documentation (Documentation/vm/hmm.rst). I don't currently see such 
a lock in amdgpu. We'll probably need to add that anyway, with all the 
usual precautions about lock dependencies around MMU notifiers. Then we 
could use that lock to protect page table residency state, in addition 
to the reservation of the top-level page directory.
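A minimal user-space sketch of that driver-lock pattern, built on pthreads for illustration. All names here are hypothetical (not amdgpu API); the real interlock would follow Documentation/vm/hmm.rst, where both the invalidate callback and the page table update path serialize on the same lock and the update path retries if an invalidation raced with it.

```c
#include <pthread.h>

/* Hypothetical stand-in for the per-VM state. */
struct hmm_vm {
    pthread_mutex_t driver_lock; /* the lock hmm.rst prescribes */
    unsigned long notifier_seq;  /* bumped on every invalidation */
    unsigned long valid_pte;     /* stands in for a page table entry */
};

/* MMU notifier side: invalidate PTEs under the driver lock. */
static void mmu_notifier_invalidate(struct hmm_vm *vm)
{
    pthread_mutex_lock(&vm->driver_lock);
    vm->notifier_seq++;
    vm->valid_pte = 0;          /* clear the (fake) PTE */
    pthread_mutex_unlock(&vm->driver_lock);
}

/* Update side: snapshot the sequence, prepare the update (which may
 * sleep), then commit under the lock only if no invalidation
 * happened in between; otherwise retry. */
static void update_ptes(struct hmm_vm *vm, unsigned long pte)
{
    for (;;) {
        unsigned long seq = vm->notifier_seq; /* snapshot */
        /* ... fault pages, build the new mapping here ... */
        pthread_mutex_lock(&vm->driver_lock);
        if (seq == vm->notifier_seq) {
            vm->valid_pte = pte; /* commit the update */
            pthread_mutex_unlock(&vm->driver_lock);
            return;
        }
        pthread_mutex_unlock(&vm->driver_lock); /* raced: retry */
    }
}
```

The sequence number is what lets the update path detect a concurrent invalidation without holding the lock across the (sleeping) page fault work.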


We don't want to block eviction of page tables unconditionally, so the 
MMU notifier must be able to deal with the situation that page tables 
are not resident at the moment. But the lock can delay page tables from 
being evicted while an MMU notifier is in progress and protect us from 
race conditions between MMU notifiers invalidating PTEs, and page tables 
getting evicted.


amdgpu_vm_bo_invalidate could detect when a page table is being evicted, 
and update a new "vm_resident" flag inside the amdgpu_vm while holding 
the "HMM driver lock". If an MMU notifier is in progress, trying to take 
the "HMM driver lock" will delay the eviction long enough for any 
pending PTE invalidation to complete.


In the case that page tables are not resident (vm_resident flag is 
false), it means the GPU is currently not accessing any memory in that 
amdgpu_vm address space. So we don't need to invalidate the PTEs right 
away. I think we could implement a deferred invalidation mechanism for 
this case, that delays the invalidation until the next time the page 
tables are made resident. amdgpu_amdkfd_gpuvm_restore_process_bos would 
replay any deferred PTE invalidations after validating the page tables 
but before restarting the user mode queues for the process. If graphics 
ever implements page-fault-based memory management, you'd need to do 
something similar in amdgpu_cs.
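The deferred-invalidation idea could look roughly like this. It is a sketch under stated assumptions: the struct and function names are invented, the queue is a fixed array, and a real implementation would need locking and proper queue-overflow handling.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_DEFERRED 16

struct inval_range { uint64_t start, end; };

/* Hypothetical stand-in for the amdgpu_vm under discussion. */
struct amdgpu_vm_sketch {
    bool vm_resident;                      /* page tables resident? */
    struct inval_range deferred[MAX_DEFERRED];
    int num_deferred;
    int ptes_invalidated;                  /* counts actual PTE clears */
};

static void do_invalidate(struct amdgpu_vm_sketch *vm,
                          uint64_t start, uint64_t end)
{
    (void)start; (void)end;
    vm->ptes_invalidated++;                /* real code clears PTEs here */
}

/* MMU notifier path: invalidate now, or queue for later if the page
 * tables are evicted (the GPU isn't accessing this address space). */
static void vm_invalidate_range(struct amdgpu_vm_sketch *vm,
                                uint64_t start, uint64_t end)
{
    if (vm->vm_resident)
        do_invalidate(vm, start, end);
    else if (vm->num_deferred < MAX_DEFERRED)  /* overflow unhandled */
        vm->deferred[vm->num_deferred++] =
            (struct inval_range){ start, end };
}

/* Restore path: validate page tables, then replay deferred
 * invalidations before user mode queues are restarted. */
static void vm_restore(struct amdgpu_vm_sketch *vm)
{
    vm->vm_resident = true;
    for (int i = 0; i < vm->num_deferred; i++)
        do_invalidate(vm, vm->deferred[i].start, vm->deferred[i].end);
    vm->num_deferred = 0;
}
```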


Once all that is in place, we should be able to update PTEs in MMU 
notifiers without reserving the page tables.


If we use SDMA for updating page tables, we may need a pre-allocated IB 
for use in MMU notifiers. And there are probably other details to be 
worked out about exactly how we implement the PTE invalidation in MMU 
notifiers and reflecting that in the state of the amdgpu_vm and 
amdgpu_bo_va_mapping.


Does this idea sound reasonable to you? Can you think of a simpler solution?

Thanks,
  Felix

On 2019-11-27 10:02 a.m., Christian König wrote:

Hi Alejandro,

yes I'm very aware of this issue, but unfortunately can't give an easy 
solution either.


I've been working on getting this fixed for over a year now, but 
unfortunately it turned out that this problem is much bigger than 
initially thought.


Setting the appropriate GFP flags for the job allocation is actually 
the trivial part.


The really really hard thing is that we somehow need to add a lock to 
prevent the page tables from being evicted. And as you also figured 
out that lock can't be taken easily anywhere else.


I've already written a prototype for this, but haven't had time to 
hammer it into shape for upstreaming yet.


Regards,
Christian.

On 2019-11-27 at 15:55, Sierra Guiza, Alejandro (Alex) wrote:


Hi Christian,

As you know, we’re working on the HMM enablement. I’m working on the 
dGPU page table entry invalidation for the userptr mapping case. 
Currently, the MMU notifier handler stops all user mode queues, 
schedules a delayed worker to re-validate userptr mappings, and 
restarts the queues.


As part of the HMM functionality, we need to invalidate the page table 
entries instead of stopping the queues. At the same time we need to 
move the revalidation of the userptr mappings into the page fault 
handler.


We’re seeing a deadlock warning after we try to invalidate the PTEs 
inside the MMU notifier handler. More specifically, when we try to 
update the BOs to invalidate PTEs using amdgpu_vm_bo_update. This uses 
kmalloc in amdgpu_job_alloc, which seems to be causing this problem.


Based on @Kuehling, Felix  comments, 
kmalloc without any special flags can cause memory reclaim. Doing 
that inside an MMU notifier is problematic, because an MMU notifier 
may be called inside a memory-reclaim operation itself. That would 
result in recursion. Also, reclaim shouldn't be done while holding a 
lock that can be taken in an MMU notifier for the same reason. If you 
cause a reclaim while holding that lock, then an MMU notifier called 
by the reclaim can deadlock trying to take the same lock.


Please let us know if you have any advice to enable this the right way

Thanks in advance,

Alejandro





Re: Deadlock on PTEs update for HMM

2019-11-27 Thread Christian König

Hi Alejandro,

yes I'm very aware of this issue, but unfortunately can't give an easy 
solution either.


I've been working on getting this fixed for over a year now, but 
unfortunately it turned out that this problem is much bigger than 
initially thought.


Setting the appropriate GFP flags for the job allocation is actually the 
trivial part.


The really really hard thing is that we somehow need to add a lock to 
prevent the page tables from being evicted. And as you also figured out 
that lock can't be taken easily anywhere else.


I've already written a prototype for this, but haven't had time to 
hammer it into shape for upstreaming yet.


Regards,
Christian.

On 2019-11-27 at 15:55, Sierra Guiza, Alejandro (Alex) wrote:


Hi Christian,

As you know, we’re working on the HMM enablement. I’m working on the 
dGPU page table entry invalidation for the userptr mapping case. 
Currently, the MMU notifier handler stops all user mode queues, 
schedules a delayed worker to re-validate userptr mappings, and 
restarts the queues.


As part of the HMM functionality, we need to invalidate the page table 
entries instead of stopping the queues. At the same time we need to 
move the revalidation of the userptr mappings into the page fault handler.


We’re seeing a deadlock warning after we try to invalidate the PTEs 
inside the MMU notifier handler. More specifically, when we try to 
update the BOs to invalidate PTEs using amdgpu_vm_bo_update. This uses 
kmalloc in amdgpu_job_alloc, which seems to be causing this problem.


Based on @Kuehling, Felix  comments, 
kmalloc without any special flags can cause memory reclaim. Doing that 
inside an MMU notifier is problematic, because an MMU notifier may be 
called inside a memory-reclaim operation itself. That would result in 
recursion. Also, reclaim shouldn't be done while holding a lock that 
can be taken in an MMU notifier for the same reason. If you cause a 
reclaim while holding that lock, then an MMU notifier called by the 
reclaim can deadlock trying to take the same lock.


Please let us know if you have any advice to enable this the right way

Thanks in advance,

Alejandro




Deadlock on PTEs update for HMM

2019-11-27 Thread Sierra Guiza, Alejandro (Alex)
Hi Christian,
As you know, we're working on the HMM enablement. I'm working on the dGPU 
page table entry invalidation for the userptr mapping case. Currently, the 
MMU notifier handler stops all user mode queues, schedules a delayed 
worker to re-validate userptr mappings, and restarts the queues.
As part of the HMM functionality, we need to invalidate the page table 
entries instead of stopping the queues. At the same time we need to move 
the revalidation of the userptr mappings into the page fault handler.
We're seeing a deadlock warning after we try to invalidate the PTEs inside 
the MMU notifier handler. More specifically, when we try to update the BOs 
to invalidate PTEs using amdgpu_vm_bo_update. This uses kmalloc in 
amdgpu_job_alloc, which seems to be causing this problem.
Based on @Kuehling, Felix comments, kmalloc 
without any special flags can cause memory reclaim. Doing that inside an MMU 
notifier is problematic, because an MMU notifier may be called inside a 
memory-reclaim operation itself. That would result in recursion. Also, reclaim 
shouldn't be done while holding a lock that can be taken in an MMU notifier for 
the same reason. If you cause a reclaim while holding that lock, then an MMU 
notifier called by the reclaim can deadlock trying to take the same lock.
Please let us know if you have any advice to enable this the right way

Thanks in advance,
Alejandro
