Re: Explicit VM updates

2022-06-02 Thread Christian König

On 02.06.22 at 16:21, Felix Kuehling wrote:

[SNIP]


In other words the free is not waiting for the unmap to complete, 
but causes command submissions through the kernel to depend on 
the unmap.


I guess I don't understand that dependency. The next command 
submission obviously cannot use the memory that was unmapped. But 
why does it need to synchronize with the unmap operation?


Because of the necessary TLB flush. Only after that has executed can we 
be sure that nobody has access to the memory any more, and only then can 
we actually free it.


So freeing the memory has to wait for the TLB flush. Why does the 
next command submission need to wait?


Because that's the one triggering the TLB flush. The issue is that 
flushing the TLB while the VMID is in use is really unreliable on 
most hardware generations.


It's been working well enough with ROCm. With user mode command 
submission there is no way to block GPU work while a TLB flush is in 
progress.


Yeah, but at least on Navi 1x that's so horribly broken that the SDMA 
could write anywhere if we tried this.






User mode submissions are completely unrelated to that.


I mention user mode command submission because there is no way to 
enforce the synchronization you describe here on a user mode 
queue. So this approach is not very future proof.


With user mode queues you need to wait for the work on the queue to 
finish anyway or otherwise you run into VM faults if you just unmap 
or free the memory.


If the next command submission doesn't use the unmapped/freed 
memory, why does it need to wait for the TLB flush?


Because it could potentially use it. When userspace lies to the 
kernel and still accesses the mapping we would allow access to freed 
up memory and create a major security problem.


I'm aware of the potential security problem. That's why I'm 
recommending you don't actually free the memory until the TLB flush is 
done. So a bogus access will either harmlessly access memory that's 
not freed yet, or it will VM fault. It will never access memory that's 
already freed and potentially allocated by someone else.


Yes, that's the idea. The question is just when we can do the TLB flush.
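
Just to illustrate the idea, here is a minimal sketch of "hold the memory 
until the flush signaled", built on the generic dma_fence callback API. It 
is only a sketch under assumptions: defer_free_until_tlb_flush() and the 
release hook are hypothetical helpers, not existing amdgpu code, and the 
real driver would track this through the BO's reservation object instead 
of a side structure.

#include <linux/container_of.h>
#include <linux/dma-fence.h>
#include <linux/slab.h>

struct deferred_free {
	struct dma_fence_cb cb;
	void (*release)(void *backing);	/* hypothetical release hook */
	void *backing;			/* backing pages/VRAM of the freed BO */
};

static void deferred_free_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct deferred_free *df = container_of(cb, struct deferred_free, cb);

	df->release(df->backing);	/* safe: unmap + TLB flush signaled */
	kfree(df);
}

/* Called at "free" time with the fence covering the last unmap/TLB flush. */
static int defer_free_until_tlb_flush(struct dma_fence *flush_fence,
				      void (*release)(void *), void *backing)
{
	struct deferred_free *df = kzalloc(sizeof(*df), GFP_KERNEL);

	if (!df)
		return -ENOMEM;
	df->release = release;
	df->backing = backing;

	/* If the fence already signaled the callback isn't installed,
	 * so the memory can be released immediately. */
	if (dma_fence_add_callback(flush_fence, &df->cb, deferred_free_cb)) {
		release(backing);
		kfree(df);
	}
	return 0;
}

Nothing has to block here; the actual release of the memory simply becomes 
dependent on the flush fence.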

If it is using the unmapped/freed memory, that's a user mode bug. 
But waiting for the TLB flush won't fix that. It will only turn a 
likely VM fault into a certain VM fault.


Yeah, exactly that's the intention here.



The guarantee you need to give is, that the memory is not freed and 
reused by anyone else until the TLB flush is done. This dependency 
requires synchronization of the "free" operation with the TLB flush. 
It does not require synchronization with any future command 
submissions in the context that freed the memory.


See above, the future command submission is what triggers the TLB flush, 
because only then can we execute it easily without too much hassle.
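
To make that dependency concrete, a hedged sketch of "the next CS waits for 
the pending unmap": when building the next kernel CS job, the VM's 
remembered last-unmap fence is pulled in as a hard dependency. The 
drm_sched_job_add_dependency() helper is used purely for illustration; 
amdgpu really tracks this through its own amdgpu_sync bookkeeping, and 
cs_depend_on_pending_unmap() is a made-up name.

#include <linux/dma-fence.h>
#include <drm/gpu_scheduler.h>

/* Make the next kernel CS job depend on the VM's pending last-unmap fence,
 * so the TLB flush emitted around that job only runs after the page tables
 * were actually updated. */
static int cs_depend_on_pending_unmap(struct drm_sched_job *job,
				      struct dma_fence *last_unmap_fence)
{
	if (!last_unmap_fence)
		return 0;	/* nothing pending, nothing to wait for */

	/* The helper takes ownership of one fence reference. */
	return drm_sched_job_add_dependency(job, dma_fence_get(last_unmap_fence));
}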


That seems to be a limitation of your current command submission 
model. User mode command submission will not be able to trigger a TLB 
flush. Unmapping or freeing memory should be the trigger in that case.


That's how it works with KFD. That said, our TLB flushes aren't as 
well pipelined (which could probably be improved), and your strategy 
can probably batch more TLB flushes, so I see where you're coming from.


Well, the mapping/unmapping IOCTL should certainly trigger the TLB 
flushes for the user mode queues, but as I said this is completely 
independent of the issue here.


The limitation is on the kernel CS IOCTL, not the VM IOCTL. So that is 
completely unrelated to this.




For Vega and Navi 2x we could use async TLB flushes, and on gfx6, gfx7 
and gfx8 we could use double TLB flushes with grace time, but Navi 1x 
is so horribly broken in this regard that I don't see how else we 
could do it.


We're using heavy-weight TLB flushes on SOC15 GPUs. On Vega20 with 
XGMI we need double flushes to be safe.


I'm raising my concerns because I don't think making user mode wait is 
a good strategy long-term. And I believe this explicit sync and 
explicit VM update should be designed with an eye for future user-mode 
command submission models.


Yeah, but as already discussed with Daniel and Jason, that will never 
ever work correctly. IOCTLs can't depend on user mode queues in any way. 
So user space can only block, or rather call the map, unmap and free 
functions at the right time.


If you need short-term workarounds for broken hardware, that's another 
issue. But it would be good if that could be kept out of the API.


Well as I said that is completely unrelated to user mode queues. The 
restriction is on the CS API, not the VM API.


Regards,
Christian.




Regards,
  Felix




Regards,
Christian.



Regards,
  Felix




The signal that TLB flush is completed comes from the MES in this 
case.


Regards,
Christian.



Regards,
  Felix




Regards,
Christian.



Regards,
  Felix




3. All VM operations requested by userspace will still be 
executed in order, e.g. we can't run unmap + map in parallel or something like this.

Re: Explicit VM updates

2022-06-02 Thread Felix Kuehling



On 2022-06-02 at 02:47, Christian König wrote:



On 01.06.22 at 19:39, Felix Kuehling wrote:


On 2022-06-01 at 13:22, Christian König wrote:

On 01.06.22 at 19:07, Felix Kuehling wrote:

On 2022-06-01 at 12:29, Christian König wrote:

On 01.06.22 at 17:05, Felix Kuehling wrote:


On 2022-06-01 at 08:40, Christian König wrote:

Hey guys,

so today Bas came up with a new requirement regarding the 
explicit synchronization to VM updates and a bunch of prototype 
patches.


I've been thinking about that stuff for quite some time before, 
but to be honest it's one of the trickiest parts of the 
driver.


So my current thinking is that we could potentially handle those 
requirements like this:


1. We add some new EXPLICIT flag to context (or CS?) and VM 
IOCTL. This way we either get the new behavior for the whole 
CS+VM or the old one, but never both mixed.


2. When memory is unmapped we keep around the last unmap 
operation inside the bo_va.


3. When memory is freed we add all the CS fences which could 
access that memory + the last unmap operation as BOOKKEEP fences 
to the BO and as mandatory sync fence to the VM.


Memory freed either because of an eviction or because of 
userspace closing the handle will be seen as a combination of 
unmap+free.



The result is the following semantic for userspace to avoid 
implicit synchronization as much as possible:


1. When you allocate and map memory it is mandatory to either 
wait for the mapping operation to complete or to add it as 
dependency for your CS.
    If this isn't followed the application will run into CS 
faults (that's what we pretty much already implemented).


This makes sense.




2. When memory is freed you must unmap that memory first and 
then wait for this unmap operation to complete before freeing 
the memory.
    If this isn't followed the kernel will add a forced wait 
to the next CS to block until the unmap is completed.


This would work for now, but it won't work for user mode 
submission in the future. I find it weird that user mode needs to 
wait for the unmap. For user mode, unmap and free should always 
be asynchronous. I can't think of any good reasons to make user 
mode wait for the driver to clean up its stuff.


Could the waiting be done in kernel mode instead? TTM already 
does delayed freeing if there are fences outstanding on a BO 
being freed. This should make it easy to delay freeing until the 
unmap is done without blocking the user mode thread.


This is not about blocking, but synchronization dependencies.


Then I must have misunderstood this sentence: "When memory is freed 
you must unmap that memory first and then wait for this unmap 
operation to complete before freeing the memory." If the pronoun 
"you" is the user mode driver, it means user mode must wait for 
kernel mode to finish unmapping memory before freeing it. Was that 
not what you meant?


Ah, yes. The UMD must wait for the kernel to finish unmapping all 
the maps from the BO before it drops the handle of the BO and with 
that frees it.




In other words the free is not waiting for the unmap to complete, 
but causes command submissions through the kernel to depend on the 
unmap.


I guess I don't understand that dependency. The next command 
submission obviously cannot use the memory that was unmapped. But 
why does it need to synchronize with the unmap operation?


Because of the necessary TLB flush. Only after that has executed can we 
be sure that nobody has access to the memory any more, and only then can 
we actually free it.


So freeing the memory has to wait for the TLB flush. Why does the 
next command submission need to wait?


Because that's the one triggering the TLB flush. The issue is that 
flushing the TLB while the VMID is in use is really unreliable on most 
hardware generations.


It's been working well enough with ROCm. With user mode command 
submission there is no way to block GPU work while a TLB flush is in 
progress.







User mode submissions are completely unrelated to that.


I mention user mode command submission because there is no way to 
enforce the synchronization you describe here on a user mode queue. 
So this approach is not very future proof.


With user mode queues you need to wait for the work on the queue to 
finish anyway or otherwise you run into VM faults if you just unmap 
or free the memory.


If the next command submission doesn't use the unmapped/freed memory, 
why does it need to wait for the TLB flush?


Because it could potentially use it. When userspace lies to the kernel 
and still accesses the mapping we would allow access to freed up 
memory and create a major security problem.


I'm aware of the potential security problem. That's why I'm recommending 
you don't actually free the memory until the TLB flush is done. So a 
bogus access will either harmlessly access memory that's not freed yet, 
or it will VM fault. It will never access memory that's already freed 
and potentially allocated by someone else.

Re: Explicit VM updates

2022-06-02 Thread Christian König

On 01.06.22 at 19:47, Bas Nieuwenhuizen wrote:

On Wed, Jun 1, 2022 at 2:40 PM Christian König  wrote:

Hey guys,

so today Bas came up with a new requirement regarding the explicit
synchronization to VM updates and a bunch of prototype patches.

I've been thinking about that stuff for quite some time before, but to
be honest it's one of the trickiest parts of the driver.

So my current thinking is that we could potentially handle those
requirements like this:

1. We add some new EXPLICIT flag to context (or CS?) and VM IOCTL. This
way we either get the new behavior for the whole CS+VM or the old one,
but never both mixed.

Any VM-wide flag would be a concern because libdrm_amdgpu merges
things into a single drm fd for potentially multiple drivers (i.e. radv
+ amdvlk + radeonsi). I'm also not sure what you'd need a VM flag for?


What I meant was a flag for the VM IOCTL. E.g. you specify on the IOCTL 
if you want IMPLICIT or EXPLICIT synchronization.


Maybe just specifying a drm_syncobj could be used as an indicator that we 
want explicit synchronization for a VM IOCTL operation as well. That's 
what I used in the prototype, but I'm not 100% sure if that's sufficient.
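
For reference, a rough sketch of how the kernel side of such a VM IOCTL 
could hand the fence back: if a syncobj handle was passed, the operation 
counts as explicitly synced and the page-table update fence is returned 
through that syncobj. The out_syncobj_handle parameter is hypothetical 
(it is not part of the current drm_amdgpu_gem_va UAPI); the drm_syncobj 
helpers themselves are existing API.

#include <linux/dma-fence.h>
#include <drm/drm_file.h>
#include <drm/drm_syncobj.h>

static int vm_op_return_fence(struct drm_file *filp, u32 out_syncobj_handle,
			      struct dma_fence *vm_update_fence)
{
	struct drm_syncobj *syncobj;

	if (!out_syncobj_handle)
		return 0;	/* no syncobj given: keep the old implicit sync */

	syncobj = drm_syncobj_find(filp, out_syncobj_handle);
	if (!syncobj)
		return -ENOENT;

	/* Userspace can now wait on this or add it as a CS dependency. */
	drm_syncobj_replace_fence(syncobj, vm_update_fence);
	drm_syncobj_put(syncobj);
	return 0;
}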


Regards,
Christian.




2. When memory is unmapped we keep around the last unmap operation
inside the bo_va.

3. When memory is freed we add all the CS fences which could access that
memory + the last unmap operation as BOOKKEEP fences to the BO and as
mandatory sync fence to the VM.

Memory freed either because of an eviction or because of userspace
closing the handle will be seen as a combination of unmap+free.


The result is the following semantic for userspace to avoid implicit
synchronization as much as possible:

1. When you allocate and map memory it is mandatory to either wait for
the mapping operation to complete or to add it as dependency for your CS.
  If this isn't followed the application will run into CS faults
(that's what we pretty much already implemented).

2. When memory is freed you must unmap that memory first and then wait
for this unmap operation to complete before freeing the memory.
  If this isn't followed the kernel will add a forced wait to the
next CS to block until the unmap is completed.

3. All VM operations requested by userspace will still be executed in
order, e.g. we can't run unmap + map in parallel or something like this.

Is that something you guys can live with? As far as I can see it should
give you the maximum freedom possible, but is still doable.

Regards,
Christian.




Re: Explicit VM updates

2022-06-02 Thread Christian König




On 01.06.22 at 19:39, Felix Kuehling wrote:


On 2022-06-01 at 13:22, Christian König wrote:

On 01.06.22 at 19:07, Felix Kuehling wrote:

On 2022-06-01 at 12:29, Christian König wrote:

On 01.06.22 at 17:05, Felix Kuehling wrote:


On 2022-06-01 at 08:40, Christian König wrote:

Hey guys,

so today Bas came up with a new requirement regarding the 
explicit synchronization to VM updates and a bunch of prototype 
patches.


I've been thinking about that stuff for quite some time before, 
but to be honest it's one of the trickiest parts of the driver.


So my current thinking is that we could potentially handle those 
requirements like this:


1. We add some new EXPLICIT flag to context (or CS?) and VM 
IOCTL. This way we either get the new behavior for the whole 
CS+VM or the old one, but never both mixed.


2. When memory is unmapped we keep around the last unmap 
operation inside the bo_va.


3. When memory is freed we add all the CS fences which could 
access that memory + the last unmap operation as BOOKKEEP fences 
to the BO and as mandatory sync fence to the VM.


Memory freed either because of an eviction or because of 
userspace closing the handle will be seen as a combination of 
unmap+free.



The result is the following semantic for userspace to avoid 
implicit synchronization as much as possible:


1. When you allocate and map memory it is mandatory to either 
wait for the mapping operation to complete or to add it as 
dependency for your CS.
    If this isn't followed the application will run into CS 
faults (that's what we pretty much already implemented).


This makes sense.




2. When memory is freed you must unmap that memory first and then 
wait for this unmap operation to complete before freeing the memory.
    If this isn't followed the kernel will add a forced wait 
to the next CS to block until the unmap is completed.


This would work for now, but it won't work for user mode 
submission in the future. I find it weird that user mode needs to 
wait for the unmap. For user mode, unmap and free should always be 
asynchronous. I can't think of any good reasons to make user mode 
wait for the driver to clean up its stuff.


Could the waiting be done in kernel mode instead? TTM already does 
delayed freeing if there are fences outstanding on a BO being 
freed. This should make it easy to delay freeing until the unmap 
is done without blocking the user mode thread.


This is not about blocking, but synchronization dependencies.


Then I must have misunderstood this sentence: "When memory is freed 
you must unmap that memory first and then wait for this unmap 
operation to complete before freeing the memory." If the pronoun 
"you" is the user mode driver, it means user mode must wait for 
kernel mode to finish unmapping memory before freeing it. Was that 
not what you meant?


Ah, yes. The UMD must wait for the kernel to finish unmapping all the 
maps from the BO before it drops the handle of the BO and with that 
frees it.




In other words the free is not waiting for the unmap to complete, 
but causes command submissions through the kernel to depend on the 
unmap.


I guess I don't understand that dependency. The next command 
submission obviously cannot use the memory that was unmapped. But 
why does it need to synchronize with the unmap operation?


Because of the necessary TLB flush. Only after that has executed can we 
be sure that nobody has access to the memory any more, and only then can 
we actually free it.


So freeing the memory has to wait for the TLB flush. Why does the next 
command submission need to wait?


Because that's the one triggering the TLB flush. The issue is that 
flushing the TLB while the VMID is in use is really unreliable on most 
hardware generations.




User mode submissions are completely unrelated to that.


I mention user mode command submission because there is no way to 
enforce the synchronization you describe here on a user mode queue. 
So this approach is not very future proof.


With user mode queues you need to wait for the work on the queue to 
finish anyway or otherwise you run into VM faults if you just unmap 
or free the memory.


If the next command submission doesn't use the unmapped/freed memory, 
why does it need to wait for the TLB flush?


Because it could potentially use it. When userspace lies to the kernel 
and still accesses the mapping we would allow access to freed up memory 
and create a major security problem.




If it is using the unmapped/freed memory, that's a user mode bug. But 
waiting for the TLB flush won't fix that. It will only turn a likely 
VM fault into a certain VM fault.


Yeah, exactly that's the intention here.



The guarantee you need to give is, that the memory is not freed and 
reused by anyone else until the TLB flush is done. This dependency 
requires synchronization of the "free" operation with the TLB flush. 
It does not require synchronization with any future command 
submissions in the context that freed the memory.

Re: Explicit VM updates

2022-06-01 Thread Bas Nieuwenhuizen
On Wed, Jun 1, 2022 at 2:40 PM Christian König  wrote:
>
> Hey guys,
>
> so today Bas came up with a new requirement regarding the explicit
> synchronization to VM updates and a bunch of prototype patches.
>
> I've been thinking about that stuff for quite some time before, but to
> be honest it's one of the trickiest parts of the driver.
>
> So my current thinking is that we could potentially handle those
> requirements like this:
>
> 1. We add some new EXPLICIT flag to context (or CS?) and VM IOCTL. This
> way we either get the new behavior for the whole CS+VM or the old one,
> but never both mixed.

Any VM-wide flag would be a concern because libdrm_amdgpu merges
things into a single drm fd for potentially multiple drivers (i.e. radv
+ amdvlk + radeonsi). I'm also not sure what you'd need a VM flag for?

>
> 2. When memory is unmapped we keep around the last unmap operation
> inside the bo_va.
>
> 3. When memory is freed we add all the CS fences which could access that
> memory + the last unmap operation as BOOKKEEP fences to the BO and as
> mandatory sync fence to the VM.
>
> Memory freed either because of an eviction or because of userspace
> closing the handle will be seen as a combination of unmap+free.
>
>
> The result is the following semantic for userspace to avoid implicit
> synchronization as much as possible:
>
> 1. When you allocate and map memory it is mandatory to either wait for
> the mapping operation to complete or to add it as dependency for your CS.
>  If this isn't followed the application will run into CS faults
> (that's what we pretty much already implemented).
>
> 2. When memory is freed you must unmap that memory first and then wait
> for this unmap operation to complete before freeing the memory.
>  If this isn't followed the kernel will add a forced wait to the
> next CS to block until the unmap is completed.
>
> 3. All VM operations requested by userspace will still be executed in
> order, e.g. we can't run unmap + map in parallel or something like this.
>
> Is that something you guys can live with? As far as I can see it should
> give you the maximum freedom possible, but is still doable.
>
> Regards,
> Christian.


Re: Explicit VM updates

2022-06-01 Thread Felix Kuehling



On 2022-06-01 at 13:22, Christian König wrote:

On 01.06.22 at 19:07, Felix Kuehling wrote:

On 2022-06-01 at 12:29, Christian König wrote:

On 01.06.22 at 17:05, Felix Kuehling wrote:


On 2022-06-01 at 08:40, Christian König wrote:

Hey guys,

so today Bas came up with a new requirement regarding the explicit 
synchronization to VM updates and a bunch of prototype patches.


I've been thinking about that stuff for quite some time before, 
but to be honest it's one of the trickiest parts of the driver.


So my current thinking is that we could potentially handle those 
requirements like this:


1. We add some new EXPLICIT flag to context (or CS?) and VM IOCTL. 
This way we either get the new behavior for the whole CS+VM or the 
old one, but never both mixed.


2. When memory is unmapped we keep around the last unmap operation 
inside the bo_va.


3. When memory is freed we add all the CS fences which could 
access that memory + the last unmap operation as BOOKKEEP fences 
to the BO and as mandatory sync fence to the VM.


Memory freed either because of an eviction or because of userspace 
closing the handle will be seen as a combination of unmap+free.



The result is the following semantic for userspace to avoid 
implicit synchronization as much as possible:


1. When you allocate and map memory it is mandatory to either wait 
for the mapping operation to complete or to add it as dependency 
for your CS.
    If this isn't followed the application will run into CS faults 
(that's what we pretty much already implemented).


This makes sense.




2. When memory is freed you must unmap that memory first and then 
wait for this unmap operation to complete before freeing the memory.
    If this isn't followed the kernel will add a forced wait 
to the next CS to block until the unmap is completed.


This would work for now, but it won't work for user mode submission 
in the future. I find it weird that user mode needs to wait for the 
unmap. For user mode, unmap and free should always be asynchronous. 
I can't think of any good reasons to make user mode wait for the 
driver to clean up its stuff.


Could the waiting be done in kernel mode instead? TTM already does 
delayed freeing if there are fences outstanding on a BO being 
freed. This should make it easy to delay freeing until the unmap is 
done without blocking the user mode thread.


This is not about blocking, but synchronization dependencies.


Then I must have misunderstood this sentence: "When memory is freed 
you must unmap that memory first and then wait for this unmap 
operation to complete before freeing the memory." If the pronoun 
"you" is the user mode driver, it means user mode must wait for 
kernel mode to finish unmapping memory before freeing it. Was that 
not what you meant?


Ah, yes. The UMD must wait for the kernel to finish unmapping all the 
maps from the BO before it drops the handle of the BO and with that 
frees it.




In other words the free is not waiting for the unmap to complete, 
but causes command submissions through the kernel to depend on the 
unmap.


I guess I don't understand that dependency. The next command 
submission obviously cannot use the memory that was unmapped. But why 
does it need to synchronize with the unmap operation?


Because of the necessary TLB flush. Only after that has executed can we be 
sure that nobody has access to the memory any more, and only then can we 
actually free it.


So freeing the memory has to wait for the TLB flush. Why does the next 
command submission need to wait?







User mode submissions are completely unrelated to that.


I mention user mode command submission because there is no way to 
enforce the synchronization you describe here on a user mode queue. 
So this approach is not very future proof.


With user mode queues you need to wait for the work on the queue to 
finish anyway or otherwise you run into VM faults if you just unmap or 
free the memory.


If the next command submission doesn't use the unmapped/freed memory, 
why does it need to wait for the TLB flush?


If it is using the unmapped/freed memory, that's a user mode bug. But 
waiting for the TLB flush won't fix that. It will only turn a likely VM 
fault into a certain VM fault.


The guarantee you need to give is, that the memory is not freed and 
reused by anyone else until the TLB flush is done. This dependency 
requires synchronization of the "free" operation with the TLB flush. It 
does not require synchronization with any future command submissions in 
the context that freed the memory.


Regards,
  Felix




The signal that TLB flush is completed comes from the MES in this case.

Regards,
Christian.



Regards,
  Felix




Regards,
Christian.



Regards,
  Felix




3. All VM operations requested by userspace will still be executed 
in order, e.g. we can't run unmap + map in parallel or something 
like this.


Is that something you guys can live with? As far as I can see it 
should give you the maximum freedom possible, but is still doable.

Re: Explicit VM updates

2022-06-01 Thread Christian König

On 01.06.22 at 19:07, Felix Kuehling wrote:

On 2022-06-01 at 12:29, Christian König wrote:

On 01.06.22 at 17:05, Felix Kuehling wrote:


On 2022-06-01 at 08:40, Christian König wrote:

Hey guys,

so today Bas came up with a new requirement regarding the explicit 
synchronization to VM updates and a bunch of prototype patches.


I've been thinking about that stuff for quite some time before, but 
to be honest it's one of the trickiest parts of the driver.


So my current thinking is that we could potentially handle those 
requirements like this:


1. We add some new EXPLICIT flag to context (or CS?) and VM IOCTL. 
This way we either get the new behavior for the whole CS+VM or the 
old one, but never both mixed.


2. When memory is unmapped we keep around the last unmap operation 
inside the bo_va.


3. When memory is freed we add all the CS fences which could access 
that memory + the last unmap operation as BOOKKEEP fences to the BO 
and as mandatory sync fence to the VM.


Memory freed either because of an eviction or because of userspace 
closing the handle will be seen as a combination of unmap+free.



The result is the following semantic for userspace to avoid 
implicit synchronization as much as possible:


1. When you allocate and map memory it is mandatory to either wait 
for the mapping operation to complete or to add it as dependency 
for your CS.
    If this isn't followed the application will run into CS faults 
(that's what we pretty much already implemented).


This makes sense.




2. When memory is freed you must unmap that memory first and then 
wait for this unmap operation to complete before freeing the memory.
    If this isn't followed the kernel will add a forced wait to 
the next CS to block until the unmap is completed.


This would work for now, but it won't work for user mode submission 
in the future. I find it weird that user mode needs to wait for the 
unmap. For user mode, unmap and free should always be asynchronous. 
I can't think of any good reasons to make user mode wait for the 
driver to clean up its stuff.


Could the waiting be done in kernel mode instead? TTM already does 
delayed freeing if there are fences outstanding on a BO being freed. 
This should make it easy to delay freeing until the unmap is done 
without blocking the user mode thread.


This is not about blocking, but synchronization dependencies.


Then I must have misunderstood this sentence: "When memory is freed 
you must unmap that memory first and then wait for this unmap 
operation to complete before freeing the memory." If the pronoun "you" 
is the user mode driver, it means user mode must wait for kernel mode 
to finish unmapping memory before freeing it. Was that not what you 
meant?


Ah, yes. The UMD must wait for the kernel to finish unmapping all the 
maps from the BO before it drops the handle of the BO and with that 
frees it.




In other words the free is not waiting for the unmap to complete, but 
causes command submissions through the kernel to depend on the unmap.


I guess I don't understand that dependency. The next command 
submission obviously cannot use the memory that was unmapped. But why 
does it need to synchronize with the unmap operation?


Because of the necessary TLB flush. Only after that has executed can we be 
sure that nobody has access to the memory any more, and only then can we 
actually free it.




User mode submissions are completely unrelated to that.


I mention user mode command submission because there is no way to 
enforce the synchronization you describe here on a user mode queue. So 
this approach is not very future proof.


With user mode queues you need to wait for the work on the queue to 
finish anyway or otherwise you run into VM faults if you just unmap or 
free the memory.


The signal that TLB flush is completed comes from the MES in this case.

Regards,
Christian.



Regards,
  Felix




Regards,
Christian.



Regards,
  Felix




3. All VM operations requested by userspace will still be executed 
in order, e.g. we can't run unmap + map in parallel or something 
like this.


Is that something you guys can live with? As far as I can see it 
should give you the maximum freedom possible, but is still doable.


Regards,
Christian.






Re: Explicit VM updates

2022-06-01 Thread Felix Kuehling

On 2022-06-01 at 12:29, Christian König wrote:

On 01.06.22 at 17:05, Felix Kuehling wrote:


On 2022-06-01 at 08:40, Christian König wrote:

Hey guys,

so today Bas came up with a new requirement regarding the explicit 
synchronization to VM updates and a bunch of prototype patches.


I've been thinking about that stuff for quite some time before, but 
to be honest it's one of the trickiest parts of the driver.


So my current thinking is that we could potentially handle those 
requirements like this:


1. We add some new EXPLICIT flag to context (or CS?) and VM IOCTL. 
This way we either get the new behavior for the whole CS+VM or the 
old one, but never both mixed.


2. When memory is unmapped we keep around the last unmap operation 
inside the bo_va.


3. When memory is freed we add all the CS fences which could access 
that memory + the last unmap operation as BOOKKEEP fences to the BO 
and as mandatory sync fence to the VM.


Memory freed either because of an eviction or because of userspace 
closing the handle will be seen as a combination of unmap+free.



The result is the following semantic for userspace to avoid implicit 
synchronization as much as possible:


1. When you allocate and map memory it is mandatory to either wait 
for the mapping operation to complete or to add it as dependency for 
your CS.
    If this isn't followed the application will run into CS faults 
(that's what we pretty much already implemented).


This makes sense.




2. When memory is freed you must unmap that memory first and then 
wait for this unmap operation to complete before freeing the memory.
    If this isn't followed the kernel will add a forced wait to 
the next CS to block until the unmap is completed.


This would work for now, but it won't work for user mode submission 
in the future. I find it weird that user mode needs to wait for the 
unmap. For user mode, unmap and free should always be asynchronous. I 
can't think of any good reasons to make user mode wait for the driver 
to clean up its stuff.


Could the waiting be done in kernel mode instead? TTM already does 
delayed freeing if there are fences outstanding on a BO being freed. 
This should make it easy to delay freeing until the unmap is done 
without blocking the user mode thread.


This is not about blocking, but synchronization dependencies.


Then I must have misunderstood this sentence: "When memory is freed you 
must unmap that memory first and then wait for this unmap operation to 
complete before freeing the memory." If the pronoun "you" is the user 
mode driver, it means user mode must wait for kernel mode to finish 
unmapping memory before freeing it. Was that not what you meant?





In other words the free is not waiting for the unmap to complete, but 
causes command submissions through the kernel to depend on the unmap.


I guess I don't understand that dependency. The next command submission 
obviously cannot use the memory that was unmapped. But why does it need 
to synchronize with the unmap operation?





User mode submissions are completely unrelated to that.


I mention user mode command submission because there is no way to 
enforce the synchronization you describe here on a user mode queue. So 
this approach is not very future proof.


Regards,
  Felix




Regards,
Christian.



Regards,
  Felix




3. All VM operations requested by userspace will still be executed 
in order, e.g. we can't run unmap + map in parallel or something 
like this.


Is that something you guys can live with? As far as I can see it 
should give you the maximum freedom possible, but is still doable.


Regards,
Christian.




Re: Explicit VM updates

2022-06-01 Thread Christian König

Bas has the problem that CS implicitly waits for VM updates.

Currently, when you unmap a BO, the operation is only executed after 
all previously submitted CS have finished.


Similar for mapping BOs. The next CS will only start after all the 
pending page table updates are completed.


The mapping case was already handled by my prototype patch set, but the 
unmapping case still hurts a bit.


This implicit sync between CS and map/unmap operations can really hurt 
the performance of applications that make heavy use of PRTs.
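
To make the pattern concrete, a small userspace sketch of what hurts today, 
assumed to be built against libdrm_amdgpu (remap_prt_page() is just an 
illustrative helper with minimal error handling):

#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

/* Remap one PRT page while rendering keeps running. With implicit sync the
 * page-table update for the UNMAP only executes after all previously
 * submitted CS have finished, and the next CS then waits for the MAP. */
static int remap_prt_page(amdgpu_device_handle dev, amdgpu_bo_handle bo,
			  uint64_t bo_offset, uint64_t size,
			  uint64_t old_va, uint64_t new_va)
{
	int r;

	r = amdgpu_bo_va_op_raw(dev, bo, bo_offset, size, old_va,
				0, AMDGPU_VA_OP_UNMAP);
	if (r)
		return r;

	return amdgpu_bo_va_op_raw(dev, bo, bo_offset, size, new_va,
				   AMDGPU_VM_PAGE_READABLE |
				   AMDGPU_VM_PAGE_WRITEABLE,
				   AMDGPU_VA_OP_MAP);
}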


Regards,
Christian.

On 01.06.22 at 18:27, Marek Olšák wrote:

Can you please summarize what this is about?

Thanks,
Marek

On Wed, Jun 1, 2022 at 8:40 AM Christian König 
 wrote:


Hey guys,

so today Bas came up with a new requirement regarding the explicit
synchronization to VM updates and a bunch of prototype patches.

I've been thinking about that stuff for quite some time before,
but to
be honest it's one of the trickiest parts of the driver.

So my current thinking is that we could potentially handle those
requirements like this:

1. We add some new EXPLICIT flag to context (or CS?) and VM IOCTL.
This
way we either get the new behavior for the whole CS+VM or the old
one,
but never both mixed.

2. When memory is unmapped we keep around the last unmap operation
inside the bo_va.

3. When memory is freed we add all the CS fences which could
access that
memory + the last unmap operation as BOOKKEEP fences to the BO and as
mandatory sync fence to the VM.

Memory freed either because of an eviction or because of userspace
closing the handle will be seen as a combination of unmap+free.


The result is the following semantic for userspace to avoid implicit
synchronization as much as possible:

1. When you allocate and map memory it is mandatory to either wait
for
the mapping operation to complete or to add it as dependency for
your CS.
 If this isn't followed the application will run into CS faults
(that's what we pretty much already implemented).

2. When memory is freed you must unmap that memory first and then
wait
for this unmap operation to complete before freeing the memory.
 If this isn't followed the kernel will add a forced wait
to the
next CS to block until the unmap is completed.

3. All VM operations requested by userspace will still be executed in
order, e.g. we can't run unmap + map in parallel or something like
this.

Is that something you guys can live with? As far as I can see it
should
give you the maximum freedom possible, but is still doable.

Regards,
Christian.



Re: Explicit VM updates

2022-06-01 Thread Christian König

On 01.06.22 at 17:05, Felix Kuehling wrote:


On 2022-06-01 at 08:40, Christian König wrote:

Hey guys,

so today Bas came up with a new requirement regarding the explicit 
synchronization to VM updates and a bunch of prototype patches.


I've been thinking about that stuff for quite some time before, but 
to be honest it's one of the trickiest parts of the driver.


So my current thinking is that we could potentially handle those 
requirements like this:


1. We add some new EXPLICIT flag to context (or CS?) and VM IOCTL. 
This way we either get the new behavior for the whole CS+VM or the 
old one, but never both mixed.


2. When memory is unmapped we keep around the last unmap operation 
inside the bo_va.


3. When memory is freed we add all the CS fences which could access 
that memory + the last unmap operation as BOOKKEEP fences to the BO 
and as mandatory sync fence to the VM.


Memory freed either because of an eviction or because of userspace 
closing the handle will be seen as a combination of unmap+free.



The result is the following semantic for userspace to avoid implicit 
synchronization as much as possible:


1. When you allocate and map memory it is mandatory to either wait 
for the mapping operation to complete or to add it as dependency for 
your CS.
    If this isn't followed the application will run into CS faults 
(that's what we pretty much already implemented).


This makes sense.




2. When memory is freed you must unmap that memory first and then 
wait for this unmap operation to complete before freeing the memory.
    If this isn't followed the kernel will add a forced wait to 
the next CS to block until the unmap is completed.


This would work for now, but it won't work for user mode submission in 
the future. I find it weird that user mode needs to wait for the 
unmap. For user mode, unmap and free should always be asynchronous. I 
can't think of any good reasons to make user mode wait for the driver 
to clean up its stuff.


Could the waiting be done in kernel mode instead? TTM already does 
delayed freeing if there are fences outstanding on a BO being freed. 
This should make it easy to delay freeing until the unmap is done 
without blocking the user mode thread.


This is not about blocking, but synchronization dependencies.

In other words the free is not waiting for the unmap to complete, but 
causes command submissions through the kernel to depend on the unmap.


User mode submissions are completely unrelated to that.

Regards,
Christian.



Regards,
  Felix




3. All VM operations requested by userspace will still be executed in 
order, e.g. we can't run unmap + map in parallel or something like this.


Is that something you guys can live with? As far as I can see it 
should give you the maximum freedom possible, but is still doable.


Regards,
Christian.




Re: Explicit VM updates

2022-06-01 Thread Marek Olšák
Can you please summarize what this is about?

Thanks,
Marek

On Wed, Jun 1, 2022 at 8:40 AM Christian König 
wrote:

> Hey guys,
>
> so today Bas came up with a new requirement regarding the explicit
> synchronization to VM updates and a bunch of prototype patches.
>
> I've been thinking about that stuff for quite some time before, but to
> be honest it's one of the trickiest parts of the driver.
>
> So my current thinking is that we could potentially handle those
> requirements like this:
>
> 1. We add some new EXPLICIT flag to context (or CS?) and VM IOCTL. This
> way we either get the new behavior for the whole CS+VM or the old one,
> but never both mixed.
>
> 2. When memory is unmapped we keep around the last unmap operation
> inside the bo_va.
>
> 3. When memory is freed we add all the CS fences which could access that
> memory + the last unmap operation as BOOKKEEP fences to the BO and as
> mandatory sync fence to the VM.
>
> Memory freed either because of an eviction or because of userspace
> closing the handle will be seen as a combination of unmap+free.
>
>
> The result is the following semantic for userspace to avoid implicit
> synchronization as much as possible:
>
> 1. When you allocate and map memory it is mandatory to either wait for
> the mapping operation to complete or to add it as dependency for your CS.
>  If this isn't followed the application will run into CS faults
> (that's what we pretty much already implemented).
>
> 2. When memory is freed you must unmap that memory first and then wait
> for this unmap operation to complete before freeing the memory.
>  If this isn't followed the kernel will add a forced wait to the
> next CS to block until the unmap is completed.
>
> 3. All VM operations requested by userspace will still be executed in
> order, e.g. we can't run unmap + map in parallel or something like this.
>
> Is that something you guys can live with? As far as I can see it should
> give you the maximum freedom possible, but is still doable.
>
> Regards,
> Christian.
>


Re: Explicit VM updates

2022-06-01 Thread Felix Kuehling



On 2022-06-01 at 08:40, Christian König wrote:

Hey guys,

so today Bas came up with a new requirement regarding the explicit 
synchronization to VM updates and a bunch of prototype patches.


I've been thinking about that stuff for quite some time before, but to 
be honest it's one of the trickiest parts of the driver.


So my current thinking is that we could potentially handle those 
requirements like this:


1. We add some new EXPLICIT flag to context (or CS?) and VM IOCTL. 
This way we either get the new behavior for the whole CS+VM or the old 
one, but never both mixed.


2. When memory is unmapped we keep around the last unmap operation 
inside the bo_va.


3. When memory is freed we add all the CS fences which could access 
that memory + the last unmap operation as BOOKKEEP fences to the BO 
and as mandatory sync fence to the VM.


Memory freed either because of an eviction or because of userspace 
closing the handle will be seen as a combination of unmap+free.



The result is the following semantic for userspace to avoid implicit 
synchronization as much as possible:


1. When you allocate and map memory it is mandatory to either wait for 
the mapping operation to complete or to add it as dependency for your CS.
    If this isn't followed the application will run into CS faults 
(that's what we pretty much already implemented).


This makes sense.




2. When memory is freed you must unmap that memory first and then wait 
for this unmap operation to complete before freeing the memory.
    If this isn't followed the kernel will add a forced wait to 
the next CS to block until the unmap is completed.


This would work for now, but it won't work for user mode submission in 
the future. I find it weird that user mode needs to wait for the unmap. 
For user mode, unmap and free should always be asynchronous. I can't 
think of any good reasons to make user mode wait for the driver to clean 
up its stuff.


Could the waiting be done in kernel mode instead? TTM already does 
delayed freeing if there are fences outstanding on a BO being freed. 
This should make it easy to delay freeing until the unmap is done 
without blocking the user mode thread.


Regards,
  Felix




3. All VM operations requested by userspace will still be executed in 
order, e.g. we can't run unmap + map in parallel or something like this.


Is that something you guys can live with? As far as I can see it 
should give you the maximum freedom possible, but is still doable.


Regards,
Christian.


Explicit VM updates

2022-06-01 Thread Christian König

Hey guys,

so today Bas came up with a new requirement regarding the explicit 
synchronization to VM updates and a bunch of prototype patches.


I've been thinking about that stuff for quite some time before, but to 
be honest it's one of the trickiest parts of the driver.


So my current thinking is that we could potentially handle those 
requirements like this:


1. We add some new EXPLICIT flag to context (or CS?) and VM IOCTL. This 
way we either get the new behavior for the whole CS+VM or the old one, 
but never both mixed.


2. When memory is unmapped we keep around the last unmap operation 
inside the bo_va.


3. When memory is freed we add all the CS fences which could access that 
memory + the last unmap operation as BOOKKEEP fences to the BO and as 
mandatory sync fence to the VM.


Memory freed either because of an eviction or because of userspace 
closing the handle will be seen as a combination of unmap+free.
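
A rough sketch of the BOOKKEEP part of point 3, using the dma_resv usage 
flags (track_free_after_unmap() is a made-up helper; the real change would 
live in amdgpu's bo_va/VM teardown and would also feed the fence into the 
VM's mandatory sync):

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>

/* On free, remember the fence of the last unmap as a BOOKKEEP fence on the
 * BO, so the backing memory cannot be reused before the unmap (and the TLB
 * flush following it) completed, without forcing implicit sync on other CS. */
static int track_free_after_unmap(struct dma_resv *bo_resv,
				  struct dma_fence *last_unmap_fence)
{
	int r;

	r = dma_resv_lock(bo_resv, NULL);
	if (r)
		return r;

	r = dma_resv_reserve_fences(bo_resv, 1);
	if (!r)
		dma_resv_add_fence(bo_resv, last_unmap_fence,
				   DMA_RESV_USAGE_BOOKKEEP);

	dma_resv_unlock(bo_resv);
	return r;
}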



The result is the following semantic for userspace to avoid implicit 
synchronization as much as possible:


1. When you allocate and map memory it is mandatory to either wait for 
the mapping operation to complete or to add it as dependency for your CS.
    If this isn't followed the application will run into CS faults 
(that's what we pretty much already implemented).


2. When memory is freed you must unmap that memory first and then wait 
for this unmap operation to complete before freeing the memory.
    If this isn't followed the kernel will add a forced wait to the 
next CS to block until the unmap is completed.


3. All VM operations requested by userspace will still be executed in 
order, e.g. we can't run unmap + map in parallel or something like this.
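
As a concrete reading of rules 1 and 2 from the userspace side, a hedged 
sketch built on libdrm_amdgpu: va_unmap_explicit() is a hypothetical wrapper 
that returns a drm_syncobj handle for the page-table update (today's 
amdgpu_bo_va_op() returns no fence), while amdgpu_cs_syncobj_wait() and 
amdgpu_bo_free() are existing API. Rule 1 would be the mirror image, either 
waiting on the map's syncobj or passing it as an in-dependency of the CS.

#include <stdint.h>
#include <amdgpu.h>

/* Hypothetical wrapper: unmap a VA range and hand back a syncobj that
 * signals once the page-table update completed. */
int va_unmap_explicit(amdgpu_device_handle dev, amdgpu_bo_handle bo,
		      uint64_t va, uint64_t size, uint32_t *out_syncobj);

/* Rule 2: unmap first, wait for the unmap, only then drop the handle. */
static int free_bo_explicit(amdgpu_device_handle dev, amdgpu_bo_handle bo,
			    uint64_t va, uint64_t size)
{
	uint32_t unmap_syncobj;
	int r;

	r = va_unmap_explicit(dev, bo, va, size, &unmap_syncobj);
	if (r)
		return r;

	r = amdgpu_cs_syncobj_wait(dev, &unmap_syncobj, 1, INT64_MAX, 0, NULL);
	if (r)
		return r;

	return amdgpu_bo_free(bo);
}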


Is that something you guys can live with? As far as I can see it should 
give you the maximum freedom possible, but is still doable.


Regards,
Christian.