Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Christian König

On 23.06.21 at 17:12, Daniel Vetter wrote:

On Wed, Jun 23, 2021 at 05:07:17PM +0200, Christian König wrote:

On 23.06.21 at 17:03, Daniel Vetter wrote:

On Wed, Jun 23, 2021 at 04:58:27PM +0200, Bas Nieuwenhuizen wrote:

On Wed, Jun 23, 2021 at 4:50 PM Daniel Vetter wrote:

On Wed, Jun 23, 2021 at 4:02 PM Christian König wrote:

On 23.06.21 at 15:49, Daniel Vetter wrote:

On Wed, Jun 23, 2021 at 3:44 PM Christian König wrote:

On 23.06.21 at 15:38, Bas Nieuwenhuizen wrote:

On Wed, Jun 23, 2021 at 2:59 PM Christian König wrote:

On 23.06.21 at 14:18, Daniel Vetter wrote:

On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen wrote:

On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter wrote:

WARNING: Absolutely untested beyond "gcc isn't dying in agony".

Implicit fencing done properly needs to treat the implicit fencing
slots like a funny kind of IPC mailbox. In other words it needs to be
explicit. This is the only way it will mesh well with explicit
fencing userspace like vk, and it's also the bare minimum required to
be able to manage anything else that wants to use the same buffer on
multiple engines in parallel, and still be able to share it through
implicit sync.

amdgpu completely lacks such an uapi. Fix this.

Luckily the concept of ignoring implicit fences exists already, and
takes care of all the complexities of making sure that non-optional
fences (like bo moves) are not ignored. This support was added in

commit 177ae09b5d699a5ebd1cafcee78889db968abf54
Author: Andres Rodriguez 
Date:   Fri Sep 15 20:44:06 2017 -0400

 drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2

Unfortunately it's the wrong semantics, because it's a bo flag and
disables implicit sync on an allocated buffer completely.
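
For reference, a minimal sketch of that existing per-BO opt-out, using
libdrm's amdgpu helpers (error handling elided):

#include <stdint.h>
#include <amdgpu.h>      /* libdrm amdgpu helpers */
#include <amdgpu_drm.h>  /* AMDGPU_GEM_CREATE_EXPLICIT_SYNC */

/* Allocate a BO that opts out of implicit sync entirely - the
 * all-or-nothing per-buffer semantics described above. */
static int alloc_explicit_sync_bo(amdgpu_device_handle dev, uint64_t size,
				  amdgpu_bo_handle *bo)
{
	struct amdgpu_bo_alloc_request req = {
		.alloc_size = size,
		.phys_alignment = 4096,
		.preferred_heap = AMDGPU_GEM_DOMAIN_VRAM,
		.flags = AMDGPU_GEM_CREATE_EXPLICIT_SYNC,
	};

	return amdgpu_bo_alloc(dev, &req, bo);
}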

We _do_ want implicit sync, but control it explicitly. For this we
need a flag on the drm_file, so that a given userspace (like vulkan)
can manage the implicit sync slots explicitly. The other side of the
pipeline (compositor, other process or just different stage in a media
pipeline in the same process) can then either do the same, or fully
participate in the implicit sync as implemented by the kernel by
default.

By building on the existing flag for buffers we avoid opening up any
additional security concerns - anything this new flag allows was
already possible through the existing bo flag.

All drivers which support this concept of a userspace-specific
opt-out of implicit sync have a flag in their CS ioctl, but in reality
that turned out to be a bit too inflexible. See the discussion below,
let's try to do a bit better for amdgpu.

This alone only allows us to completely avoid any stalls due to
implicit sync; it does not yet allow us to use implicit sync as a
strange form of IPC for sync_file.

For that we need two more pieces:

- a way to get the current implicit sync fences out of a buffer. Could
   be done in a driver ioctl, but everyone needs this, and generally a
   dma-buf is involved anyway to establish the sharing. So an ioctl on
   the dma-buf makes a ton more sense:

   https://lore.kernel.org/dri-devel/20210520190007.534046-4-jason@jlekstrand.net/

   Current drivers in upstream solve this by having the opt-out flag
   on their CS ioctl. This has the downside that very often the CS
   which must actually stall for the implicit fence is run a while
   after the implicit fence point was logically sampled per the api
   spec (vk passes an explicit syncobj around for that afaiui), and so
   results in oversync. Converting the implicit sync fences into a
   snapshot sync_file is actually accurate.

- Similarly we need to be able to set the exclusive implicit fence.
   Current drivers again do this with a CS ioctl flag, with again the
   same problem that by the time the CS happens additional dependencies
   have been added. An explicit ioctl to only insert a sync_file (while
   respecting the rules for how exclusive and shared fence slots must
   be updated in struct dma_resv) is much better. This is proposed here:

   https://lore.kernel.org/dri-devel/20210520190007.534046-5-jason@jlekstrand.net/

These three pieces together allow userspace to fully control implicit
fencing and remove all unnecessary stall points due to them.
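
As a rough illustration of how those two proposed dma-buf ioctls fit
together: the sketch below uses the ioctl and struct names from the
linked series (DMA_BUF_IOCTL_EXPORT_SYNC_FILE /
DMA_BUF_IOCTL_IMPORT_SYNC_FILE); they are still an RFC and may change.

#include <sys/ioctl.h>
#include <linux/dma-buf.h>  /* assumes the proposed sync_file ioctls applied */

/* Snapshot the buffer's current implicit fences into a sync_file fd.
 * Returns the sync_file fd, or -1 on error. */
static int export_implicit_fences(int dmabuf_fd)
{
	struct dma_buf_export_sync_file args = {
		.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_WRITE,
		.fd = -1,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
		return -1;
	return args.fd;
}

/* Make a sync_file the buffer's exclusive (write) fence, so everyone
 * still using kernel-managed implicit sync will wait on it. */
static int import_exclusive_fence(int dmabuf_fd, int sync_file_fd)
{
	struct dma_buf_import_sync_file args = {
		.flags = DMA_BUF_SYNC_WRITE,
		.fd = sync_file_fd,
	};

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}

A Vulkan WSI layer would export at the point where the api logically
samples the fence (e.g. at QueuePresent) and import the exclusive
fence for the rendering it actually submitted, instead of having the
kernel sample fences at CS time - which is exactly the oversync
described above.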

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Christian König

On 23.06.21 at 17:03, Daniel Vetter wrote:

[...]

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Christian König

On 23.06.21 at 15:49, Daniel Vetter wrote:

[...]

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Christian König

On 23.06.21 at 15:38, Bas Nieuwenhuizen wrote:

[...]

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Christian König

On 23.06.21 at 14:18, Daniel Vetter wrote:

[...]

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Daniel Vetter
On Wed, Jun 23, 2021 at 05:07:17PM +0200, Christian König wrote:
> [...]

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Daniel Vetter
On Wed, Jun 23, 2021 at 04:58:27PM +0200, Bas Nieuwenhuizen wrote:
> [...]

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Bas Nieuwenhuizen
On Wed, Jun 23, 2021 at 4:50 PM Daniel Vetter wrote:
> [...]

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Daniel Vetter
On Wed, Jun 23, 2021 at 4:02 PM Christian König wrote:
> [...]

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Daniel Vetter
On Wed, Jun 23, 2021 at 3:44 PM Christian König wrote:
> [...]

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Bas Nieuwenhuizen
On Wed, Jun 23, 2021 at 2:59 PM Christian König wrote:
> [...]

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Daniel Vetter
On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen wrote:
>
> On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter wrote:
> >
> > [...]
> >
> > Well, as much as the implicit fencing model fundamentally allows:
> > There is only one set of fences, you can only choose to sync against
> > only writers (exclusive slot), or everyone. Hence suballocating
> > multiple buffers or anything else like this is fundamentally not
> > possible, and can only be fixed by a proper explicit fencing model.
> >
> > Aside from that caveat this model gets implicit fencing as closely to
> > explicit fencing semantics as possible:
> >
> > On the actual implementation I opted for a simple setparam ioctl, no
> > locking (just atomic reads/writes) for simplicity. There is a nice
> > flag parameter in the VM ioctl which we could use, except:
> > - it's not checked, so userspace likely passes garbage
> > - there's already a comment that userspace _does_ pass garbage in the
> >   priority field
> > So yeah unfortunately this flag parameter for setting vm flags is
> > useless, and we need to hack up a new one.
> >
> > v2: Explain why a new SETPARAM (Jason)
> >
> > v3: Bas noticed I forgot to hook up the dependency-side shortcut. We
> > need both, or 

Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-23 Thread Bas Nieuwenhuizen
On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter  wrote:
>
> [patch description snipped -- quoted in full in the original post below]

So I think there is still a case missing in this implementation.
Consider these 3 cases

(format: a->b: b waits on a. Yes, I know arrows are hard)


[Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

2021-06-22 Thread Daniel Vetter
WARNING: Absolutely untested beyond "gcc isn't dying in agony".

Implicit fencing done properly needs to treat the implicit fencing
slots like a funny kind of IPC mailbox. In other words it needs to be
explicit. This is the only way it will mesh well with explicit
fencing userspace like vk, and it's also the bare minimum required to
be able to manage anything else that wants to use the same buffer on
multiple engines in parallel, and still be able to share it through
implicit sync.

amdgpu completely lacks such an uapi. Fix this.

Luckily the concept of ignoring implicit fences exists already, and
takes care of all the complexities of making sure that non-optional
fences (like bo moves) are not ignored. This support was added in

commit 177ae09b5d699a5ebd1cafcee78889db968abf54
Author: Andres Rodriguez 
Date:   Fri Sep 15 20:44:06 2017 -0400

drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2

Unfortunately it's the wrong semantics, because it's a bo flag and
disables implicit sync on an allocated buffer completely.
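
For reference, the existing per-buffer opt-out is used roughly like this
from userspace (a minimal sketch against the amdgpu uapi header; error
handling and the rest of device setup are omitted):

#include <sys/ioctl.h>
#include <amdgpu_drm.h>	/* kernel uapi header, shipped via libdrm */

/* Allocate a bo that is permanently excluded from implicit sync.
 * The flag is all-or-nothing for the lifetime of the buffer, which
 * is exactly the inflexibility discussed above. */
static int alloc_explicit_sync_bo(int drm_fd, __u64 size)
{
	union drm_amdgpu_gem_create args = {
		.in = {
			.bo_size = size,
			.alignment = 4096,
			.domains = AMDGPU_GEM_DOMAIN_VRAM,
			.domain_flags = AMDGPU_GEM_CREATE_EXPLICIT_SYNC,
		},
	};

	if (ioctl(drm_fd, DRM_IOCTL_AMDGPU_GEM_CREATE, &args))
		return -1;

	return args.out.handle;
}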

We _do_ want implicit sync, but control it explicitly. For this we
need a flag on the drm_file, so that a given userspace (like vulkan)
can manage the implicit sync slots explicitly. The other side of the
pipeline (compositor, other process or just different stage in a media
pipeline in the same process) can then either do the same, or fully
participate in the implicit sync as implemented by the kernel by
default.

By building on the existing flag for buffers we avoid any issues with
opening up additional security concerns - anything this new flag here
allows is already.

All drivers which support this concept of a userspace-specific
opt-out of implicit sync have a flag in their CS ioctl, but in reality
that turned out to be a bit too inflexible. See the discussion below,
let's try to do a bit better for amdgpu.

This alone only allows us to completely avoid any stalls due to
implicit sync, it does not yet allow us to use implicit sync as a
strange form of IPC for sync_file.

For that we need two more pieces:

- a way to get the current implicit sync fences out of a buffer. Could
  be done in a driver ioctl, but everyone needs this, and generally a
  dma-buf is involved anyway to establish the sharing. So an ioctl on
  the dma-buf makes a ton more sense:

  
  https://lore.kernel.org/dri-devel/20210520190007.534046-4-ja...@jlekstrand.net/

  Current drivers in upstream solve this by having the opt-out flag
  on their CS ioctl. This has the downside that very often the CS
  which must actually stall for the implicit fence is run a while
  after the implicit fence point was logically sampled per the api
  spec (vk passes an explicit syncobj around for that afaiui), and so
  results in oversync. Converting the implicit sync fences into a
  snapshot sync_file is actually accurate.

- Similarly, we need to be able to set the exclusive implicit fence.
  Current drivers again do this with a CS ioctl flag, with again the
  same problem that by the time the CS happens, additional dependencies
  have been added. An explicit ioctl to only insert a sync_file (while
  respecting the rules for how exclusive and shared fence slots must
  be updated in struct dma_resv) is much better, and a usage sketch
  for both ioctls follows this list. This is proposed here:

  
  https://lore.kernel.org/dri-devel/20210520190007.534046-5-ja...@jlekstrand.net/
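
A rough userspace sketch of how these two proposed ioctls combine; the
struct and ioctl names below follow Jason's series but are assumptions
here, since the uapi was still under review at the time:

#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/dma-buf.h>	/* DMA_BUF_BASE, DMA_BUF_SYNC_* */

/* Proposed uapi shape, inlined here in case installed headers predate
 * the series; treat as illustrative, not final. */
struct dma_buf_export_sync_file {
	__u32 flags;	/* DMA_BUF_SYNC_READ / WRITE / RW */
	__s32 fd;	/* out: snapshot sync_file */
};
struct dma_buf_import_sync_file {
	__u32 flags;
	__s32 fd;	/* in: sync_file to add to the fence slots */
};
#define DMA_BUF_IOCTL_EXPORT_SYNC_FILE \
	_IOWR(DMA_BUF_BASE, 2, struct dma_buf_export_sync_file)
#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE \
	_IOW(DMA_BUF_BASE, 3, struct dma_buf_import_sync_file)

/* Snapshot the current implicit fences of a dma-buf as a sync_file. */
static int export_implicit_fences(int dmabuf_fd)
{
	struct dma_buf_export_sync_file arg = {
		.flags = DMA_BUF_SYNC_RW,	/* readers and writers */
		.fd = -1,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg))
		return -1;
	return arg.fd;	/* caller owns the sync_file fd */
}

/* Publish an explicit fence back into the implicit sync slots. */
static int import_sync_file(int dmabuf_fd, int sync_file_fd)
{
	struct dma_buf_import_sync_file arg = {
		.flags = DMA_BUF_SYNC_WRITE,	/* install as write fence */
		.fd = sync_file_fd,
	};

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
}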

These three pieces together allow userspace to fully control implicit
fencing and remove all unnecessary stall points due to them.

Well, as much as the implicit fencing model fundamentally allows:
There is only one set of fences, you can only choose to sync against
only writers (exclusive slot), or everyone. Hence suballocating
multiple buffers or anything else like this is fundamentally not
possible, and can only be fixed by a proper explicit fencing model.
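
To make that limitation concrete, here is a sketch of the kernel-side
dependency collection, written against the v5.12-era dma_resv helpers
(these were being renamed around this time, so take the helper names as
a snapshot and the surrounding logic as illustrative):

#include <linux/dma-resv.h>
#include <linux/dma-fence.h>

/* Which implicit fences a new job must wait on: the only choice is
 * "last writer only" vs "everyone", per buffer. */
static int collect_implicit_deps(struct dma_resv *resv, bool is_write,
				 struct dma_fence **excl,
				 unsigned int *shared_count,
				 struct dma_fence ***shared)
{
	if (!is_write) {
		/* Readers only wait for the exclusive (writer) fence. */
		*excl = dma_resv_get_excl_rcu(resv);
		*shared_count = 0;
		*shared = NULL;
		return 0;
	}

	/* Writers wait for the exclusive fence plus all shared fences.
	 * There is no finer granularity, hence no way to express e.g.
	 * "only the job touching my sub-range of the buffer". */
	return dma_resv_get_fences_rcu(resv, excl, shared_count, shared);
}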

Aside from that caveat this model gets implicit fencing as closely to
explicit fencing semantics as possible:

On the actual implementation I opted for a simple setparam ioctl, no
locking (just atomic reads/writes) for simplicity. There is a nice
flag parameter in the VM ioctl which we could use, except:
- it's not checked, so userspace likely passes garbage
- there's already a comment that userspace _does_ pass garbage in the
  priority field
So yeah unfortunately this flag parameter for setting vm flags is
useless, and we need to hack up a new one.
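
For illustration only, the new interface could look something like the
sketch below; the struct, flag and ioctl names are hypothetical
stand-ins, since the actual diff is not reproduced in this mail:

/* Hypothetical uapi addition, amdgpu_drm.h style: */
struct drm_amdgpu_setparam {
	__u64 param;	/* which knob to set */
	__u64 value;	/* 0 = default implicit sync, 1 = opt out */
};

#define AMDGPU_SETPARAM_NO_IMPLICIT_SYNC	1

/* Userspace (e.g. a Vulkan driver) would opt out once per drm_file:
 *
 *	struct drm_amdgpu_setparam sp = {
 *		.param = AMDGPU_SETPARAM_NO_IMPLICIT_SYNC,
 *		.value = 1,
 *	};
 *	ioctl(drm_fd, DRM_IOCTL_AMDGPU_SETPARAM, &sp);
 *
 * Kernel side this is just an atomic write to a flag on the file
 * private, and the CS path reads it atomically, matching the
 * lockless approach described above. */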

v2: Explain why a new SETPARAM (Jason)

v3: Bas noticed I forgot to hook up the dependency-side shortcut. We
need both, or this doesn't do much.

v4: Rebase over the amdgpu patch to always set the implicit sync
fences.

Cc: mesa-dev@lists.freedesktop.org
Cc: Bas Nieuwenhuizen 
Cc: Dave Airlie 
Cc: Rob Clark 
Cc: Kristian H. Kristensen 
Cc: Michel Dänzer 
Cc: Daniel Stone 
Cc: Sumit Semwal 
Cc: "Christian König" 
Cc: Alex Deucher 
Cc: Daniel Vetter 
Cc: Deepak R Varma 
Cc: Chen Li 
Cc: Kevin Wang 
Cc: Dennis Li 
Cc: Luben Tuikov 
Cc: linaro-mm-...@lists.linaro.org
Signed-off-by: Daniel Vetter 
---