Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi
On 23.06.21 at 14:18, Daniel Vetter wrote: On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen wrote: On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter wrote:

WARNING: Absolutely untested beyond "gcc isn't dying in agony".

Implicit fencing done properly needs to treat the implicit fencing slots like a funny kind of IPC mailbox. In other words it needs to be explicit. This is the only way it will mesh well with explicit fencing userspace like vk, and it's also the bare minimum required to be able to manage anything else that wants to use the same buffer on multiple engines in parallel, and still be able to share it through implicit sync.

amdgpu completely lacks such an uapi. Fix this.

Luckily the concept of ignoring implicit fences exists already, and takes care of all the complexities of making sure that non-optional fences (like bo moves) are not ignored. This support was added in

commit 177ae09b5d699a5ebd1cafcee78889db968abf54
Author: Andres Rodriguez
Date: Fri Sep 15 20:44:06 2017 -0400

    drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2

Unfortunately it's the wrong semantics, because it's a bo flag and disables implicit sync on an allocated buffer completely.

We _do_ want implicit sync, but control it explicitly. For this we need a flag on the drm_file, so that a given userspace (like vulkan) can manage the implicit sync slots explicitly. The other side of the pipeline (compositor, other process or just a different stage in a media pipeline in the same process) can then either do the same, or fully participate in the implicit sync as implemented by the kernel by default.

By building on the existing flag for buffers we avoid any issues with opening up additional security concerns - anything this new flag allows is already possible.

All drivers which support this concept of a userspace-specific opt-out of implicit sync have a flag in their CS ioctl, but in reality that turned out to be a bit too inflexible. See the discussion below; let's try to do a bit better for amdgpu.

This alone only allows us to completely avoid any stalls due to implicit sync; it does not yet allow us to use implicit sync as a strange form of IPC for sync_file.

For that we need two more pieces:

- a way to get the current implicit sync fences out of a buffer. Could be done in a driver ioctl, but everyone needs this, and generally a dma-buf is involved anyway to establish the sharing. So an ioctl on the dma-buf makes a ton more sense:

  https://lore.kernel.org/dri-devel/20210520190007.534046-4-jason@jlekstrand.net/

  Current drivers in upstream solve this by having the opt-out flag on their CS ioctl. This has the downside that very often the CS which must actually stall for the implicit fence runs a while after the implicit fence point was logically sampled per the api spec (vk passes an explicit syncobj around for that afaiui), and so results in oversync. Converting the implicit sync fences into a snapshot sync_file is actually accurate.

- Similarly, we need to be able to set the exclusive implicit fence. Current drivers again do this with a CS ioctl flag, with again the same problem that by the time the CS happens additional dependencies have been added. An explicit ioctl to only insert a sync_file (while respecting the rules for how exclusive and shared fence slots must be updated in struct dma_resv) is much better. This is proposed here:

  https://lore.kernel.org/dri-devel/20210520190007.534046-5-jason@jlekstrand.net/

These three pieces together allow userspace to fully control implicit fencing and remove all unnecessary stall points due to them.

Well, as much as the implicit fencing model fundamentally allows: there is only one set of fences, and you can only choose to sync against only writers (exclusive slot), or everyone. Hence suballocating multiple buffers or anything else like this is fundamentally not possible, and can only be fixed by a proper explicit fencing model.

Aside from that caveat this model gets implicit fencing as close to explicit fencing semantics as possible.

On the actual implementation I opted for a simple setparam ioctl, no locking (just atomic reads/writes) for simplicity. There is a nice flag
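For reference, a sketch of the userspace flow the two linked dma-buf ioctls enable. The struct and ioctl names below are the ones that eventually landed upstream; at the time of this thread they were still an RFC, and the two wrapper functions are invented for illustration, so treat this as a sketch of the intended flow rather than the series' final uapi.

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Snapshot the buffer's current implicit fences into a sync_file.
 * DMA_BUF_SYNC_RW asks for both the exclusive (write) fence and all
 * shared (read) fences; DMA_BUF_SYNC_WRITE would return writers only. */
static int export_implicit_fences(int dmabuf_fd)
{
	struct dma_buf_export_sync_file args = {
		.flags = DMA_BUF_SYNC_RW,
		.fd = -1,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
		return -1;
	return args.fd; /* caller owns the new sync_file fd */
}

/* Install a sync_file as the buffer's exclusive implicit fence, so a
 * fully-explicit client can still hand a fence to implicit-sync
 * consumers (e.g. a compositor). */
static int import_exclusive_fence(int dmabuf_fd, int sync_file_fd)
{
	struct dma_buf_import_sync_file args = {
		.flags = DMA_BUF_SYNC_WRITE,
		.fd = sync_file_fd,
	};

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}

Together these give exactly the "implicit sync as IPC mailbox" model the commit message describes: sample the fences at the api-defined point, rather than whenever the next CS happens to run.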
[Mesa-dev] [PATCH 03/15] dma-buf: Document dma-buf implicit fencing/resv fencing rules
Docs for struct dma_resv are fairly clear:

"A reservation object can have attached one exclusive fence (normally associated with write operations) or N shared fences (read operations)."

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#reservation-objects

Furthermore, a review across all of upstream follows. First, render drivers and how they set implicit fences:

- nouveau follows this contract, see in validate_fini_no_ticket()

    nouveau_bo_fence(nvbo, fence, !!b->write_domains);

  and that last boolean controls whether the exclusive or shared fence slot is used.

- radeon follows this contract by setting

    p->relocs[i].tv.num_shared = !r->write_domain;

  in radeon_cs_parser_relocs(), which ensures that the call to ttm_eu_fence_buffer_objects() in radeon_cs_parser_fini() will do the right thing.

- vmwgfx seems to follow this contract with the shotgun approach of always setting ttm_val_buf->num_shared = 0, which means ttm_eu_fence_buffer_objects() will only use the exclusive slot.

- etnaviv follows this contract, as can be trivially seen by looking at submit_attach_object_fences().

- i915 is a bit of a convoluted maze with multiple paths leading to i915_vma_move_to_active(), which sets the exclusive flag if EXEC_OBJECT_WRITE is set. This can either come as a buffer flag for softpin mode, or through the write_domain when using relocations. It follows this contract.

- lima follows this contract, see lima_gem_submit(), which sets the exclusive fence when the LIMA_SUBMIT_BO_WRITE flag is set for that bo.

- msm follows this contract, see msm_gpu_submit(), which sets the exclusive flag when MSM_SUBMIT_BO_WRITE is set for that buffer.

- panfrost follows this contract with the shotgun approach of just always setting the exclusive fence, see panfrost_attach_object_fences(). Benefits of a single engine I guess.

- v3d follows this contract with the same shotgun approach in v3d_attach_fences_and_unlock_reservation(), but it has at least an XXX comment that maybe this should be improved.

- vc4 uses the same shotgun approach of always setting an exclusive fence, see vc4_update_bo_seqnos().

- vgem also follows this contract, see vgem_fence_attach_ioctl() and VGEM_FENCE_WRITE. This is used in some igts to validate prime sharing with i915.ko without the need of a 2nd gpu.

- virtio follows this contract again with the shotgun approach of always setting an exclusive fence, see virtio_gpu_array_add_fence().

This covers the setting of the exclusive fences when writing.
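A minimal sketch of the attach-side contract being audited above, using the dma_resv API as it existed around the time of this thread; the driver plumbing around it is hypothetical, and the caller is assumed to hold the reservation lock.

#include <linux/dma-resv.h>

static int toy_attach_fence(struct dma_resv *resv, struct dma_fence *fence,
			    bool write)
{
	int ret;

	dma_resv_assert_held(resv);

	if (write) {
		/* Writers go in the single exclusive slot. */
		dma_resv_add_excl_fence(resv, fence);
	} else {
		/* Readers go in one of the N shared slots, which must
		 * be reserved before a fence can be added. */
		ret = dma_resv_reserve_shared(resv, 1);
		if (ret)
			return ret;
		dma_resv_add_shared_fence(resv, fence);
	}
	return 0;
}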
Synchronizing against the exclusive fence is a lot more tricky, and I only spot checked a few:

- i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all implicit dependencies (which is used by vulkan).

- etnaviv does this. Implicit dependencies are collected in submit_fence_sync(), again with an opt-out flag ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in etnaviv_sched_dependency(), which is the drm_sched_backend_ops->dependency callback.

- vc4 seems to not do much here, maybe gets away with it by not having a scheduler and only a single engine. Since all newer broadcom chips than the OG vc4 use v3d for rendering, which follows this contract, the impact of this issue is fairly small.

- v3d does this using the drm_gem_fence_array_add_implicit() helper, which its drm_sched_backend_ops->dependency callback v3d_job_dependency() then picks up.

- panfrost is nice here and tracks the implicit fences in panfrost_job->implicit_fences, which again the drm_sched_backend_ops->dependency callback panfrost_job_dependency() picks up. It is mildly questionable though, since it only picks up exclusive fences in panfrost_acquire_object_fences(); this is not buggy in practice because it also always sets the exclusive fence. It should pick up both sets of fences, just in case there's ever going to be a 2nd gpu in a SoC with a mali gpu. Or maybe a mali SoC with a pcie port and a real gpu, which might actually happen eventually. A bug, but easy to fix. Should probably use the drm_gem_fence_array_add_implicit() helper.

- lima is nice and easy, uses drm_gem_fence_array_add_implicit() and the same schema as v3d.

- msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT, but because it doesn't use the drm/scheduler it handles fences from the wrong context with a synchronous dma_fence_wait. See submit_fence_sync() leading to msm_gem_sync_object(). Investing into a scheduler might be a good idea.

- all the remaining drivers are ttm based, where I hope they do appropriately obey implicit fences already. I didn't do the full audit there because a) not following the contract would confuse ttm quite well and b) reading non-standard scheduler and submit code which isn't based on drm/scheduler is a pain.

Onwards to the display side.

- Any driver using the drm_gem_plane_helper_prepare_fb() helper will correctly.
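And the matching dependency-collection side, sketched with the drm_gem_fence_array_add_implicit() helper named above (the v3d/lima scheme, as the helper existed at the time of this thread); struct toy_job is invented for illustration.

#include <linux/xarray.h>
#include <drm/drm_gem.h>

struct toy_job {
	struct xarray deps;	/* fences the scheduler must wait on */
};

static int toy_job_add_implicit_deps(struct toy_job *job,
				     struct drm_gem_object *obj,
				     bool write)
{
	/* For a read-only job we only need to wait for the exclusive
	 * (writer) fence; a writing job must additionally wait for all
	 * shared (reader) fences. The 'write' flag selects between the
	 * two, and the collected fences are later returned one by one
	 * from the drm_sched_backend_ops->dependency callback. */
	return drm_gem_fence_array_add_implicit(&job->deps, obj, write);
}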
Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi
On Wed, Jun 23, 2021 at 05:07:17PM +0200, Christian König wrote: > Am 23.06.21 um 17:03 schrieb Daniel Vetter: > > On Wed, Jun 23, 2021 at 04:58:27PM +0200, Bas Nieuwenhuizen wrote: > > > On Wed, Jun 23, 2021 at 4:50 PM Daniel Vetter > > > wrote: > > > > On Wed, Jun 23, 2021 at 4:02 PM Christian König > > > > wrote: > > > > > Am 23.06.21 um 15:49 schrieb Daniel Vetter: > > > > > > On Wed, Jun 23, 2021 at 3:44 PM Christian König > > > > > > wrote: > > > > > > > Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen: > > > > > > > > On Wed, Jun 23, 2021 at 2:59 PM Christian König > > > > > > > > wrote: > > > > > > > > > Am 23.06.21 um 14:18 schrieb Daniel Vetter: > > > > > > > > > > On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen > > > > > > > > > > wrote: > > > > > > > > > > > On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter > > > > > > > > > > > wrote: > > > > > > > > > > > > WARNING: Absolutely untested beyond "gcc isn't dying in > > > > > > > > > > > > agony". > > > > > > > > > > > > > > > > > > > > > > > > Implicit fencing done properly needs to treat the > > > > > > > > > > > > implicit fencing > > > > > > > > > > > > slots like a funny kind of IPC mailbox. In other words > > > > > > > > > > > > it needs to be > > > > > > > > > > > > explicitly. This is the only way it will mesh well with > > > > > > > > > > > > explicit > > > > > > > > > > > > fencing userspace like vk, and it's also the bare > > > > > > > > > > > > minimum required to > > > > > > > > > > > > be able to manage anything else that wants to use the > > > > > > > > > > > > same buffer on > > > > > > > > > > > > multiple engines in parallel, and still be able to > > > > > > > > > > > > share it through > > > > > > > > > > > > implicit sync. > > > > > > > > > > > > > > > > > > > > > > > > amdgpu completely lacks such an uapi. Fix this. > > > > > > > > > > > > > > > > > > > > > > > > Luckily the concept of ignoring implicit fences exists > > > > > > > > > > > > already, and > > > > > > > > > > > > takes care of all the complexities of making sure that > > > > > > > > > > > > non-optional > > > > > > > > > > > > fences (like bo moves) are not ignored. This support > > > > > > > > > > > > was added in > > > > > > > > > > > > > > > > > > > > > > > > commit 177ae09b5d699a5ebd1cafcee78889db968abf54 > > > > > > > > > > > > Author: Andres Rodriguez > > > > > > > > > > > > Date: Fri Sep 15 20:44:06 2017 -0400 > > > > > > > > > > > > > > > > > > > > > > > > drm/amdgpu: introduce > > > > > > > > > > > > AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2 > > > > > > > > > > > > > > > > > > > > > > > > Unfortuantely it's the wrong semantics, because it's a > > > > > > > > > > > > bo flag and > > > > > > > > > > > > disables implicit sync on an allocated buffer > > > > > > > > > > > > completely. > > > > > > > > > > > > > > > > > > > > > > > > We _do_ want implicit sync, but control it explicitly. > > > > > > > > > > > > For this we > > > > > > > > > > > > need a flag on the drm_file, so that a given userspace > > > > > > > > > > > > (like vulkan) > > > > > > > > > > > > can manage the implicit sync slots explicitly. The > > > > > > > > > > > > other side of the > > > > > > > > > > > > pipeline (compositor, other process or just different > > > > > > > > > > > > stage in a media > > > > > > > > > > > > pipeline in the same process) can then either do the > > > > > > > > > > > > same, or fully > > > > > > > > > > > > participate in the implicit sync as implemented by the > > > > > > > > > > > > kernel by > > > > > > > > > > > > default. 
> > > > > > > > > > > > > > > > > > > > > > > > By building on the existing flag for buffers we avoid > > > > > > > > > > > > any issues with > > > > > > > > > > > > opening up additional security concerns - anything this > > > > > > > > > > > > new flag here > > > > > > > > > > > > allows is already. > > > > > > > > > > > > > > > > > > > > > > > > All drivers which supports this concept of a > > > > > > > > > > > > userspace-specific > > > > > > > > > > > > opt-out of implicit sync have a flag in their CS ioctl, > > > > > > > > > > > > but in reality > > > > > > > > > > > > that turned out to be a bit too inflexible. See the > > > > > > > > > > > > discussion below, > > > > > > > > > > > > let's try to do a bit better for amdgpu. > > > > > > > > > > > > > > > > > > > > > > > > This alone only allows us to completely avoid any > > > > > > > > > > > > stalls due to > > > > > > > > > > > > implicit sync, it does not yet allow us to use implicit > > > > > > > > > > > > sync as a > > > > > > > > > > > > strange form of IPC for sync_file. > > > > > > > > > > > > > > > > > > > > > > > > For that we need two more pieces: > > > > > > > > > > > > > > > > > > > > > > > > - a way to get the current implicit sync fences out of > > > > > > > > > > > > a buffer. Could > > > > > > > > > > > > be done in a driver ioctl, but everyone needs > > > > > > > > > > > > this, and generally a > > > > > > >
Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi
On Wed, Jun 23, 2021 at 04:58:27PM +0200, Bas Nieuwenhuizen wrote: > On Wed, Jun 23, 2021 at 4:50 PM Daniel Vetter wrote: > > > > On Wed, Jun 23, 2021 at 4:02 PM Christian König > > wrote: > > > > > > Am 23.06.21 um 15:49 schrieb Daniel Vetter: > > > > On Wed, Jun 23, 2021 at 3:44 PM Christian König > > > > wrote: > > > >> Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen: > > > >>> On Wed, Jun 23, 2021 at 2:59 PM Christian König > > > >>> wrote: > > > Am 23.06.21 um 14:18 schrieb Daniel Vetter: > > > > On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen > > > > wrote: > > > >> On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter > > > >> wrote: > > > >>> WARNING: Absolutely untested beyond "gcc isn't dying in agony". > > > >>> > > > >>> Implicit fencing done properly needs to treat the implicit fencing > > > >>> slots like a funny kind of IPC mailbox. In other words it needs > > > >>> to be > > > >>> explicitly. This is the only way it will mesh well with explicit > > > >>> fencing userspace like vk, and it's also the bare minimum > > > >>> required to > > > >>> be able to manage anything else that wants to use the same buffer > > > >>> on > > > >>> multiple engines in parallel, and still be able to share it > > > >>> through > > > >>> implicit sync. > > > >>> > > > >>> amdgpu completely lacks such an uapi. Fix this. > > > >>> > > > >>> Luckily the concept of ignoring implicit fences exists already, > > > >>> and > > > >>> takes care of all the complexities of making sure that > > > >>> non-optional > > > >>> fences (like bo moves) are not ignored. This support was added in > > > >>> > > > >>> commit 177ae09b5d699a5ebd1cafcee78889db968abf54 > > > >>> Author: Andres Rodriguez > > > >>> Date: Fri Sep 15 20:44:06 2017 -0400 > > > >>> > > > >>>drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2 > > > >>> > > > >>> Unfortuantely it's the wrong semantics, because it's a bo flag and > > > >>> disables implicit sync on an allocated buffer completely. > > > >>> > > > >>> We _do_ want implicit sync, but control it explicitly. For this we > > > >>> need a flag on the drm_file, so that a given userspace (like > > > >>> vulkan) > > > >>> can manage the implicit sync slots explicitly. The other side of > > > >>> the > > > >>> pipeline (compositor, other process or just different stage in a > > > >>> media > > > >>> pipeline in the same process) can then either do the same, or > > > >>> fully > > > >>> participate in the implicit sync as implemented by the kernel by > > > >>> default. > > > >>> > > > >>> By building on the existing flag for buffers we avoid any issues > > > >>> with > > > >>> opening up additional security concerns - anything this new flag > > > >>> here > > > >>> allows is already. > > > >>> > > > >>> All drivers which supports this concept of a userspace-specific > > > >>> opt-out of implicit sync have a flag in their CS ioctl, but in > > > >>> reality > > > >>> that turned out to be a bit too inflexible. See the discussion > > > >>> below, > > > >>> let's try to do a bit better for amdgpu. > > > >>> > > > >>> This alone only allows us to completely avoid any stalls due to > > > >>> implicit sync, it does not yet allow us to use implicit sync as a > > > >>> strange form of IPC for sync_file. > > > >>> > > > >>> For that we need two more pieces: > > > >>> > > > >>> - a way to get the current implicit sync fences out of a buffer. 
> > > >>> Could > > > >>> be done in a driver ioctl, but everyone needs this, and > > > >>> generally a > > > >>> dma-buf is involved anyway to establish the sharing. So an > > > >>> ioctl on > > > >>> the dma-buf makes a ton more sense: > > > >>> > > > >>> > > > >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210520190007.534046-4-jason%40jlekstrand.net%2Fdata=04%7C01%7Cchristian.koenig%40amd.com%7C83dbdd0a1eb8442cbf7108d9364db51e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637600529684040802%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=fbdwtutEj93anZp6Pshs277QoMTHZxIy0Yl54T95rCw%3Dreserved=0 > > > >>> > > > >>> Current drivers in upstream solves this by having the > > > >>> opt-out flag > > > >>> on their CS ioctl. This has the downside that very often the > > > >>> CS > > > >>> which must actually stall for the implicit fence is run a > > > >>> while > > > >>> after the implicit fence point was logically sampled per the > > > >>> api > > > >>> spec (vk passes an explicit syncobj around for that afaiui), > > > >>> and so > > > >>>
Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi
On Wed, Jun 23, 2021 at 4:50 PM Daniel Vetter wrote: > > On Wed, Jun 23, 2021 at 4:02 PM Christian König > wrote: > > > > Am 23.06.21 um 15:49 schrieb Daniel Vetter: > > > On Wed, Jun 23, 2021 at 3:44 PM Christian König > > > wrote: > > >> Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen: > > >>> On Wed, Jun 23, 2021 at 2:59 PM Christian König > > >>> wrote: > > Am 23.06.21 um 14:18 schrieb Daniel Vetter: > > > On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen > > > wrote: > > >> On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter > > >> wrote: > > >>> WARNING: Absolutely untested beyond "gcc isn't dying in agony". > > >>> > > >>> Implicit fencing done properly needs to treat the implicit fencing > > >>> slots like a funny kind of IPC mailbox. In other words it needs to > > >>> be > > >>> explicitly. This is the only way it will mesh well with explicit > > >>> fencing userspace like vk, and it's also the bare minimum required > > >>> to > > >>> be able to manage anything else that wants to use the same buffer on > > >>> multiple engines in parallel, and still be able to share it through > > >>> implicit sync. > > >>> > > >>> amdgpu completely lacks such an uapi. Fix this. > > >>> > > >>> Luckily the concept of ignoring implicit fences exists already, and > > >>> takes care of all the complexities of making sure that non-optional > > >>> fences (like bo moves) are not ignored. This support was added in > > >>> > > >>> commit 177ae09b5d699a5ebd1cafcee78889db968abf54 > > >>> Author: Andres Rodriguez > > >>> Date: Fri Sep 15 20:44:06 2017 -0400 > > >>> > > >>>drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2 > > >>> > > >>> Unfortuantely it's the wrong semantics, because it's a bo flag and > > >>> disables implicit sync on an allocated buffer completely. > > >>> > > >>> We _do_ want implicit sync, but control it explicitly. For this we > > >>> need a flag on the drm_file, so that a given userspace (like vulkan) > > >>> can manage the implicit sync slots explicitly. The other side of the > > >>> pipeline (compositor, other process or just different stage in a > > >>> media > > >>> pipeline in the same process) can then either do the same, or fully > > >>> participate in the implicit sync as implemented by the kernel by > > >>> default. > > >>> > > >>> By building on the existing flag for buffers we avoid any issues > > >>> with > > >>> opening up additional security concerns - anything this new flag > > >>> here > > >>> allows is already. > > >>> > > >>> All drivers which supports this concept of a userspace-specific > > >>> opt-out of implicit sync have a flag in their CS ioctl, but in > > >>> reality > > >>> that turned out to be a bit too inflexible. See the discussion > > >>> below, > > >>> let's try to do a bit better for amdgpu. > > >>> > > >>> This alone only allows us to completely avoid any stalls due to > > >>> implicit sync, it does not yet allow us to use implicit sync as a > > >>> strange form of IPC for sync_file. > > >>> > > >>> For that we need two more pieces: > > >>> > > >>> - a way to get the current implicit sync fences out of a buffer. > > >>> Could > > >>> be done in a driver ioctl, but everyone needs this, and > > >>> generally a > > >>> dma-buf is involved anyway to establish the sharing. 
So an > > >>> ioctl on > > >>> the dma-buf makes a ton more sense: > > >>> > > >>> > > >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210520190007.534046-4-jason%40jlekstrand.net%2Fdata=04%7C01%7Cchristian.koenig%40amd.com%7C83dbdd0a1eb8442cbf7108d9364db51e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637600529684040802%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=fbdwtutEj93anZp6Pshs277QoMTHZxIy0Yl54T95rCw%3Dreserved=0 > > >>> > > >>> Current drivers in upstream solves this by having the opt-out > > >>> flag > > >>> on their CS ioctl. This has the downside that very often the CS > > >>> which must actually stall for the implicit fence is run a while > > >>> after the implicit fence point was logically sampled per the > > >>> api > > >>> spec (vk passes an explicit syncobj around for that afaiui), > > >>> and so > > >>> results in oversync. Converting the implicit sync fences into a > > >>> snap-shot sync_file is actually accurate. > > >>> > > >>> - Simillar we need to be able to set the exclusive implicit fence. > > >>> Current drivers again do this with a CS ioctl flag, with again > > >>> the > > >>> same problems that the time the CS happens additional
Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi
On Wed, Jun 23, 2021 at 4:02 PM Christian König wrote: > > Am 23.06.21 um 15:49 schrieb Daniel Vetter: > > On Wed, Jun 23, 2021 at 3:44 PM Christian König > > wrote: > >> Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen: > >>> On Wed, Jun 23, 2021 at 2:59 PM Christian König > >>> wrote: > Am 23.06.21 um 14:18 schrieb Daniel Vetter: > > On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen > > wrote: > >> On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter > >> wrote: > >>> WARNING: Absolutely untested beyond "gcc isn't dying in agony". > >>> > >>> Implicit fencing done properly needs to treat the implicit fencing > >>> slots like a funny kind of IPC mailbox. In other words it needs to be > >>> explicitly. This is the only way it will mesh well with explicit > >>> fencing userspace like vk, and it's also the bare minimum required to > >>> be able to manage anything else that wants to use the same buffer on > >>> multiple engines in parallel, and still be able to share it through > >>> implicit sync. > >>> > >>> amdgpu completely lacks such an uapi. Fix this. > >>> > >>> Luckily the concept of ignoring implicit fences exists already, and > >>> takes care of all the complexities of making sure that non-optional > >>> fences (like bo moves) are not ignored. This support was added in > >>> > >>> commit 177ae09b5d699a5ebd1cafcee78889db968abf54 > >>> Author: Andres Rodriguez > >>> Date: Fri Sep 15 20:44:06 2017 -0400 > >>> > >>>drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2 > >>> > >>> Unfortuantely it's the wrong semantics, because it's a bo flag and > >>> disables implicit sync on an allocated buffer completely. > >>> > >>> We _do_ want implicit sync, but control it explicitly. For this we > >>> need a flag on the drm_file, so that a given userspace (like vulkan) > >>> can manage the implicit sync slots explicitly. The other side of the > >>> pipeline (compositor, other process or just different stage in a media > >>> pipeline in the same process) can then either do the same, or fully > >>> participate in the implicit sync as implemented by the kernel by > >>> default. > >>> > >>> By building on the existing flag for buffers we avoid any issues with > >>> opening up additional security concerns - anything this new flag here > >>> allows is already. > >>> > >>> All drivers which supports this concept of a userspace-specific > >>> opt-out of implicit sync have a flag in their CS ioctl, but in reality > >>> that turned out to be a bit too inflexible. See the discussion below, > >>> let's try to do a bit better for amdgpu. > >>> > >>> This alone only allows us to completely avoid any stalls due to > >>> implicit sync, it does not yet allow us to use implicit sync as a > >>> strange form of IPC for sync_file. > >>> > >>> For that we need two more pieces: > >>> > >>> - a way to get the current implicit sync fences out of a buffer. Could > >>> be done in a driver ioctl, but everyone needs this, and > >>> generally a > >>> dma-buf is involved anyway to establish the sharing. 
So an ioctl > >>> on > >>> the dma-buf makes a ton more sense: > >>> > >>> > >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210520190007.534046-4-jason%40jlekstrand.net%2Fdata=04%7C01%7Cchristian.koenig%40amd.com%7C83dbdd0a1eb8442cbf7108d9364db51e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637600529684040802%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=fbdwtutEj93anZp6Pshs277QoMTHZxIy0Yl54T95rCw%3Dreserved=0 > >>> > >>> Current drivers in upstream solves this by having the opt-out > >>> flag > >>> on their CS ioctl. This has the downside that very often the CS > >>> which must actually stall for the implicit fence is run a while > >>> after the implicit fence point was logically sampled per the api > >>> spec (vk passes an explicit syncobj around for that afaiui), and > >>> so > >>> results in oversync. Converting the implicit sync fences into a > >>> snap-shot sync_file is actually accurate. > >>> > >>> - Simillar we need to be able to set the exclusive implicit fence. > >>> Current drivers again do this with a CS ioctl flag, with again > >>> the > >>> same problems that the time the CS happens additional > >>> dependencies > >>> have been added. An explicit ioctl to only insert a sync_file > >>> (while > >>> respecting the rules for how exclusive and shared fence slots > >>> must > >>> be update in struct dma_resv) is much better. This is proposed > >>> here: > >>> > >>> > >>>
Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi
On Wed, Jun 23, 2021 at 3:44 PM Christian König wrote: > > Am 23.06.21 um 15:38 schrieb Bas Nieuwenhuizen: > > On Wed, Jun 23, 2021 at 2:59 PM Christian König > > wrote: > >> Am 23.06.21 um 14:18 schrieb Daniel Vetter: > >>> On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen > >>> wrote: > On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter > wrote: > > WARNING: Absolutely untested beyond "gcc isn't dying in agony". > > > > Implicit fencing done properly needs to treat the implicit fencing > > slots like a funny kind of IPC mailbox. In other words it needs to be > > explicitly. This is the only way it will mesh well with explicit > > fencing userspace like vk, and it's also the bare minimum required to > > be able to manage anything else that wants to use the same buffer on > > multiple engines in parallel, and still be able to share it through > > implicit sync. > > > > amdgpu completely lacks such an uapi. Fix this. > > > > Luckily the concept of ignoring implicit fences exists already, and > > takes care of all the complexities of making sure that non-optional > > fences (like bo moves) are not ignored. This support was added in > > > > commit 177ae09b5d699a5ebd1cafcee78889db968abf54 > > Author: Andres Rodriguez > > Date: Fri Sep 15 20:44:06 2017 -0400 > > > > drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2 > > > > Unfortuantely it's the wrong semantics, because it's a bo flag and > > disables implicit sync on an allocated buffer completely. > > > > We _do_ want implicit sync, but control it explicitly. For this we > > need a flag on the drm_file, so that a given userspace (like vulkan) > > can manage the implicit sync slots explicitly. The other side of the > > pipeline (compositor, other process or just different stage in a media > > pipeline in the same process) can then either do the same, or fully > > participate in the implicit sync as implemented by the kernel by > > default. > > > > By building on the existing flag for buffers we avoid any issues with > > opening up additional security concerns - anything this new flag here > > allows is already. > > > > All drivers which supports this concept of a userspace-specific > > opt-out of implicit sync have a flag in their CS ioctl, but in reality > > that turned out to be a bit too inflexible. See the discussion below, > > let's try to do a bit better for amdgpu. > > > > This alone only allows us to completely avoid any stalls due to > > implicit sync, it does not yet allow us to use implicit sync as a > > strange form of IPC for sync_file. > > > > For that we need two more pieces: > > > > - a way to get the current implicit sync fences out of a buffer. Could > > be done in a driver ioctl, but everyone needs this, and generally a > > dma-buf is involved anyway to establish the sharing. So an ioctl on > > the dma-buf makes a ton more sense: > > > > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210520190007.534046-4-jason%40jlekstrand.net%2Fdata=04%7C01%7Cchristian.koenig%40amd.com%7Ca401fc4551f045c95d8808d9364c38f6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637600523287217723%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=L8KCz8711Y2qZx0%2FJWT6HSg4o6OMhn%2BC4U2IR06nViE%3Dreserved=0 > > > > Current drivers in upstream solves this by having the opt-out flag > > on their CS ioctl. 
This has the downside that very often the CS > > which must actually stall for the implicit fence is run a while > > after the implicit fence point was logically sampled per the api > > spec (vk passes an explicit syncobj around for that afaiui), and so > > results in oversync. Converting the implicit sync fences into a > > snap-shot sync_file is actually accurate. > > > > - Simillar we need to be able to set the exclusive implicit fence. > > Current drivers again do this with a CS ioctl flag, with again the > > same problems that the time the CS happens additional dependencies > > have been added. An explicit ioctl to only insert a sync_file (while > > respecting the rules for how exclusive and shared fence slots must > > be update in struct dma_resv) is much better. This is proposed here: > > > > > >
Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi
On Wed, Jun 23, 2021 at 2:59 PM Christian König wrote: > > Am 23.06.21 um 14:18 schrieb Daniel Vetter: > > On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen > > wrote: > >> On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter > >> wrote: > >>> WARNING: Absolutely untested beyond "gcc isn't dying in agony". > >>> > >>> Implicit fencing done properly needs to treat the implicit fencing > >>> slots like a funny kind of IPC mailbox. In other words it needs to be > >>> explicitly. This is the only way it will mesh well with explicit > >>> fencing userspace like vk, and it's also the bare minimum required to > >>> be able to manage anything else that wants to use the same buffer on > >>> multiple engines in parallel, and still be able to share it through > >>> implicit sync. > >>> > >>> amdgpu completely lacks such an uapi. Fix this. > >>> > >>> Luckily the concept of ignoring implicit fences exists already, and > >>> takes care of all the complexities of making sure that non-optional > >>> fences (like bo moves) are not ignored. This support was added in > >>> > >>> commit 177ae09b5d699a5ebd1cafcee78889db968abf54 > >>> Author: Andres Rodriguez > >>> Date: Fri Sep 15 20:44:06 2017 -0400 > >>> > >>> drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2 > >>> > >>> Unfortuantely it's the wrong semantics, because it's a bo flag and > >>> disables implicit sync on an allocated buffer completely. > >>> > >>> We _do_ want implicit sync, but control it explicitly. For this we > >>> need a flag on the drm_file, so that a given userspace (like vulkan) > >>> can manage the implicit sync slots explicitly. The other side of the > >>> pipeline (compositor, other process or just different stage in a media > >>> pipeline in the same process) can then either do the same, or fully > >>> participate in the implicit sync as implemented by the kernel by > >>> default. > >>> > >>> By building on the existing flag for buffers we avoid any issues with > >>> opening up additional security concerns - anything this new flag here > >>> allows is already. > >>> > >>> All drivers which supports this concept of a userspace-specific > >>> opt-out of implicit sync have a flag in their CS ioctl, but in reality > >>> that turned out to be a bit too inflexible. See the discussion below, > >>> let's try to do a bit better for amdgpu. > >>> > >>> This alone only allows us to completely avoid any stalls due to > >>> implicit sync, it does not yet allow us to use implicit sync as a > >>> strange form of IPC for sync_file. > >>> > >>> For that we need two more pieces: > >>> > >>> - a way to get the current implicit sync fences out of a buffer. Could > >>>be done in a driver ioctl, but everyone needs this, and generally a > >>>dma-buf is involved anyway to establish the sharing. So an ioctl on > >>>the dma-buf makes a ton more sense: > >>> > >>> > >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210520190007.534046-4-jason%40jlekstrand.net%2Fdata=04%7C01%7Cchristian.koenig%40amd.com%7Cf026055f523d4e4df95b08d936410e39%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637600475351085536%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=gUnM8%2Fulx%2B%2BDLxByO%2F0V3cLqt%2Fc2unWjizEpptQqM8g%3Dreserved=0 > >>> > >>>Current drivers in upstream solves this by having the opt-out flag > >>>on their CS ioctl. 
This has the downside that very often the CS > >>>which must actually stall for the implicit fence is run a while > >>>after the implicit fence point was logically sampled per the api > >>>spec (vk passes an explicit syncobj around for that afaiui), and so > >>>results in oversync. Converting the implicit sync fences into a > >>>snap-shot sync_file is actually accurate. > >>> > >>> - Simillar we need to be able to set the exclusive implicit fence. > >>>Current drivers again do this with a CS ioctl flag, with again the > >>>same problems that the time the CS happens additional dependencies > >>>have been added. An explicit ioctl to only insert a sync_file (while > >>>respecting the rules for how exclusive and shared fence slots must > >>>be update in struct dma_resv) is much better. This is proposed here: > >>> > >>> > >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210520190007.534046-5-jason%40jlekstrand.net%2Fdata=04%7C01%7Cchristian.koenig%40amd.com%7Cf026055f523d4e4df95b08d936410e39%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637600475351085536%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=wFGNyeL1YSpkebf1L1DDb2euihf1fvmR9G8cfywrpVU%3Dreserved=0 > >>> > >>> These three pieces together allow userspace to fully control implicit > >>> fencing and remove all unecessary stall points due to them. > >>> > >>> Well, as much as the implicit fencing model fundamentally allows:
Re: [Mesa-dev] [PATCH 03/15] dma-buf: Document dma-buf implicit fencing/resv fencing rules
Am 22.06.21 um 18:54 schrieb Daniel Vetter: Docs for struct dma_resv are fairly clear: "A reservation object can have attached one exclusive fence (normally associated with write operations) or N shared fences (read operations)."

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html#reservation-objects

Furthermore, a review across all of upstream. First, the render drivers and how they set implicit fences (a condensed code sketch of this contract follows the list below):

- nouveau follows this contract, see in validate_fini_no_ticket() nouveau_bo_fence(nvbo, fence, !!b->write_domains); and that last boolean controls whether the exclusive or shared fence slot is used.

- radeon follows this contract by setting p->relocs[i].tv.num_shared = !r->write_domain; in radeon_cs_parser_relocs(), which ensures that the call to ttm_eu_fence_buffer_objects() in radeon_cs_parser_fini() will do the right thing.

- vmwgfx seems to follow this contract with the shotgun approach of always setting ttm_val_buf->num_shared = 0, which means ttm_eu_fence_buffer_objects() will only use the exclusive slot.

- etnaviv follows this contract, as can be trivially seen by looking at submit_attach_object_fences().

- i915 is a bit of a convoluted maze with multiple paths leading to i915_vma_move_to_active(), which sets the exclusive flag if EXEC_OBJECT_WRITE is set. This can either come as a buffer flag for softpin mode, or through the write_domain when using relocations. It follows this contract.

- lima follows this contract, see lima_gem_submit(), which sets the exclusive fence when the LIMA_SUBMIT_BO_WRITE flag is set for that bo.

- msm follows this contract, see msm_gpu_submit(), which sets the exclusive flag when MSM_SUBMIT_BO_WRITE is set for that buffer.

- panfrost follows this contract with the shotgun approach of just always setting the exclusive fence, see panfrost_attach_object_fences(). Benefits of a single engine, I guess.

- v3d follows this contract with the same shotgun approach in v3d_attach_fences_and_unlock_reservation(), but it has at least an XXX comment that maybe this should be improved.

- vc4 uses the same shotgun approach of always setting an exclusive fence, see vc4_update_bo_seqnos().

- vgem also follows this contract, see vgem_fence_attach_ioctl() and the VGEM_FENCE_WRITE flag. This is used in some igts to validate prime sharing with i915.ko without the need of a 2nd gpu.

- virtio follows this contract, again with the shotgun approach of always setting an exclusive fence, see virtio_gpu_array_add_fence().

This covers the setting of the exclusive fences when writing. Synchronizing against the exclusive fence is a lot more tricky, and I only spot checked a few:

- i915 does it, with the optional EXEC_OBJECT_ASYNC to skip all implicit dependencies (which is used by vulkan).

- etnaviv does this. Implicit dependencies are collected in submit_fence_sync(), again with an opt-out flag ETNA_SUBMIT_NO_IMPLICIT. These are then picked up in etnaviv_sched_dependency, which is the drm_sched_backend_ops->dependency callback.

- vc4 seems to not do much here, maybe gets away with it by not having a scheduler and only a single engine.
Since all newer broadcom chips than the OG vc4 use v3d for rendering, which follows this contract, the impact of this issue is fairly small.

- v3d does this using the drm_gem_fence_array_add_implicit() helper, which its drm_sched_backend_ops->dependency callback v3d_job_dependency() then picks up.

- panfrost is nice here and tracks the implicit fences in panfrost_job->implicit_fences, which again the drm_sched_backend_ops->dependency callback panfrost_job_dependency() picks up. It is mildly questionable though, since it only picks up exclusive fences in panfrost_acquire_object_fences(), but not buggy in practice because it also always sets the exclusive fence. It should pick up both sets of fences, just in case there's ever going to be a 2nd gpu in a SoC with a mali gpu. Or maybe a mali SoC with a pcie port and a real gpu, which might actually happen eventually. A bug, but easy to fix. Should probably use the drm_gem_fence_array_add_implicit() helper.

- lima is nice and easy, uses drm_gem_fence_array_add_implicit() and the same scheme as v3d.

- msm is mildly entertaining. It also supports MSM_SUBMIT_NO_IMPLICIT, but because it doesn't use the drm/scheduler it handles fences from the wrong context with a synchronous dma_fence_wait. See submit_fence_sync() leading to msm_gem_sync_object(). Investing into a scheduler might be a good idea.
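Condensed into code, the contract this review checks for looks roughly like the sketch below, using the 2021-era dma_resv and drm_gem helpers named above; the two wrapper functions themselves are illustrative, not code from any of these drivers:

    /* Sketch of the common driver pattern for the dma_resv contract:
     * writes go into the exclusive slot, reads into a shared slot, and
     * the dependency side collects the implicit fences before the job
     * runs. Error handling and locking are elided.
     */
    #include <drm/drm_gem.h>
    #include <linux/dma-resv.h>
    #include <linux/xarray.h>

    /* Publish the job's fence: exclusive for writers, shared for readers. */
    static void attach_job_fence(struct drm_gem_object *obj,
                                 struct dma_fence *fence, bool write)
    {
            if (write) {
                    dma_resv_add_excl_fence(obj->resv, fence);
            } else {
                    /* a shared slot must have been reserved beforehand
                     * with dma_resv_reserve_shared() */
                    dma_resv_add_shared_fence(obj->resv, fence);
            }
    }

    /* Dependency side: gather the implicit fences into the job's fence
     * array, as v3d and lima do. Readers only need to wait for the
     * exclusive fence; writers must wait for everything.
     */
    static int gather_implicit_fences(struct xarray *deps,
                                      struct drm_gem_object *obj, bool write)
    {
            return drm_gem_fence_array_add_implicit(deps, obj, write);
    }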
Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi
On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen wrote: > > On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter wrote:

> > [snip - patch description quoted in full above]
> > Aside from that caveat, this model gets implicit fencing as close to explicit fencing semantics as possible:

> > On the actual implementation I opted for a simple setparam ioctl, no locking (just atomic reads/writes) for simplicity. There is a nice flag parameter in the VM ioctl which we could use, except:
> > - it's not checked, so userspace likely passes garbage
> > - there's already a comment that userspace _does_ pass garbage in the priority field
> > So yeah, unfortunately this flag parameter for setting vm flags is useless, and we need to hack up a new one.

> > v2: Explain why a new SETPARAM (Jason)

> > v3: Bas noticed I forgot to hook up the dependency-side shortcut. We need both, or this doesn't do much.
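Since the patch body itself isn't quoted in this part of the thread, the shape of that setparam ioctl can only be sketched; every name below (the uapi struct, the param constant, the fpriv field) is a guess for illustration, not the RFC's actual uapi:

    /* Minimal sketch of the setparam described above: a single flag on
     * the driver's per-file private data, accessed locklessly with
     * READ_ONCE/WRITE_ONCE. All names here are illustrative guesses.
     */
    #include <linux/compiler.h>
    #include <linux/errno.h>
    #include <linux/types.h>
    #include <drm/drm_device.h>
    #include <drm/drm_file.h>

    #define AMDGPU_SETPARAM_NO_IMPLICIT_SYNC 1   /* hypothetical */

    struct drm_amdgpu_setparam {                 /* hypothetical uapi */
            __u32 param;
            __u32 pad;    /* checked, unlike the VM ioctl's flags field */
            __u64 value;  /* 0 or 1 */
    };

    struct my_fpriv {                    /* stands in for amdgpu_fpriv */
            bool no_implicit_sync;
    };

    static int setparam_ioctl(struct drm_device *dev, void *data,
                              struct drm_file *filp)
    {
            struct drm_amdgpu_setparam *param = data;
            struct my_fpriv *fpriv = filp->driver_priv;

            if (param->pad)
                    return -EINVAL;

            switch (param->param) {
            case AMDGPU_SETPARAM_NO_IMPLICIT_SYNC:
                    WRITE_ONCE(fpriv->no_implicit_sync, !!param->value);
                    return 0;
            default:
                    return -EINVAL;
            }
    }

    /* The CS path would then consult READ_ONCE(fpriv->no_implicit_sync)
     * on both ends, per the v3 note: when collecting implicit
     * dependencies (the dependency-side shortcut), and when deciding
     * whether to install the job's fence into the exclusive slot.
     */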
Re: [Mesa-dev] [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi
On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter wrote:

> [snip - patch description quoted in full above]
> v4: Rebase over the amdgpu patch to always set the implicit sync fences.

So I think there is still a case missing in this implementation. Consider these 3 cases (format: a->b: b waits on a. Yes, I know arrows are hard)