Re: [RFC] Implicit vs explicit user fence sync
On Wed, May 12, 2021 at 10:23:04AM +0200, Christian König wrote:
> Am 12.05.21 um 10:13 schrieb Daniel Vetter:
> > On Tue, May 11, 2021 at 09:34:11PM +0200, Christian König wrote:
> > > Am 11.05.21 um 18:48 schrieb Daniel Vetter:
> > > > [SNIP]
> > > > > Why?
> > > > If you allow implicit fencing then you can end up with
> > > > - an implicit userspace fence as the in-fence
> > > > - but an explicit dma_fence as the out-fence
> > > >
> > > > Which is not allowed. So there's really no way to make this work,
> > > > except if you stall in the ioctl, which also doesn't work.
> > > Ok, wait a second. I really don't understand what's going on here.
> > >
> > > The out-fence is just to let userspace know when the frame is
> > > displayed. Or rather when the old frame is no longer displayed so
> > > that it can be reused, right?
> > >
> > > Then why does that need to be a dma_fence? We don't use that for
> > > memory management anywhere, do we?
> > It can be a sync_file, so you can queue up the next rendering targeting
> > the old backbuffer right away. On memory-constrained devices where
> > triple-buffering is prohibitive this is apparently a pretty cool trick
> > or something like that.
>
> Yeah, I'm aware of that.
>
> But we don't really need that for the devices we want to interop with
> userspace queues, do we?
>
> > > > So you have to do an uapi change here. At that point we might as
> > > > well do it right.
> > > I mean in the worst case we might need to allow user fences with
> > > sync_files as well when that is really used outside of Android.
> > >
> > > But I still don't see the fundamental problem here.
> > Making sync_file use implicit is more pain, it becomes sprawling
> > pretty quickly.
>
> Agreed, but I don't see supporting sync_file and out-fences as
> something necessary.
>
> As far as I can see we don't need to support that burden for the use
> cases we have for implicit sync.
>
> And even if we have to, it is possible to implement.
>
> > Anyway I think we're just turning in circles. My take is that your
> > patch series here demonstrates nothing beyond "adding function calls
> > that do nothing is easy", the real deal is in making it work. And I
> > think it's easier to make this all work with explicit userspace
> > fencing first and then worry about how we maybe make this work
> > implicitly. Instead of what you propose, which is to rig up a lot of
> > scaffolding without having any idea whether it's even in the right
> > place. That seems very backwards to me.
>
> And that's what I disagree on.
>
> Supporting implicit sync in the kernel for the functionality we need
> for amdgpu is relatively easy.
>
> Changing all of userspace to not rely on implicit sync any more is
> really hard compared to that.
>
> Dropping implicit sync in userspace has a lot of advantages and should
> be pushed for, but not because it is a prerequisite of user fences.

Yeah, but if we support userspace fences with implicit sync only, because
that's right now the only thing amdgpu needs, then we make really, really
sure the transition to explicit sync will never happen.

Also, if the goal really is to only do the minimal thing to keep amdgpu
working on existing compositors, then imo the right solution is to just
do the kernel-augmented userspace submit I've proposed. Not this.

The kernel-augmented userspace submit keeps all the dma_fence stuff
working _exactly_ like it currently does, including drm/scheduler
sorting dependencies and everything. So from a "will it work" point of
view it's a much more solid solution. And it doesn't require us to
create an extremely warty uapi where, if you happen to use implicit
userspace fences, a lot of things kinda stop working for no reason.

So if your overall goal is to just not touch any compositor/winsys
protocol at all, there _is_ a very clean solution imo.
-Daniel

> Regards,
> Christian.

> > Plus I really don't like retro-fitting userspace fences into implicit
> > sync and sync_file and everything. But that's not the main issue I
> > have here.
> > -Daniel
> >
> > > Regards,
> > > Christian.
> > >
> > > > Of course if you only care about some specific compositors (or
> > > > maybe only the -amdgpu Xorg driver even) then this isn't a
> > > > concern, but atomic is cross-driver so we can't do that. Or at
> > > > least I don't see a way how to do this without causing endless
> > > > amounts of fun down the road.
> > > >
> > > > > > So I have a plan here, what was yours?
> > > > > As far as I see that should still work perfectly fine and I
> > > > > have the strong feeling I'm missing something here.
> > > > > > > Transporting fences between processes is not the
> > > > > > > fundamental problem here, but rather the question how we
> > > > > > > represent all this in the kernel?
> > > > > > >
> > > > > > > In other words I think what you outlined above is just
> > > > > > > approaching it from the wrong
Re: [RFC] Implicit vs explicit user fence sync
Am 12.05.21 um 10:13 schrieb Daniel Vetter:
On Tue, May 11, 2021 at 09:34:11PM +0200, Christian König wrote:
Am 11.05.21 um 18:48 schrieb Daniel Vetter:
[SNIP]

Why?

If you allow implicit fencing then you can end up with
- an implicit userspace fence as the in-fence
- but an explicit dma_fence as the out-fence

Which is not allowed. So there's really no way to make this work, except if you stall in the ioctl, which also doesn't work.

Ok, wait a second. I really don't understand what's going on here.

The out-fence is just to let userspace know when the frame is displayed. Or rather when the old frame is no longer displayed so that it can be reused, right?

Then why does that need to be a dma_fence? We don't use that for memory management anywhere, do we?

It can be a sync_file, so you can queue up the next rendering targeting the old backbuffer right away. On memory-constrained devices where triple-buffering is prohibitive this is apparently a pretty cool trick or something like that.

Yeah, I'm aware of that.

But we don't really need that for the devices we want to interop with userspace queues, do we?

So you have to do an uapi change here. At that point we might as well do it right.

I mean in the worst case we might need to allow user fences with sync_files as well when that is really used outside of Android.

But I still don't see the fundamental problem here.

Making sync_file use implicit is more pain, it becomes sprawling pretty quickly.

Agreed, but I don't see supporting sync_file and out-fences as something necessary.

As far as I can see we don't need to support that burden for the use cases we have for implicit sync.

And even if we have to, it is possible to implement.

Anyway I think we're just turning in circles. My take is that your patch series here demonstrates nothing beyond "adding function calls that do nothing is easy", the real deal is in making it work. And I think it's easier to make this all work with explicit userspace fencing first and then worry about how we maybe make this work implicitly. Instead of what you propose, which is to rig up a lot of scaffolding without having any idea whether it's even in the right place. That seems very backwards to me.

And that's what I disagree on.

Supporting implicit sync in the kernel for the functionality we need for amdgpu is relatively easy.

Changing all of userspace to not rely on implicit sync any more is really hard compared to that.

Dropping implicit sync in userspace has a lot of advantages and should be pushed for, but not because it is a prerequisite of user fences.

Regards,
Christian.

Plus I really don't like retro-fitting userspace fences into implicit sync and sync_file and everything. But that's not the main issue I have here.
-Daniel

Regards,
Christian.

Of course if you only care about some specific compositors (or maybe only the -amdgpu Xorg driver even) then this isn't a concern, but atomic is cross-driver so we can't do that. Or at least I don't see a way how to do this without causing endless amounts of fun down the road.

So I have a plan here, what was yours?

As far as I see that should still work perfectly fine and I have the strong feeling I'm missing something here.

Transporting fences between processes is not the fundamental problem here, but rather the question how we represent all this in the kernel?

In other words I think what you outlined above is just approaching it from the wrong side again. Instead of looking what the kernel needs to support this you take a look at userspace and the requirements there.

Uh ... that was my idea here? That's why I put "build userspace fences in userspace only" as the very first thing. Then extend to winsys and atomic/display and all these cases where things get more tricky.

I agree that transporting the fences is easy, which is why it's not interesting trying to solve that problem first. Which is kinda what you're trying to do here by adding implicit userspace fences (well not even that, just a bunch of function calls without any semantics attached to them).

So if there's more here, you need to flesh it out more or I just don't get what you're actually trying to demonstrate.

Well I'm trying to figure out why you see it as such a problem to keep implicit sync around.

As far as I can tell it is completely orthogonal whether we use implicit/explicit and dma_fence/user_fence.

It's just a different implementation inside the kernel.

See above. It falls apart with the atomic ioctl.
-Daniel
Re: [RFC] Implicit vs explicit user fence sync
On Tue, May 11, 2021 at 09:34:11PM +0200, Christian König wrote:
> Am 11.05.21 um 18:48 schrieb Daniel Vetter:
> > [SNIP]
> > > Why?
> > If you allow implicit fencing then you can end up with
> > - an implicit userspace fence as the in-fence
> > - but an explicit dma_fence as the out-fence
> >
> > Which is not allowed. So there's really no way to make this work,
> > except if you stall in the ioctl, which also doesn't work.
>
> Ok, wait a second. I really don't understand what's going on here.
>
> The out-fence is just to let userspace know when the frame is
> displayed. Or rather when the old frame is no longer displayed so that
> it can be reused, right?
>
> Then why does that need to be a dma_fence? We don't use that for memory
> management anywhere, do we?

It can be a sync_file, so you can queue up the next rendering targeting
the old backbuffer right away. On memory-constrained devices where
triple-buffering is prohibitive this is apparently a pretty cool trick
or something like that.

> > So you have to do an uapi change here. At that point we might as well
> > do it right.
>
> I mean in the worst case we might need to allow user fences with
> sync_files as well when that is really used outside of Android.
>
> But I still don't see the fundamental problem here.

Making sync_file use implicit is more pain, it becomes sprawling pretty
quickly.

Anyway I think we're just turning in circles. My take is that your patch
series here demonstrates nothing beyond "adding function calls that do
nothing is easy", the real deal is in making it work. And I think it's
easier to make this all work with explicit userspace fencing first and
then worry about how we maybe make this work implicitly. Instead of what
you propose, which is to rig up a lot of scaffolding without having any
idea whether it's even in the right place. That seems very backwards to
me.

Plus I really don't like retro-fitting userspace fences into implicit
sync and sync_file and everything. But that's not the main issue I have
here.
-Daniel

> Regards,
> Christian.
>
> > Of course if you only care about some specific compositors (or maybe
> > only the -amdgpu Xorg driver even) then this isn't a concern, but
> > atomic is cross-driver so we can't do that. Or at least I don't see a
> > way how to do this without causing endless amounts of fun down the
> > road.
> >
> > > > So I have a plan here, what was yours?
> > > As far as I see that should still work perfectly fine and I have
> > > the strong feeling I'm missing something here.
> > > > > Transporting fences between processes is not the fundamental
> > > > > problem here, but rather the question how we represent all this
> > > > > in the kernel?
> > > > >
> > > > > In other words I think what you outlined above is just
> > > > > approaching it from the wrong side again. Instead of looking
> > > > > what the kernel needs to support this you take a look at
> > > > > userspace and the requirements there.
> > > > Uh ... that was my idea here? That's why I put "build userspace
> > > > fences in userspace only" as the very first thing. Then extend to
> > > > winsys and atomic/display and all these cases where things get
> > > > more tricky.
> > > >
> > > > I agree that transporting the fences is easy, which is why it's
> > > > not interesting trying to solve that problem first. Which is
> > > > kinda what you're trying to do here by adding implicit userspace
> > > > fences (well not even that, just a bunch of function calls
> > > > without any semantics attached to them).
> > > >
> > > > So if there's more here, you need to flesh it out more or I just
> > > > don't get what you're actually trying to demonstrate.
> > > Well I'm trying to figure out why you see it as such a problem to
> > > keep implicit sync around.
> > >
> > > As far as I can tell it is completely orthogonal whether we use
> > > implicit/explicit and dma_fence/user_fence.
> > >
> > > It's just a different implementation inside the kernel.
> > See above. It falls apart with the atomic ioctl.
> > -Daniel

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [RFC] Implicit vs explicit user fence sync
Am 11.05.21 um 18:48 schrieb Daniel Vetter:
[SNIP]

Why?

If you allow implicit fencing then you can end up with
- an implicit userspace fence as the in-fence
- but an explicit dma_fence as the out-fence

Which is not allowed. So there's really no way to make this work, except if you stall in the ioctl, which also doesn't work.

Ok, wait a second. I really don't understand what's going on here.

The out-fence is just to let userspace know when the frame is displayed. Or rather when the old frame is no longer displayed so that it can be reused, right?

Then why does that need to be a dma_fence? We don't use that for memory management anywhere, do we?

So you have to do an uapi change here. At that point we might as well do it right.

I mean in the worst case we might need to allow user fences with sync_files as well when that is really used outside of Android.

But I still don't see the fundamental problem here.

Regards,
Christian.

Of course if you only care about some specific compositors (or maybe only the -amdgpu Xorg driver even) then this isn't a concern, but atomic is cross-driver so we can't do that. Or at least I don't see a way how to do this without causing endless amounts of fun down the road.

So I have a plan here, what was yours?

As far as I see that should still work perfectly fine and I have the strong feeling I'm missing something here.

Transporting fences between processes is not the fundamental problem here, but rather the question how we represent all this in the kernel?

In other words I think what you outlined above is just approaching it from the wrong side again. Instead of looking what the kernel needs to support this you take a look at userspace and the requirements there.

Uh ... that was my idea here? That's why I put "build userspace fences in userspace only" as the very first thing. Then extend to winsys and atomic/display and all these cases where things get more tricky.

I agree that transporting the fences is easy, which is why it's not interesting trying to solve that problem first. Which is kinda what you're trying to do here by adding implicit userspace fences (well not even that, just a bunch of function calls without any semantics attached to them).

So if there's more here, you need to flesh it out more or I just don't get what you're actually trying to demonstrate.

Well I'm trying to figure out why you see it as such a problem to keep implicit sync around.

As far as I can tell it is completely orthogonal whether we use implicit/explicit and dma_fence/user_fence.

It's just a different implementation inside the kernel.

See above. It falls apart with the atomic ioctl.
-Daniel
Re: [RFC] Implicit vs explicit user fence sync
On Tue, May 11, 2021 at 05:32:29PM +0200, Christian König wrote:
> Am 11.05.21 um 16:23 schrieb Daniel Vetter:
> > On Tue, May 11, 2021 at 09:47:56AM +0200, Christian König wrote:
> > > Am 11.05.21 um 09:31 schrieb Daniel Vetter:
> > > > [SNIP]
> > > > > > And that's just the one ioctl I know is big trouble, I'm sure
> > > > > > we'll find more funny corner cases when we roll out explicit
> > > > > > user fencing.
> > > > > I think we can just ignore sync_file. As far as it concerns me
> > > > > that UAPI is pretty much dead.
> > > > Uh that's rather bold. Android is built on it. Currently atomic
> > > > kms is built on it.
> > > To be honest I don't think we care about Android at all.
> > we = amd or we = upstream here?
>
> we = amd, for everybody else that is certainly a different topic.
>
> But for now AMD is the only one running into this problem.
>
> Could be that Nouveau sees this as well with the next hw generation,
> but who knows?
>
> > > > Why is this not much of a problem if it's just within one driver?
> > > Because inside the same driver I can easily add the waits before
> > > submitting the MM work as necessary.
> > What is MM work here now?
>
> MM=multimedia, e.g. UVD, VCE, VCN engines on AMD hardware.
>
> > > > > > > Adding implicit synchronization on top of that is then
> > > > > > > rather trivial.
> > > > > > Well that's what I disagree with, since I already see some
> > > > > > problems that I don't think we can overcome (the atomic ioctl
> > > > > > is one). And that's with us only having a fairly theoretical
> > > > > > understanding of the overall situation.
> > > > > But how should we then ever support user fences with the atomic
> > > > > IOCTL?
> > > > >
> > > > > We can't wait in user space since that will disable the support
> > > > > for waiting in the hardware.
> > > > Well, figure it out :-)
> > > >
> > > > This is exactly why I'm not seeing anything solved with just
> > > > rolling a function call to a bunch of places, because it's
> > > > pretending all things are solved when clearly that's not the case.
> > > >
> > > > I really think what we need is to first figure out how to support
> > > > userspace fences as explicit entities across the stack, maybe
> > > > with something like this order:
> > > > 1. enable them purely within a single userspace driver (like vk
> > > > with winsys disabled, or something else like that except not amd
> > > > because there's this amdkfd split for "real" compute)
> > > > 1a. including atomic ioctl, e.g. for vk direct display support
> > > > this can be used without cross-process sharing, new winsys
> > > > protocols and all that fun
> > > > 2. figure out how to transport these userspace fences with
> > > > something like drm_syncobj
> > > > 2a. figure out the compat story for drivers which don't do
> > > > userspace fences
> > > > 2b. figure out how to absorb the overhead if the
> > > > winsys/compositor doesn't support explicit sync
> > > > 3. maybe figure out how to make this all happen magically with
> > > > implicit sync, if we really, really care
> > > >
> > > > If we do 3 before we've nailed all these problems, we're just
> > > > guaranteeing we'll get the wrong solutions and so we'll then have
> > > > 3 ways of doing userspace fences:
> > > > - the butchered implicit one that didn't quite work
> > > > - the explicit one
> > > > - the not-so-butchered implicit one with the lessons from the
> > > > properly done explicit one
> > > >
> > > > The thing is, if you have no idea how to integrate userspace
> > > > fences explicitly into atomic ioctl, then you definitely have no
> > > > idea how to do it implicitly :-)
> > > Well I agree on that. But the question is still how would you do
> > > explicit with atomic?
> > If you supply a userspace fence (is that what we call them now) as
> > in-fence, then you're only allowed to get a userspace fence as
> > out-fence.
>
> Yeah, that part makes perfect sense. But I don't see the problem with
> that?
>
> > That way we
> > - don't block anywhere we shouldn't
> > - don't create a dma_fence out of a userspace fence
> >
> > The problem is this completely breaks your "magically make implicit
> > fencing with userspace fences" plan.
>
> Why?

If you allow implicit fencing then you can end up with
- an implicit userspace fence as the in-fence
- but an explicit dma_fence as the out-fence

Which is not allowed. So there's really no way to make this work, except
if you stall in the ioctl, which also doesn't work.

So you have to do an uapi change here. At that point we might as well do
it right.

Of course if you only care about some specific compositors (or maybe
only the -amdgpu Xorg driver even) then this isn't a concern, but atomic
is cross-driver so we can't do that. Or at least I don't see a way how
to do this without causing endless amounts of fun down the road.

> > So I have a plan here, what was yours?
Re: [RFC] Implicit vs explicit user fence sync
Am 11.05.21 um 16:23 schrieb Daniel Vetter:
On Tue, May 11, 2021 at 09:47:56AM +0200, Christian König wrote:
Am 11.05.21 um 09:31 schrieb Daniel Vetter:
[SNIP]

And that's just the one ioctl I know is big trouble, I'm sure we'll find more funny corner cases when we roll out explicit user fencing.

I think we can just ignore sync_file. As far as it concerns me that UAPI is pretty much dead.

Uh that's rather bold. Android is built on it. Currently atomic kms is built on it.

To be honest I don't think we care about Android at all.

we = amd or we = upstream here?

we = amd, for everybody else that is certainly a different topic.

But for now AMD is the only one running into this problem.

Could be that Nouveau sees this as well with the next hw generation, but who knows?

Why is this not much of a problem if it's just within one driver?

Because inside the same driver I can easily add the waits before submitting the MM work as necessary.

What is MM work here now?

MM=multimedia, e.g. UVD, VCE, VCN engines on AMD hardware.

Adding implicit synchronization on top of that is then rather trivial.

Well that's what I disagree with, since I already see some problems that I don't think we can overcome (the atomic ioctl is one). And that's with us only having a fairly theoretical understanding of the overall situation.

But how should we then ever support user fences with the atomic IOCTL?

We can't wait in user space since that will disable the support for waiting in the hardware.

Well, figure it out :-)

This is exactly why I'm not seeing anything solved with just rolling a function call to a bunch of places, because it's pretending all things are solved when clearly that's not the case.

I really think what we need is to first figure out how to support userspace fences as explicit entities across the stack, maybe with something like this order:
1. enable them purely within a single userspace driver (like vk with winsys disabled, or something else like that except not amd because there's this amdkfd split for "real" compute)
1a. including atomic ioctl, e.g. for vk direct display support this can be used without cross-process sharing, new winsys protocols and all that fun
2. figure out how to transport these userspace fences with something like drm_syncobj
2a. figure out the compat story for drivers which don't do userspace fences
2b. figure out how to absorb the overhead if the winsys/compositor doesn't support explicit sync
3. maybe figure out how to make this all happen magically with implicit sync, if we really, really care

If we do 3 before we've nailed all these problems, we're just guaranteeing we'll get the wrong solutions and so we'll then have 3 ways of doing userspace fences:
- the butchered implicit one that didn't quite work
- the explicit one
- the not-so-butchered implicit one with the lessons from the properly done explicit one

The thing is, if you have no idea how to integrate userspace fences explicitly into atomic ioctl, then you definitely have no idea how to do it implicitly :-)

Well I agree on that. But the question is still how would you do explicit with atomic?

If you supply a userspace fence (is that what we call them now) as in-fence, then you're only allowed to get a userspace fence as out-fence.

Yeah, that part makes perfect sense. But I don't see the problem with that?

That way we
- don't block anywhere we shouldn't
- don't create a dma_fence out of a userspace fence

The problem is this completely breaks your "magically make implicit fencing with userspace fences" plan.

Why?

So I have a plan here, what was yours?

As far as I see that should still work perfectly fine and I have the strong feeling I'm missing something here.

Transporting fences between processes is not the fundamental problem here, but rather the question how we represent all this in the kernel?

In other words I think what you outlined above is just approaching it from the wrong side again. Instead of looking what the kernel needs to support this you take a look at userspace and the requirements there.

Uh ... that was my idea here? That's why I put "build userspace fences in userspace only" as the very first thing. Then extend to winsys and atomic/display and all these cases where things get more tricky.

I agree that transporting the fences is easy, which is why it's not interesting trying to solve that problem first. Which is kinda what you're trying to do here by adding implicit userspace fences (well not even that, just a bunch of function calls without any semantics attached to them).

So if there's more here, you need to flesh it out more or I just don't get what you're actually trying to demonstrate.

Well I'm trying to figure out why you see it as such a problem to keep implicit sync around.

As far as I can tell it is completely orthogonal whether we use implicit/explicit and dma_fence/user_fence.

It's just a different implementation inside the kernel.

Christian.

-Daniel
Re: [RFC] Implicit vs explicit user fence sync
On Tue, May 11, 2021 at 09:47:56AM +0200, Christian König wrote:
> Am 11.05.21 um 09:31 schrieb Daniel Vetter:
> > [SNIP]
> > > > And that's just the one ioctl I know is big trouble, I'm sure
> > > > we'll find more funny corner cases when we roll out explicit user
> > > > fencing.
> > > I think we can just ignore sync_file. As far as it concerns me that
> > > UAPI is pretty much dead.
> > Uh that's rather bold. Android is built on it. Currently atomic kms
> > is built on it.
> To be honest I don't think we care about Android at all.

we = amd or we = upstream here?

> > > What we should support is drm_syncobj, but that also only as an
> > > in-fence since that's what our hardware supports.
> > Convince Android folks, minimally. Probably a lot more. Yes with
> > hindsight we should have just gone for drm_syncobj instead of the
> > sync_file thing, but hindsight and all that.
> >
> > This is kinda why I don't think trying to support the existing uapi
> > with userspace fences underneath with some magic tricks is a good
> > idea. It's just a pile of work, plus it's not really architecturally
> > clean.
> >
> > > > Another one that looks very sketchy right now is buffer sharing
> > > > between different userspace drivers, like compute <-> media (if
> > > > you have some fancy AI pipeline in your media workload, as an
> > > > example).
> > > Yeah, we are certainly going to get that. But only inside the same
> > > driver, so not much of a problem.
> > Why is this not much of a problem if it's just within one driver?
> Because inside the same driver I can easily add the waits before
> submitting the MM work as necessary.

What is MM work here now?

> > > > > Adding implicit synchronization on top of that is then rather
> > > > > trivial.
> > > > Well that's what I disagree with, since I already see some
> > > > problems that I don't think we can overcome (the atomic ioctl is
> > > > one). And that's with us only having a fairly theoretical
> > > > understanding of the overall situation.
> > > But how should we then ever support user fences with the atomic
> > > IOCTL?
> > >
> > > We can't wait in user space since that will disable the support for
> > > waiting in the hardware.
> > Well, figure it out :-)
> >
> > This is exactly why I'm not seeing anything solved with just rolling
> > a function call to a bunch of places, because it's pretending all
> > things are solved when clearly that's not the case.
> >
> > I really think what we need is to first figure out how to support
> > userspace fences as explicit entities across the stack, maybe with
> > something like this order:
> > 1. enable them purely within a single userspace driver (like vk with
> > winsys disabled, or something else like that except not amd because
> > there's this amdkfd split for "real" compute)
> > 1a. including atomic ioctl, e.g. for vk direct display support this
> > can be used without cross-process sharing, new winsys protocols and
> > all that fun
> > 2. figure out how to transport these userspace fences with something
> > like drm_syncobj
> > 2a. figure out the compat story for drivers which don't do userspace
> > fences
> > 2b. figure out how to absorb the overhead if the winsys/compositor
> > doesn't support explicit sync
> > 3. maybe figure out how to make this all happen magically with
> > implicit sync, if we really, really care
> >
> > If we do 3 before we've nailed all these problems, we're just
> > guaranteeing we'll get the wrong solutions and so we'll then have 3
> > ways of doing userspace fences:
> > - the butchered implicit one that didn't quite work
> > - the explicit one
> > - the not-so-butchered implicit one with the lessons from the
> > properly done explicit one
> >
> > The thing is, if you have no idea how to integrate userspace fences
> > explicitly into atomic ioctl, then you definitely have no idea how to
> > do it implicitly :-)
> Well I agree on that. But the question is still how would you do
> explicit with atomic?

If you supply a userspace fence (is that what we call them now) as
in-fence, then you're only allowed to get a userspace fence as
out-fence.

That way we
- don't block anywhere we shouldn't
- don't create a dma_fence out of a userspace fence

The problem is this completely breaks your "magically make implicit
fencing with userspace fences" plan.

So I have a plan here, what was yours?

> Transporting fences between processes is not the fundamental problem
> here, but rather the question how we represent all this in the kernel?
>
> In other words I think what you outlined above is just approaching it
> from the wrong side again. Instead of looking what the kernel needs to
> support this you take a look at userspace and the requirements there.

Uh ... that was my idea here? That's why I put "build userspace fences
in userspace only" as the very first thing. Then extend to winsys and
atomic/display and all these cases where things get more tricky.

I agree that transporting the fences is easy, which is why it's not
interesting trying to solve that problem first.
Re: [RFC] Implicit vs explicit user fence sync
Am 11.05.21 um 09:31 schrieb Daniel Vetter: [SNIP] And that's just the one ioctl I know is big trouble, I'm sure we'll find more funny corner cases when we roll out explicit user fencing. I think we can just ignore sync_file. As far as it concerns me that UAPI is pretty much dead. Uh that's rather bold. Android is built on it. Currently atomic kms is built on it. To be honest I don't think we care about Android at all. What we should support is drm_syncobj, but that also only as an in-fence since that's what our hardware supports. Convince Android folks, minimally. Probably a lot more. Yes with hindsight we should have just gone for drm_syncobj instead of the sync_file thing, but hindsight and all that. This is kinda why I don't think trying to support the existing uapi with userspace fences underneath with some magic tricks is a good idea. It's just a pile of work, plus it's not really architecturally clean. Anotherone that looks very sketchy right now is buffer sharing between different userspace drivers, like compute <-> media (if you have some fancy AI pipeline in your media workload, as an example). Yeah, we are certainly going to get that. But only inside the same driver, so not much of a problem. Why is this not much of a problem if it's just within one driver? Because inside the same driver I can easily add the waits before submitting the MM work as necessary. Adding implicit synchronization on top of that is then rather trivial. Well that's what I disagree with, since I already see some problems that I don't think we can overcome (the atomic ioctl is one). And that's with us only having a fairly theoretical understanding of the overall situation. But how should we then ever support user fences with the atomic IOCTL? We can't wait in user space since that will disable the support for waiting in the hardware. 
Well, figure it out :-) This is exactly why I'm not seeing anything solved with just rolling a function call to a bunch of places, because it's pretending all things are solved when clearly that's not the case. I really think what we need is to first figure out how to support userspace fences as explicit entities across the stack, maybe with something like this order:
1. enable them purely within a single userspace driver (like vk with winsys disabled, or something else like that except not amd because there's this amdkfd split for "real" compute)
1a. including atomic ioctl, e.g. for vk direct display support this can be used without cross-process sharing, new winsys protocols and all that fun
2. figure out how to transport these userspace fences with something like drm_syncobj
2a. figure out the compat story for drivers which don't do userspace fences
2b. figure out how to absorb the overhead if the winsys/compositor doesn't support explicit sync
3. maybe figure out how to make this all happen magically with implicit sync, if we really, really care
If we do 3 before we've nailed all these problems, we're just guaranteeing we'll get the wrong solutions and so we'll then have 3 ways of doing userspace fences
- the butchered implicit one that didn't quite work
- the explicit one
- the not-so-butchered implicit one with the lessons from the properly done explicit one
The thing is, if you have no idea how to integrate userspace fences explicitly into atomic ioctl, then you definitely have no idea how to do it implicitly :-) Well I agree on that. But the question is still how would you do explicit with atomic? Transporting fences between processes is not the fundamental problem here, but rather the question of how we represent all this in the kernel? In other words I think what you outlined above is just approaching it from the wrong side again. Instead of looking at what the kernel needs to support this, you take a look at userspace and the requirements there.
Regards, Christian. And "just block" might be good enough for a quick demo, it still breaks the contract. Same holds for a bunch of the winsys problems we'll have to deal with here. -Daniel Regards, Christian. Like here at intel we have internal code for compute, and we're starting to hit some interesting cases with interop with media already, but that's it. Nothing even close to desktop/winsys/kms, and that's where I expect all the pain will be. Cheers, Daniel
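[Editor's note] The in-fence/out-fence rule debated in this thread can be made concrete with a small toy model. This is purely illustrative C, not kernel code; the enum, the function name, and the two-kind classification are invented for the sketch. The point it encodes: mixing dma_fence and userspace fences as in-fences is fine, but once any in-fence is a userspace fence, handing out a dma_fence as the out-fence is forbidden, because that dma_fence could then only signal after an unreliable userspace fence does.

```c
#include <stdbool.h>

/* Illustrative model only; all names are made up for this sketch. */
enum fence_kind { FENCE_DMA, FENCE_USER };

bool out_fence_allowed(const enum fence_kind *in_fences, unsigned int num_in,
		       enum fence_kind out_fence)
{
	bool user_in = false;
	unsigned int i;

	for (i = 0; i < num_in; i++)
		if (in_fences[i] == FENCE_USER)
			user_in = true;

	/* A userspace in-fence forces a userspace out-fence. */
	if (user_in && out_fence == FENCE_DMA)
		return false;

	return true;
}
```

This is exactly the combination the atomic ioctl currently cannot express, since its out-fence is a sync_file (a dma_fence container).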
Re: [RFC] Implicit vs explicit user fence sync
On Mon, May 10, 2021 at 08:12:31PM +0200, Christian König wrote: > Am 04.05.21 um 17:11 schrieb Daniel Vetter: > > On Tue, May 04, 2021 at 04:26:42PM +0200, Christian König wrote: > > > Hi Daniel, > > > > > > Am 04.05.21 um 16:15 schrieb Daniel Vetter: > > > > Hi Christian, > > > > > > > > On Tue, May 04, 2021 at 03:27:17PM +0200, Christian König wrote: > > > > > Hi guys, > > > > > > > > > > with this patch set I want to look into how much more additional work > > > > > it > > > > > would be to support implicit sync compared to only explicit sync. > > > > > > > > > > Turned out that this is much simpler than expected since the only > > > > > addition is that before a command submission or flip the kernel and > > > > > classic drivers would need to wait for the user fence to signal before > > > > > taking any locks. > > > > It's a lot more I think > > > > - sync_file/drm_syncobj still need to be supported somehow > > > You need that with explicit fences as well. > > > > > > I'm just concentrating on what extra burden implicit sync would get us. > > It's not just implicit sync. Currently the best approach we have for > > explicit sync is hiding them in drm_syncobj. Because for that all the work > > with intentional stall points and userspace submit thread already exists. > > > > None of this work has been done for sync_file. And looking at how much > > work it was to get drm_syncobj going, that will be anything but easy. > > I don't think we will want this for sync_file in the first place. > > > > > - we need userspace to handle the stall in a submit thread at least > > > > - there's nothing here that sets the sync object > > > > - implicit sync isn't just execbuf, it's everything. E.g. the various > > > > wait_bo ioctl also need to keep working, including timeout and > > > > everything > > > Good point, but that should be relatively easy to add as well. > > > > > > > - we can't stall in atomic kms where you're currently stalling, that's > > > > for > > > > sure.
The uapi says "we're not stalling for fences in there", and > > > > you're > > > > breaking that. > > > Again as far as I can see we run into the same problem with explicit sync. > > > > > > So the question is where could we block for atomic modeset for user fences > > > in general? > > Nah, I have an idea. But it only works if userspace is aware, because the > > rules are essentially: > > > > - when you supply a userspace in-fence, then you only get a userspace > > out-fence > > - mixing in fences between dma-fence and user fence is ok > > - mixing out fences isn't > > > > And we currently do have sync_file out fence. So it's not possible to > > support implicit user fence in atomic in a way which doesn't break the > > uapi somewhere. > > > > Doing the explicit user fence support first will make that very obvious. > > > > And that's just the one ioctl I know is big trouble, I'm sure we'll find > > more funny corner cases when we roll out explicit user fencing. > > I think we can just ignore sync_file. As far as it concerns me that UAPI is > pretty much dead. Uh that's rather bold. Android is built on it. Currently atomic kms is built on it. > What we should support is drm_syncobj, but that also only as an in-fence > since that's what our hardware supports. Convince Android folks, minimally. Probably a lot more. Yes with hindsight we should have just gone for drm_syncobj instead of the sync_file thing, but hindsight and all that. This is kinda why I don't think trying to support the existing uapi with userspace fences underneath with some magic tricks is a good idea. It's just a pile of work, plus it's not really architecturally clean. > > Another one that looks very sketchy right now is buffer sharing between > > different userspace drivers, like compute <-> media (if you have some > > fancy AI pipeline in your media workload, as an example). > > Yeah, we are certainly going to get that. But only inside the same driver, > so not much of a problem.
Why is this not much of a problem if it's just within one driver? > > > > - ... at this point I stopped pondering but there's definitely more > > > > > > > > Imo the only way we'll even get the complete picture is if we do the following: > > > > 1. roll out implicit sync with userspace fences on a driver-by-driver > > > > basis > > s/implicit/explicit/ > > > > But I think you got that. > > > > > > 1a. including all the winsys/modeset stuff > > > Completely agree, that's why I've split that up into individual patches. > > > > > > I'm also fine if drivers can just opt out of user fence based > > > synchronization and we return an error from dma_buf_dynamic_attach() if > > > some > > > driver says it can't handle that. > > Yeah, but that boils down to us just breaking those use-cases. Which is > > exactly what you're trying to avoid by rolling out implicit user fence I > > think. > > But we can add support to all drivers as necessary.
Re: [RFC] Implicit vs explicit user fence sync
Am 04.05.21 um 17:11 schrieb Daniel Vetter: On Tue, May 04, 2021 at 04:26:42PM +0200, Christian König wrote: Hi Daniel, Am 04.05.21 um 16:15 schrieb Daniel Vetter: Hi Christian, On Tue, May 04, 2021 at 03:27:17PM +0200, Christian König wrote: Hi guys, with this patch set I want to look into how much more additional work it would be to support implicit sync compared to only explicit sync. Turned out that this is much simpler than expected since the only addition is that before a command submission or flip the kernel and classic drivers would need to wait for the user fence to signal before taking any locks. It's a lot more I think - sync_file/drm_syncobj still need to be supported somehow You need that with explicit fences as well. I'm just concentrating on what extra burden implicit sync would get us. It's not just implicit sync. Currently the best approach we have for explicit sync is hiding them in drm_syncobj. Because for that all the work with intentional stall points and userspace submit thread already exists. None of this work has been done for sync_file. And looking at how much work it was to get drm_syncobj going, that will be anything but easy. I don't think we will want this for sync_file in the first place. - we need userspace to handle the stall in a submit thread at least - there's nothing here that sets the sync object - implicit sync isn't just execbuf, it's everything. E.g. the various wait_bo ioctl also need to keep working, including timeout and everything Good point, but that should be relatively easy to add as well. - we can't stall in atomic kms where you're currently stalling, that's for sure. The uapi says "we're not stalling for fences in there", and you're breaking that. Again as far as I can see we run into the same problem with explicit sync. So the question is where could we block for atomic modeset for user fences in general? Nah, I have an idea.
But it only works if userspace is aware, because the rules are essentially: - when you supply a userspace in-fence, then you only get a userspace out-fence - mixing in fences between dma-fence and user fence is ok - mixing out fences isn't And we currently do have sync_file out fence. So it's not possible to support implicit user fence in atomic in a way which doesn't break the uapi somewhere. Doing the explicit user fence support first will make that very obvious. And that's just the one ioctl I know is big trouble, I'm sure we'll find more funny corner cases when we roll out explicit user fencing. I think we can just ignore sync_file. As far as it concerns me that UAPI is pretty much dead. What we should support is drm_syncobj, but that also only as an in-fence since that's what our hardware supports. Another one that looks very sketchy right now is buffer sharing between different userspace drivers, like compute <-> media (if you have some fancy AI pipeline in your media workload, as an example). Yeah, we are certainly going to get that. But only inside the same driver, so not much of a problem. - ... at this point I stopped pondering but there's definitely more Imo the only way we'll even get the complete picture is if we do the following: 1. roll out implicit sync with userspace fences on a driver-by-driver basis s/implicit/explicit/ But I think you got that. 1a. including all the winsys/modeset stuff Completely agree, that's why I've split that up into individual patches. I'm also fine if drivers can just opt out of user fence based synchronization and we return an error from dma_buf_dynamic_attach() if some driver says it can't handle that. Yeah, but that boils down to us just breaking those use-cases. Which is exactly what you're trying to avoid by rolling out implicit user fence I think. But we can add support to all drivers as necessary. 2.
roll out support for userspace fences to drm_syncobj timeline for interop, both across process/userspace and across drivers 2a. including all the winsys/modeset stuff, but hopefully that's largely solved with 1. already. Correct, but again we need this for explicit fencing as well. 3. only then try to figure out how to retroshoehorn this into implicit sync, and whether that even makes sense. Because doing 3 before we've done 1&2 for at least 2 drivers (2 because interop fun across drivers) is just praying that this time around we're not collectively idiots and can correctly predict the future. That never worked :-) For this prototype this patch set doesn't implement any user fence synchronization at all, but just assumes that faulting user pages is sufficient to make sure that we can wait for user space to finish submitting the work. If necessary this can be made even more strict, the only use case I could find which blocks this is the radeon driver and that should be handleable. This of course doesn't give you the same semantic as the classic implicit sync to
Re: [RFC] Implicit vs explicit user fence sync
On Tue, May 04, 2021 at 04:26:42PM +0200, Christian König wrote: > Hi Daniel, > > Am 04.05.21 um 16:15 schrieb Daniel Vetter: > > Hi Christian, > > > > On Tue, May 04, 2021 at 03:27:17PM +0200, Christian König wrote: > > > Hi guys, > > > > > > with this patch set I want to look into how much more additional work it > > > would be to support implicit sync compared to only explicit sync. > > > > > > Turned out that this is much simpler than expected since the only > > > addition is that before a command submission or flip the kernel and > > > classic drivers would need to wait for the user fence to signal before > > > taking any locks. > > It's a lot more I think > > - sync_file/drm_syncobj still need to be supported somehow > > You need that with explicit fences as well. > > I'm just concentrating on what extra burden implicit sync would get us. It's not just implicit sync. Currently the best approach we have for explicit sync is hiding them in drm_syncobj. Because for that all the work with intentional stall points and userspace submit thread already exists. None of this work has been done for sync_file. And looking at how much work it was to get drm_syncobj going, that will be anything but easy. > > - we need userspace to handle the stall in a submit thread at least > > - there's nothing here that sets the sync object > > - implicit sync isn't just execbuf, it's everything. E.g. the various > > wait_bo ioctl also need to keep working, including timeout and > > everything > > Good point, but that should be relatively easy to add as well. > > > - we can't stall in atomic kms where you're currently stalling, that's for > > sure. The uapi says "we're not stalling for fences in there", and you're > > breaking that. > > Again as far as I can see we run into the same problem with explicit sync. > > So the question is where could we block for atomic modeset for user fences > in general? Nah, I have an idea.
But it only works if userspace is aware, because the rules are essentially: - when you supply a userspace in-fence, then you only get a userspace out-fence - mixing in fences between dma-fence and user fence is ok - mixing out fences isn't And we currently do have sync_file out fence. So it's not possible to support implicit user fence in atomic in a way which doesn't break the uapi somewhere. Doing the explicit user fence support first will make that very obvious. And that's just the one ioctl I know is big trouble, I'm sure we'll find more funny corner cases when we roll out explicit user fencing. Another one that looks very sketchy right now is buffer sharing between different userspace drivers, like compute <-> media (if you have some fancy AI pipeline in your media workload, as an example). > > - ... at this point I stopped pondering but there's definitely more > > > > Imo the only way we'll even get the complete picture is if we do the following: > > 1. roll out implicit sync with userspace fences on a driver-by-driver basis s/implicit/explicit/ But I think you got that. > > 1a. including all the winsys/modeset stuff > > Completely agree, that's why I've split that up into individual patches. > > I'm also fine if drivers can just opt out of user fence based > synchronization and we return an error from dma_buf_dynamic_attach() if some > driver says it can't handle that. Yeah, but that boils down to us just breaking those use-cases. Which is exactly what you're trying to avoid by rolling out implicit user fence I think. > > 2. roll out support for userspace fences to drm_syncobj timeline for > > interop, both across process/userspace and across drivers > > 2a. including all the winsys/modeset stuff, but hopefully that's > > largely solved with 1. already. > > Correct, but again we need this for explicit fencing as well. > > > 3. only then try to figure out how to retroshoehorn this into implicit > > sync, and whether that even makes sense.
> > > > Because doing 3 before we've done 1&2 for at least 2 drivers (2 because > > interop fun across drivers) is just praying that this time around we're > > not collectively idiots and can correctly predict the future. That never > > worked :-) > > > > > For this prototype this patch set doesn't implement any user fence > > > synchronization at all, but just assumes that faulting user pages is > > > sufficient to make sure that we can wait for user space to finish > > > submitting the work. If necessary this can be made even more strict, the > > > only use case I could find which blocks this is the radeon driver and > > > that should be handleable. > > > > > > This of course doesn't give you the same semantic as the classic > > > implicit sync to guarantee that you have exclusive access to a buffer, > > > but this is also not necessary. > > > > > > So I think the conclusion should be that we don't need to concentrate on > > > implicit vs. explicit sync, but rather how to get the
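[Editor's note] The opt-out idea Christian mentions (failing dma_buf_dynamic_attach() when a driver can't handle user-fence based synchronization) can be sketched as a toy model. This is not the real dma-buf code: the struct, the function name, and the errno choice are all assumptions for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of the attach-time opt-out: if the importer relies on
 * user-fence based synchronization but the exporting driver declares
 * it cannot handle that, the attach fails instead of silently breaking
 * synchronization later.  All names invented; errno is a placeholder. */
struct exporter_caps {
	bool supports_user_fences;
};

int dynamic_attach_check(const struct exporter_caps *exporter,
			 bool importer_uses_user_fences)
{
	if (importer_uses_user_fences && !exporter->supports_user_fences)
		return -EOPNOTSUPP; /* placeholder errno choice */
	return 0;
}
```

The trade-off debated above is visible even in this sketch: returning an error here means the sharing use-case is simply refused, which is exactly the breakage Daniel points out.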
Re: [RFC] Implicit vs explicit user fence sync
Hi Daniel, Am 04.05.21 um 16:15 schrieb Daniel Vetter: Hi Christian, On Tue, May 04, 2021 at 03:27:17PM +0200, Christian König wrote: Hi guys, with this patch set I want to look into how much more additional work it would be to support implicit sync compared to only explicit sync. Turned out that this is much simpler than expected since the only addition is that before a command submission or flip the kernel and classic drivers would need to wait for the user fence to signal before taking any locks. It's a lot more I think - sync_file/drm_syncobj still need to be supported somehow You need that with explicit fences as well. I'm just concentrating on what extra burden implicit sync would get us. - we need userspace to handle the stall in a submit thread at least - there's nothing here that sets the sync object - implicit sync isn't just execbuf, it's everything. E.g. the various wait_bo ioctl also need to keep working, including timeout and everything Good point, but that should be relatively easy to add as well. - we can't stall in atomic kms where you're currently stalling, that's for sure. The uapi says "we're not stalling for fences in there", and you're breaking that. Again as far as I can see we run into the same problem with explicit sync. So the question is where could we block for atomic modeset for user fences in general? - ... at this point I stopped pondering but there's definitely more Imo the only way we'll even get the complete picture is if we do the following: 1. roll out implicit sync with userspace fences on a driver-by-driver basis 1a. including all the winsys/modeset stuff Completely agree, that's why I've split that up into individual patches. I'm also fine if drivers can just opt out of user fence based synchronization and we return an error from dma_buf_dynamic_attach() if some driver says it can't handle that. 2. roll out support for userspace fences to drm_syncobj timeline for interop, both across process/userspace and across drivers 2a.
including all the winsys/modeset stuff, but hopefully that's largely solved with 1. already. Correct, but again we need this for explicit fencing as well. 3. only then try to figure out how to retroshoehorn this into implicit sync, and whether that even makes sense. Because doing 3 before we've done 1&2 for at least 2 drivers (2 because interop fun across drivers) is just praying that this time around we're not collectively idiots and can correctly predict the future. That never worked :-) For this prototype this patch set doesn't implement any user fence synchronization at all, but just assumes that faulting user pages is sufficient to make sure that we can wait for user space to finish submitting the work. If necessary this can be made even more strict, the only use case I could find which blocks this is the radeon driver and that should be handleable. This of course doesn't give you the same semantic as the classic implicit sync to guarantee that you have exclusive access to a buffer, but this is also not necessary. So I think the conclusion should be that we don't need to concentrate on implicit vs. explicit sync, but rather how to get the synchronization and timeout signalling figured out in general. I'm not sure what exactly you're proving here aside from "it's possible to roll out a function with ill-defined semantics to all drivers". This really is a lot harder than just this one function and just this one patch set. No it isn't. The hard part is getting the user sync stuff up in general. Adding implicit synchronization on top of that is then rather trivial. Christian. -Daniel ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
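[Editor's note] Step 2 of the plan above leans on drm_syncobj *timelines* as the interop transport. To make that concrete, here is a minimal model of timeline semantics: a timeline is a monotonically increasing 64-bit payload, and a wait for point N completes once the payload has reached N. This is only an editor's sketch with invented names; the real object lives in drivers/gpu/drm/drm_syncobj.c and is driven through the syncobj timeline ioctls.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy timeline: the payload only ever moves forward. */
struct toy_timeline {
	uint64_t payload;
};

void toy_timeline_signal(struct toy_timeline *t, uint64_t point)
{
	if (point > t->payload) /* never rewind the timeline */
		t->payload = point;
}

bool toy_timeline_point_signaled(const struct toy_timeline *t, uint64_t point)
{
	return t->payload >= point;
}
```

The useful property for user fences is that a waiter can ask for a future point before the producer has even decided how to get there, which is exactly the decoupling that sync_file's one-shot fences lack.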
Re: [RFC] Implicit vs explicit user fence sync
Hi Christian, On Tue, May 04, 2021 at 03:27:17PM +0200, Christian König wrote: > Hi guys, > > with this patch set I want to look into how much more additional work it > would be to support implicit sync compared to only explicit sync. > > Turned out that this is much simpler than expected since the only > addition is that before a command submission or flip the kernel and > classic drivers would need to wait for the user fence to signal before > taking any locks. It's a lot more I think - sync_file/drm_syncobj still need to be supported somehow - we need userspace to handle the stall in a submit thread at least - there's nothing here that sets the sync object - implicit sync isn't just execbuf, it's everything. E.g. the various wait_bo ioctl also need to keep working, including timeout and everything - we can't stall in atomic kms where you're currently stalling, that's for sure. The uapi says "we're not stalling for fences in there", and you're breaking that. - ... at this point I stopped pondering but there's definitely more Imo the only way we'll even get the complete picture is if we do the following: 1. roll out implicit sync with userspace fences on a driver-by-driver basis 1a. including all the winsys/modeset stuff 2. roll out support for userspace fences to drm_syncobj timeline for interop, both across process/userspace and across drivers 2a. including all the winsys/modeset stuff, but hopefully that's largely solved with 1. already. 3. only then try to figure out how to retroshoehorn this into implicit sync, and whether that even makes sense. Because doing 3 before we've done 1&2 for at least 2 drivers (2 because interop fun across drivers) is just praying that this time around we're not collectively idiots and can correctly predict the future.
That never worked :-) > For this prototype this patch set doesn't implement any user fence > synchronization at all, but just assumes that faulting user pages is > sufficient to make sure that we can wait for user space to finish > submitting the work. If necessary this can be made even more strict, the > only use case I could find which blocks this is the radeon driver and > that should be handleable. > > This of course doesn't give you the same semantic as the classic > implicit sync to guarantee that you have exclusive access to a buffer, > but this is also not necessary. > > So I think the conclusion should be that we don't need to concentrate on > implicit vs. explicit sync, but rather how to get the synchronization > and timeout signalling figured out in general. I'm not sure what exactly you're proving here aside from "it's possible to roll out a function with ill-defined semantics to all drivers". This really is a lot harder than just this one function and just this one patch set. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
[RFC] Implicit vs explicit user fence sync
Hi guys, with this patch set I want to look into how much more additional work it would be to support implicit sync compared to only explicit sync. Turned out that this is much simpler than expected since the only addition is that before a command submission or flip the kernel and classic drivers would need to wait for the user fence to signal before taking any locks. For this prototype this patch set doesn't implement any user fence synchronization at all, but just assumes that faulting user pages is sufficient to make sure that we can wait for user space to finish submitting the work. If necessary this can be made even more strict, the only use case I could find which blocks this is the radeon driver and that should be handleable. This of course doesn't give you the same semantic as the classic implicit sync to guarantee that you have exclusive access to a buffer, but this is also not necessary. So I think the conclusion should be that we don't need to concentrate on implicit vs. explicit sync, but rather how to get the synchronization and timeout signalling figured out in general. Regards, Christian.