Re: [Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices
On Fri, Jun 8, 2012 at 3:56 PM, Erik Gilling <konk...@android.com> wrote:
>> I guess my other thought is that implicit vs explicit is not
>> mutually exclusive, though I'd guess there'd be interesting
>> deadlocks to have to debug if both were in use _at the same
>> time_. :-)
>
> I think this is an approach worth investigating. I'd like a way to
> either opt out of implicit sync or have a way to check if a dma-buf
> has an attached fence and detach it. Actually, that could work really
> well. Consider:
>
> * Each dma_buf has a single fence "slot"
> * on submission
>   * the driver will extract the fence from the dma_buf and queue a wait on it.
>   * the driver will replace that fence with its own completion fence
>     before the job submission ioctl returns.
> * dma_buf will have two userspace ioctls:
>   * DETACH: will return the fence as an FD to userspace and clear the
>     fence slot in the dma_buf
>   * ATTACH: takes a fence FD from userspace and attaches it to the
>     dma_buf fence slot. Returns an error if the fence slot is non-empty.
>
> In the android case, we can do a detach after every submission and an
> attach right before.

btw, I like this idea for implicit and explicit sync to coexist

BR,
-R
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel
[Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices
On Sat, Jun 9, 2012 at 12:22 AM, Erik Gilling wrote:
> The current linux graphics stack does not allow synchronization
> between the GPU and a camera/video decoder. When we've seen people
> try to support this behind the scenes, they get it wrong and introduce
> bugs that can take weeks to track down. As stated in the previous
> email, one of our goals is to centrally manage synchronization so that
> it's easier for people bringing up a platform to get it right.

I agree that letting everyone reinvent the wheel isn't the best idea for cross-device sync - people will just get it wrong way too often. I'm not convinced yet that doing it with explicit sync points/fences and in userspace is the best solution. dri2/gem all use implicit sync points managed by the kernel in a transparent fashion, so I'm leaning towards such a solution for cross-device sync, too. Imo the big upside of such an implicitly sync'ed approach is that it massively simplifies cross-process protocols (i.e. for the display server).

So to foster understanding of the various requirements and use-cases, could you elaborate on the pros and cons a bit and explain why you think explicit sync points managed by the userspace display server is the best approach for android?

Yours, Daniel
--
Daniel Vetter
daniel.vetter at ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch
[Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices
On Fri, Jun 08, 2012 at 01:56:05PM -0700, Erik Gilling wrote:
> On Thu, Jun 7, 2012 at 4:35 AM, Tom Cooksey wrote:
> > The alternate is to not associate sync objects with buffers and
> > have them be distinct entities, exposed to userspace. This gives
> > userspace more power and flexibility and might allow for use-cases
> > which an implicit synchronization mechanism can't satisfy - I'd
> > be curious to know any specifics here.
>
> Time and time again we've had problems with implicit synchronization
> resulting in bugs where different drivers play by slightly different
> implicit rules. We're convinced the best way to attack this problem
> is to move as much of the command and control of synchronization as
> possible into a single piece of code (the compositor in our case.) To
> facilitate this we're going to be mandating this explicit approach in
> the K release of Android.
>
> > However, every driver which
> > needs to participate in the synchronization mechanism will need
> > to have its interface with userspace modified to allow the sync
> > objects to be passed to the drivers. This seemed like a lot of
> > work to me, which is why I prefer the implicit approach. However
> > I don't actually know what work is needed and think it should be
> > explored. I.e. How much work is it to add explicit sync object
> > support to the DRM & v4l2 interfaces?
> >
> > E.g. I believe DRM/GEM's job dispatch API is "in-order"
> > in which case it might be easy to just add "wait for this fence"
> > and "signal this fence" ioctls. Seems like vmwgfx already has
> > something similar to this already? Could this work over having
> > to specify a list of sync objects to wait on and another list
> > of sync objects to signal for every operation (exec buf/page
> > flip)? What about for v4l2?
>
> If I understand you right a job submission with explicit sync would
> become 3 submissions:
> 1) submit wait for pre-req fence job
> 2) submit render job
> 3) submit signal ready fence job
>
> Does DRM provide a way to ensure these 3 jobs are submitted
> atomically? I also expect GPU vendors would like to get clever about
> GPU to GPU fence dependencies. That could probably be handled
> entirely in the userspace GL driver.

Well, drm doesn't provide any way to submit a job. These are all done in driver-private ioctls. And I guess with your proposal below we can do exactly what you want.

> > I guess my other thought is that implicit vs explicit is not
> > mutually exclusive, though I'd guess there'd be interesting
> > deadlocks to have to debug if both were in use _at the same
> > time_. :-)
>
> I think this is an approach worth investigating. I'd like a way to
> either opt out of implicit sync or have a way to check if a dma-buf
> has an attached fence and detach it. Actually, that could work really
> well. Consider:
>
> * Each dma_buf has a single fence "slot"
> * on submission
>   * the driver will extract the fence from the dma_buf and queue a wait on it.
>   * the driver will replace that fence with its own completion fence
>     before the job submission ioctl returns.

This is pretty much what I've had in mind with the extension that we probably need both a read and a write fence - in a lot of cases multiple people want to use a buffer for reads (e.g. when decoding video streams the decode needs it as a reference frame whereas later stages use it read-only, too).

> * dma_buf will have two userspace ioctls:
>   * DETACH: will return the fence as an FD to userspace and clear the
>     fence slot in the dma_buf
>   * ATTACH: takes a fence FD from userspace and attaches it to the
>     dma_buf fence slot. Returns an error if the fence slot is non-empty.

I am not yet sold on explicit fences, especially for cross-device sync. I do see uses for explicit fences that can be accessed from userspace for individual drivers - otherwise tricks like suballocation are a bit hard to pull off. But for cross-device buffer sharing I don't quite see the point, especially since the current Linux userspace graphics stack manages to do so without (e.g. DRI2 is all implicit sync'ed).

btw, I'll try to stitch together a more elaborate discussion over the w/e, I have a few more pet-peeves with your actual implementation ;-)

Yours, Daniel
--
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48
[Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices
On Fri, Jun 8, 2012 at 2:42 PM, Daniel Vetter wrote:
>> I think this is an approach worth investigating. I'd like a way to
>> either opt out of implicit sync or have a way to check if a dma-buf
>> has an attached fence and detach it. Actually, that could work really
>> well. Consider:
>>
>> * Each dma_buf has a single fence "slot"
>> * on submission
>>   * the driver will extract the fence from the dma_buf and queue a wait on it.
>>   * the driver will replace that fence with its own completion fence
>>     before the job submission ioctl returns.
>
> This is pretty much what I've had in mind with the extension that we
> probably need both a read and a write fence - in a lot of cases multiple
> people want to use a buffer for reads (e.g. when decoding video streams
> the decode needs it as a reference frame whereas later stages use it
> read-only, too).

I actually hit "send" instead of "save draft" on this before talking this over with some co-workers. We came up with the same issues. I'm actually less concerned about the specifics as long as we have a way to attach and detach the fences.

>> * dma_buf will have two userspace ioctls:
>>   * DETACH: will return the fence as an FD to userspace and clear the
>>     fence slot in the dma_buf
>>   * ATTACH: takes a fence FD from userspace and attaches it to the
>>     dma_buf fence slot. Returns an error if the fence slot is non-empty.
>
> I am not yet sold on explicit fences, especially for cross-device sync. I
> do see uses for explicit fences that can be accessed from userspace for
> individual drivers - otherwise tricks like suballocation are a bit hard to
> pull off. But for cross-device buffer sharing I don't quite see the point,
> especially since the current Linux userspace graphics stack manages to do
> so without (e.g. DRI2 is all implicit sync'ed).

The current linux graphics stack does not allow synchronization between the GPU and a camera/video decoder. When we've seen people try to support this behind the scenes, they get it wrong and introduce bugs that can take weeks to track down. As stated in the previous email, one of our goals is to centrally manage synchronization so that it's easier for people bringing up a platform to get it right.

> btw, I'll try to stitch together a more elaborate discussion over the w/e,
> I have a few more pet-peeves with your actual implementation ;-)

Happy to hear feedback on the specifics.

-Erik
[Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices
On Thu, Jun 7, 2012 at 4:35 AM, Tom Cooksey wrote:
> The alternate is to not associate sync objects with buffers and
> have them be distinct entities, exposed to userspace. This gives
> userspace more power and flexibility and might allow for use-cases
> which an implicit synchronization mechanism can't satisfy - I'd
> be curious to know any specifics here.

Time and time again we've had problems with implicit synchronization resulting in bugs where different drivers play by slightly different implicit rules. We're convinced the best way to attack this problem is to move as much of the command and control of synchronization as possible into a single piece of code (the compositor in our case.) To facilitate this we're going to be mandating this explicit approach in the K release of Android.

> However, every driver which
> needs to participate in the synchronization mechanism will need
> to have its interface with userspace modified to allow the sync
> objects to be passed to the drivers. This seemed like a lot of
> work to me, which is why I prefer the implicit approach. However
> I don't actually know what work is needed and think it should be
> explored. I.e. How much work is it to add explicit sync object
> support to the DRM & v4l2 interfaces?
>
> E.g. I believe DRM/GEM's job dispatch API is "in-order"
> in which case it might be easy to just add "wait for this fence"
> and "signal this fence" ioctls. Seems like vmwgfx already has
> something similar to this already? Could this work over having
> to specify a list of sync objects to wait on and another list
> of sync objects to signal for every operation (exec buf/page
> flip)? What about for v4l2?

If I understand you right a job submission with explicit sync would become 3 submissions:
1) submit wait for pre-req fence job
2) submit render job
3) submit signal ready fence job

Does DRM provide a way to ensure these 3 jobs are submitted atomically? I also expect GPU vendors would like to get clever about GPU to GPU fence dependencies. That could probably be handled entirely in the userspace GL driver.

> I guess my other thought is that implicit vs explicit is not
> mutually exclusive, though I'd guess there'd be interesting
> deadlocks to have to debug if both were in use _at the same
> time_. :-)

I think this is an approach worth investigating. I'd like a way to either opt out of implicit sync or have a way to check if a dma-buf has an attached fence and detach it. Actually, that could work really well. Consider:

* Each dma_buf has a single fence "slot"
* on submission
  * the driver will extract the fence from the dma_buf and queue a wait on it.
  * the driver will replace that fence with its own completion fence before the job submission ioctl returns.
* dma_buf will have two userspace ioctls:
  * DETACH: will return the fence as an FD to userspace and clear the fence slot in the dma_buf
  * ATTACH: takes a fence FD from userspace and attaches it to the dma_buf fence slot. Returns an error if the fence slot is non-empty.

In the android case, we can do a detach after every submission and an attach right before.

-Erik
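[Editorial aside: the "in-order dispatch" idea Tom floats — where "wait for fence" and "signal fence" are just ring entries bracketing the render job — can be sketched in a few lines. Because the ring executes strictly in order, the three submissions are atomic as long as they are enqueued under one lock. All names here are invented for illustration, not DRM API:]

```c
#include <assert.h>

enum op { OP_WAIT, OP_RENDER, OP_SIGNAL };

struct fence { int signalled; };
struct entry { enum op op; struct fence *f; };

/* Toy in-order job ring. */
struct ring { struct entry e[16]; int n; };

/* Enqueue the wait/render/signal triple as one unit (in a real driver
 * this would happen under the ring's submission lock, making it atomic
 * with respect to other submitters). */
static void submit_job(struct ring *r, struct fence *pre, struct fence *done)
{
    r->e[r->n++] = (struct entry){ OP_WAIT, pre };
    r->e[r->n++] = (struct entry){ OP_RENDER, 0 };
    r->e[r->n++] = (struct entry){ OP_SIGNAL, done };
}

/* Drain the ring in order; a wait entry on an unsignalled fence stalls
 * everything behind it. Returns the number of render jobs completed. */
static int run_ring(struct ring *r)
{
    int i, work = 0;
    for (i = 0; i < r->n; i++) {
        struct entry *e = &r->e[i];
        if (e->op == OP_WAIT && e->f && !e->f->signalled)
            return work;            /* stalled on the pre-req fence */
        if (e->op == OP_RENDER)
            work++;
        if (e->op == OP_SIGNAL)
            e->f->signalled = 1;
    }
    return work;
}

int main(void)
{
    struct ring r = { .n = 0 };
    struct fence pre = { 0 }, done = { 0 };

    submit_job(&r, &pre, &done);
    assert(run_ring(&r) == 0);      /* blocked until the pre-req fires */
    pre.signalled = 1;              /* e.g. the camera finishes writing */
    assert(run_ring(&r) == 1);      /* render runs, ready fence fires */
    assert(done.signalled == 1);
    return 0;
}
```

Daniel's follow-up point stands, though: DRM itself has no common submit path, so whether the triple can be made atomic is up to each driver's private ioctl.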
[Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices
> >>> The bigger issue is the previous point about how to deal
> >>> with cases where the CPU doesn't really need to get involved as an
> >>> intermediary.
> >>>
> >>> CPU fallback access to the buffer is the only legit case where we
> >>> need a standardized API to userspace (since CPU access isn't already
> >>> associated w/ some other kernel device file where some extra ioctl
> >>> can be added)
> >>
> >> The CPU case will still need to wait on an arbitrarily backed sync
> >> primitive. It shouldn't need to know if it's backed by the gpu,
> >> camera, or dsp.
> >
> > Right, this is the one place we definitely need something.. some
> > userspace code would just get passed a dmabuf file descriptor and
> > want to mmap it and do something, without really knowing where it
> > came from. I *guess* we'll have to add some ioctl's to the dmabuf
> > fd.
>
> I personally favor having sync primitives have their own anon inode
> vs. strictly coupling them with dma_buf.

I think this is really the crux of the matter - do we associate sync objects with buffers or not. The approach ARM are suggesting _is_ to associate the sync objects with the buffer and do this by adding kds_resource* as a member of struct dma_buf.

The main reason I want to do this is because it doesn't require changes to existing interfaces. Specifically, DRM/KMS & v4l2. These user/kernel interfaces already allow userspace to specify the handle of a buffer the driver should perform an operation on. What dma_buf has done is allowed those driver-specific buffer handles to be exported from one driver and imported into another. While new ioctls have been added to the v4l2 & DRM interfaces for dma_buf, they have only been to allow the import & export of driver-specific buffer objects. Once imported as a driver-specific buffer object, existing ioctls are re-used to perform operations on those buffers (at least this is what PRIME does for DRM, I'm not so sure about v4l2?). But my point is that no new "page flip to this dma_buf fd" ioctl has been added to KMS: you use the existing drm_mode_crtc_page_flip and specify an fb_id which has been imported from a dma_buf.

If we associate sync objects with buffers, none of those device-specific ioctls which perform operations on buffer objects need to be modified. It's just that internally, those drivers use kds or something similar to make sure they don't tread on each other's toes.

The alternate is to not associate sync objects with buffers and have them be distinct entities, exposed to userspace. This gives userspace more power and flexibility and might allow for use-cases which an implicit synchronization mechanism can't satisfy - I'd be curious to know any specifics here. However, every driver which needs to participate in the synchronization mechanism will need to have its interface with userspace modified to allow the sync objects to be passed to the drivers. This seemed like a lot of work to me, which is why I prefer the implicit approach. However I don't actually know what work is needed and think it should be explored. I.e. How much work is it to add explicit sync object support to the DRM & v4l2 interfaces?

E.g. I believe DRM/GEM's job dispatch API is "in-order", in which case it might be easy to just add "wait for this fence" and "signal this fence" ioctls. Seems like vmwgfx already has something similar to this already? Could this work over having to specify a list of sync objects to wait on and another list of sync objects to signal for every operation (exec buf/page flip)? What about for v4l2?

I guess my other thought is that implicit vs explicit is not mutually exclusive, though I'd guess there'd be interesting deadlocks to have to debug if both were in use _at the same time_. :-)

Cheers,
Tom
RE: [Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices
Hi All,

I'm the original designer of the KDS system that Tom posted while I was on paternity leave. Find my responses inline...

> -----Original Message-----
> From: linaro-mm-sig-boun...@lists.linaro.org [mailto:linaro-mm-sig-boun...@lists.linaro.org] On Behalf Of Rob Clark
> Sent: Monday, June 04, 2012 10:31 PM
> To: Tom Cooksey
> Cc: linaro-mm-...@lists.linaro.org; Pauli; dri-de...@lists.freedesktop.org
> Subject: Re: [Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices
>
> Some comments inline.. at this stage mostly superficial issues about how the API works, etc.. not had a chance to dig too much into the implementation yet (although some of my comments about the API would change those anyways).

It was the API we really wanted input on. The implementation still leaves a bit to be desired.

> Anyways, thanks for getting the ball rolling on this, and I think I can volunteer linaro to pick up and run w/ this if needed.
>
> On Fri, May 25, 2012 at 7:08 PM, Tom Cooksey <tom.cook...@arm.com> wrote:
> > Hi All,
> >
> > I realise it's been a while since this was last discussed, however I'd like to bring up kernel-side synchronization again. By kernel-side synchronization, I mean allowing multiple drivers/devices wanting to access the same buffer to do so without bouncing up to userspace to resolve dependencies such as "the display controller can't start scanning out a buffer until the GPU has finished rendering into it". As such, this is really just an optimization which reduces latency between, e.g., the GPU finishing a rendering job and that buffer being scanned out. I appreciate this particular example is already solved on desktop graphics cards, as the display controller and 3D core are both controlled by the same driver, so no "generic" mechanism is needed. However on ARM SoCs, the 3D core (like an ARM Mali) and the display controller tend to be driven by separate drivers, so some mechanism is needed to allow both drivers to synchronize their access to buffers.
> >
> > There are multiple ways synchronization can be achieved; fences/sync objects is one common approach, however we're presenting a different approach. Personally, I quite like fence sync objects, however we believe they require a lot of userspace interfaces to be changed to pass around sync object handles. Our hope is that the kds approach will require less effort to make use of, as no existing userspace interfaces need to be changed. E.g. to use explicit fences, struct drm_mode_crtc_page_flip would need new members to pass in the handle(s) of sync object(s) which the flip depends on (i.e. don't flip until these fences fire). The additional benefit of our approach is that it prevents userspace specifying dependency loops which can cause a deadlock (see kds.txt for an explanation of what I mean here).
> >
> > I have waited until now to bring this up again because I am now able to share the code I was trying (and failing, I think) to explain previously. The code has now been released under the GPLv2 on ARM Mali's developer portal, however I've attempted to turn that into a patch to allow it to be discussed on this list. Please find the patch inline below.
> >
> > While KDS defines a very generic mechanism, I am proposing that this code, or at least the concepts, be merged with the existing dma_buf code, so the struct kds_resource members get moved to struct dma_buf, kds_* functions get renamed to dma_buf_* functions, etc. So I guess what I'm saying is please don't review the actual code just yet, only the concepts the code describes, where kds_resource == dma_buf.
> >
> > Cheers,
> >
> > Tom
> >
> >
> > Author: Tom Cooksey <tom.cook...@arm.com>
> > Date:   Fri May 25 10:45:27 2012 +0100
> >
> >     Add new system to allow synchronizing access to resources
> >
> >     See Documentation/kds.txt for details, however the general idea is that this kds framework synchronizes multiple drivers ("clients") wanting to access the same resources, where a resource is typically a 2D image buffer being shared around using dma-buf.
> >
> >     Note: This patch is created by extracting the sources from the tarball on <http://www.malideveloper.com/open-source-mali-gpus-linux-kernel-device-drivers---dev-releases.php> and putting them in roughly the right places.
> >
> > diff --git a/Documentation/kds.txt b/Documentation/kds.txt

fwiw, I think the documentation could be made a bit more generic, but this and code style, etc. shouldn't be too hard to fix

> > new file mode 100644
> > index 000..a96db21
> > --- /dev/null
> > +++ b/Documentation/kds.txt
> > @@ -0,0 +1,113 @@
> > +#
> > +# (C) COPYRIGHT 2012 ARM Limited. All rights reserved.
> > +#
> > +# This program is free software and is provided to you under the terms of the GNU General Public License version 2
> > +# as published by the Free Software Foundation, and any use by you of this program
Re: [Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices
On Wed, Jun 6, 2012 at 6:33 AM, John Reitan <john.rei...@arm.com> wrote:
>> But maybe instead of inventing something new, we can just use 'struct
>> kthread_work' instead of 'struct kds_callback' plus the two 'void *'s?
>> If the user needs some extra args they can embed 'struct kthread_work'
>> in their own struct and use container_of() magic in the cb.
>>
>> Plus this is a natural fit if you want to dispatch callbacks instead
>> on a kthread_worker, which seems like it would simplify a few things
>> when it comes to deadlock avoidance.. ie., not resource deadlock
>> avoidance, but dispatching callbacks when some lock is held.
>
> That sounds like a better approach.
> Will make a cleaner API, will look into it.

When Tom visited us for android graphics camp in the fall, he argued that there were cases where we would want to avoid an extra schedule. Consider the case where the GPU is waiting for a render buffer that the display controller is using. If that render can be kicked off w/o acquiring locks, the display's vsync IRQ handler can call release, which in turn calls the GPU callback, which in turn kicks off the render very quickly w/o having to leave IRQ context.

One way around the locking issue with callbacks/async wait is to have async wait return a value to indicate that the resource has been acquired, instead of calling the callback. This is the approach I chose in our sync framework.

-Erik

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel
RE: [Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices
The bigger issue is the previous point about how to deal with cases where the CPU doesn't really need to get involved as an intermediary. CPU fallback access to the buffer is the only legit case where we need a standardized API to userspace (since CPU access isn't already associated w/ some other kernel device file where some extra ioctl can be added).

The CPU case will still need to wait on an arbitrarily backed sync primitive. It shouldn't need to know if it's backed by the gpu, camera, or dsp.

Right, this is the one place we definitely need something.. some userspace code would just get passed a dmabuf file descriptor and want to mmap it and do something, without really knowing where it came from. I *guess* we'll have to add some ioctl's to the dmabuf fd.

I personally favor having sync primitives have their own anon inode vs. strictly coupling them with dma_buf.

I think this is really the crux of the matter - do we associate sync objects with buffers or not. The approach ARM are suggesting _is_ to associate the sync objects with the buffer, and to do this by adding a kds_resource* as a member of struct dma_buf. The main reason I want to do this is because it doesn't require changes to existing interfaces, specifically DRM/KMS and v4l2. These user/kernel interfaces already allow userspace to specify the handle of a buffer the driver should perform an operation on. What dma_buf has done is allow those driver-specific buffer handles to be exported from one driver and imported into another. While new ioctls have been added to the v4l2 and DRM interfaces for dma_buf, they have only been to allow the import/export of driver-specific buffer objects. Once imported as a driver-specific buffer object, existing ioctls are re-used to perform operations on those buffers (at least this is what PRIME does for DRM; I'm not so sure about v4l2?).
But my point is that no new "page flip to this dma_buf fd" ioctl has been added to KMS; you use the existing drm_mode_crtc_page_flip and specify an fb_id which has been imported from a dma_buf. If we associate sync objects with buffers, none of those device-specific ioctls which perform operations on buffer objects need to be modified. It's just that internally, those drivers use kds or something similar to make sure they don't tread on each other's toes.

The alternative is to not associate sync objects with buffers and have them be distinct entities, exposed to userspace. This gives userspace more power and flexibility and might allow for use-cases which an implicit synchronization mechanism can't satisfy - I'd be curious to know any specifics here. However, every driver which needs to participate in the synchronization mechanism will need to have its interface with userspace modified to allow the sync objects to be passed to the drivers. This seemed like a lot of work to me, which is why I prefer the implicit approach. However, I don't actually know what work is needed and think it should be explored. I.e. how much work is it to add explicit sync object support to the DRM and v4l2 interfaces? E.g. I believe DRM/GEM's job dispatch API is in-order, in which case it might be easy to just add "wait for this fence" and "signal this fence" ioctls. Seems like vmwgfx already has something similar to this? Could this work over having to specify a list of sync objects to wait on and another list of sync objects to signal for every operation (exec buf/page flip)? What about for v4l2?

I guess my other thought is that implicit vs explicit is not mutually exclusive, though I'd guess there'd be interesting deadlocks to have to debug if both were in use _at the same time_. :-)

Cheers,

Tom