Re: [Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-06-10 Thread Clark, Rob
On Fri, Jun 8, 2012 at 3:56 PM, Erik Gilling konk...@android.com wrote:
> > I guess my other thought is that implicit vs explicit is not
> > mutually exclusive, though I'd guess there'd be interesting
> > deadlocks to have to debug if both were in use _at the same
> > time_. :-)
>
> I think this is an approach worth investigating.  I'd like a way to
> either opt out of implicit sync or have a way to check if a dma-buf
> has an attached fence and detach it.  Actually, that could work really
> well. Consider:
>
> * Each dma_buf has a single fence "slot"
> * on submission
>    * the driver will extract the fence from the dma_buf and queue a wait on it.
>    * the driver will replace that fence with its own completion
> fence before the job submission ioctl returns.
> * dma_buf will have two userspace ioctls:
>    * DETACH: will return the fence as an FD to userspace and clear the
> fence slot in the dma_buf
>    * ATTACH: takes a fence FD from userspace and attaches it to the
> dma_buf fence slot.  Returns an error if the fence slot is non-empty.
>
> In the android case, we can do a detach after every submission and an
> attach right before.

btw, I like this idea for implicit and explicit sync to coexist

BR,
-R
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


[Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-06-09 Thread Daniel Vetter
On Sat, Jun 9, 2012 at 12:22 AM, Erik Gilling  wrote:
> The current linux graphics stack does not allow synchronization
> between the GPU and a camera/video decoder.  When we've seen people
> try to support this behind the scenes, they get it wrong and introduce
> bugs that can take weeks to track down.  As stated in the previous
> email, one of our goals is to centrally manage synchronization so that
> it's easier for people bringing up a platform to get it right.

I agree that letting everyone reinvent the wheel isn't the best idea
for cross-device sync - people will just get it wrong way too often.
I'm not convinced yet that doing it with explicit sync points/fences
and in userspace is the best solution. dri2/gem all use implicit sync
points managed by the kernel in a transparent fashion, so I'm leaning
towards such a solution for cross-device sync, too. IMO the big upside
of such an implicitly sync'ed approach is that it massively simplifies
cross-process protocols (i.e. for the display server).

So to foster understanding of the various requirements and use-cases,
could you elaborate on the pros and cons a bit and explain why you
think explicit sync points managed by the userspace display server is
the best approach for android?

Yours, Daniel
-- 
Daniel Vetter
daniel.vetter at ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch


[Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-06-09 Thread Daniel Vetter
On Fri, Jun 08, 2012 at 01:56:05PM -0700, Erik Gilling wrote:
> On Thu, Jun 7, 2012 at 4:35 AM, Tom Cooksey  wrote:
> > The alternate is to not associate sync objects with buffers and
> > have them be distinct entities, exposed to userspace. This gives
> > userspace more power and flexibility and might allow for use-cases
> > which an implicit synchronization mechanism can't satisfy - I'd
> > be curious to know any specifics here.
> 
> Time and time again we've had problems with implicit synchronization
> resulting in bugs where different drivers play by slightly different
> implicit rules.  We're convinced the best way to attack this problem
> is to move as much of the command and control of synchronization as
> possible into a single piece of code (the compositor in our case.)  To
> facilitate this we're going to be mandating this explicit approach in
> the K release of Android.
> 
> > However, every driver which
> > needs to participate in the synchronization mechanism will need
> > to have its interface with userspace modified to allow the sync
> > objects to be passed to the drivers. This seemed like a lot of
> > work to me, which is why I prefer the implicit approach. However
> > I don't actually know what work is needed and think it should be
> > explored. I.e. How much work is it to add explicit sync object
> > support to the DRM & v4l2 interfaces?
> >
> > E.g. I believe DRM/GEM's job dispatch API is "in-order"
> > in which case it might be easy to just add "wait for this fence"
> > and "signal this fence" ioctls. Seems like vmwgfx already has
> > something similar to this already? Could this work over having
> > to specify a list of sync objects to wait on and another list
> > of sync objects to signal for every operation (exec buf/page
> > flip)? What about for v4l2?
> 
> If I understand you right a job submission with explicit sync would
> become 3 submissions:
> 1) submit wait for pre-req fence job
> 2) submit render job
> 3) submit signal ready fence job
> 
> Does DRM provide a way to ensure these 3 jobs are submitted
> atomically?  I also expect GPU vendors would like to get clever about
> GPU to GPU fence dependencies.  That could probably be handled
> entirely in the userspace GL driver.

Well, drm doesn't provide any way to submit a job. These are all done in
driver-private ioctls. And I guess with your proposal below we can do
exactly what you want.

> > I guess my other thought is that implicit vs explicit is not
> > mutually exclusive, though I'd guess there'd be interesting
> > deadlocks to have to debug if both were in use _at the same
> > time_. :-)
> 
> I think this is an approach worth investigating.  I'd like a way to
> either opt out of implicit sync or have a way to check if a dma-buf
> has an attached fence and detach it.  Actually, that could work really
> well. Consider:
> 
> * Each dma_buf has a single fence "slot"
> * on submission
>    * the driver will extract the fence from the dma_buf and queue a wait on it.
>    * the driver will replace that fence with its own completion
> fence before the job submission ioctl returns.

This is pretty much what I've had in mind with the extension that we
probably need both a read and a write fence - in a lot of cases multiple
people want to use a buffer for reads (e.g. when decoding video streams
the decoder needs it as a reference frame whereas later stages use it
read-only, too).

> * dma_buf will have two userspace ioctls:
>* DETACH: will return the fence as an FD to userspace and clear the
> fence slot in the dma_buf
>* ATTACH: takes a fence FD from userspace and attaches it to the
> dma_buf fence slot.  Returns an error if the fence slot is non-empty.

I am not yet sold on explicit fences, especially for cross-device sync. I
do see uses for explicit fences that can be accessed from userspace for
individual drivers - otherwise tricks like suballocation are a bit hard to
pull off. But for cross-device buffer sharing I don't quite see the point,
especially since the current Linux userspace graphics stack manages to do
so without (e.g. DRI2 is all implicit sync'ed).

btw, I'll try to stitch together a more elaborate discussion over the w/e,
I have a few more pet-peeves with your actual implementation ;-)

Yours, Daniel
-- 
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48


[Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-06-08 Thread Erik Gilling
On Fri, Jun 8, 2012 at 2:42 PM, Daniel Vetter  wrote:
>> I think this is an approach worth investigating.  I'd like a way to
>> either opt out of implicit sync or have a way to check if a dma-buf
>> has an attached fence and detach it.  Actually, that could work really
>> well. Consider:
>>
>> * Each dma_buf has a single fence "slot"
>> * on submission
>>    * the driver will extract the fence from the dma_buf and queue a wait on it.
>>    * the driver will replace that fence with its own completion
>> fence before the job submission ioctl returns.
>
> This is pretty much what I've had in mind with the extension that we
> probably need both a read and a write fence - in a lot of cases multiple
> people want to use a buffer for reads (e.g. when decoding video streams
> the decoder needs it as a reference frame whereas later stages use it
> read-only, too).

I actually hit "send" instead of "save draft" on this before talking
this over with some co-workers.  We came up with the same issues.  I'm
actually less concerned about the specifics as long as we have a way
to attach and detach the fences.

>> * dma_buf will have two userspace ioctls:
>>    * DETACH: will return the fence as an FD to userspace and clear the
>> fence slot in the dma_buf
>>    * ATTACH: takes a fence FD from userspace and attaches it to the
>> dma_buf fence slot.  Returns an error if the fence slot is non-empty.
>
> I am not yet sold on explicit fences, especially for cross-device sync. I
> do see uses for explicit fences that can be accessed from userspace for
> individual drivers - otherwise tricks like suballocation are a bit hard to
> pull off. But for cross-device buffer sharing I don't quite see the point,
> especially since the current Linux userspace graphics stack manages to do
> so without (e.g. DRI2 is all implicit sync'ed).

The current linux graphics stack does not allow synchronization
between the GPU and a camera/video decoder.  When we've seen people
try to support this behind the scenes, they get it wrong and introduce
bugs that can take weeks to track down.  As stated in the previous
email, one of our goals is to centrally manage synchronization so that
it's easier for people bringing up a platform to get it right.

> btw, I'll try to stitch together a more elaborate discussion over the w/e,
> I have a few more pet-peeves with your actual implementation ;-)

Happy to hear feedback on the specifics.

-Erik


[Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-06-08 Thread Erik Gilling
On Thu, Jun 7, 2012 at 4:35 AM, Tom Cooksey  wrote:
> The alternate is to not associate sync objects with buffers and
> have them be distinct entities, exposed to userspace. This gives
> userspace more power and flexibility and might allow for use-cases
> which an implicit synchronization mechanism can't satisfy - I'd
> be curious to know any specifics here.

Time and time again we've had problems with implicit synchronization
resulting in bugs where different drivers play by slightly different
implicit rules.  We're convinced the best way to attack this problem
is to move as much of the command and control of synchronization as
possible into a single piece of code (the compositor in our case.)  To
facilitate this we're going to be mandating this explicit approach in
the K release of Android.

> However, every driver which
> needs to participate in the synchronization mechanism will need
> to have its interface with userspace modified to allow the sync
> objects to be passed to the drivers. This seemed like a lot of
> work to me, which is why I prefer the implicit approach. However
> I don't actually know what work is needed and think it should be
> explored. I.e. How much work is it to add explicit sync object
> support to the DRM & v4l2 interfaces?
>
> E.g. I believe DRM/GEM's job dispatch API is "in-order"
> in which case it might be easy to just add "wait for this fence"
> and "signal this fence" ioctls. Seems like vmwgfx already has
> something similar to this already? Could this work over having
> to specify a list of sync objects to wait on and another list
> of sync objects to signal for every operation (exec buf/page
> flip)? What about for v4l2?

If I understand you right a job submission with explicit sync would
become 3 submissions:
1) submit wait for pre-req fence job
2) submit render job
3) submit signal ready fence job

Does DRM provide a way to ensure these 3 jobs are submitted
atomically?  I also expect GPU vendors would like to get clever about
GPU to GPU fence dependencies.  That could probably be handled
entirely in the userspace GL driver.

> I guess my other thought is that implicit vs explicit is not
> mutually exclusive, though I'd guess there'd be interesting
> deadlocks to have to debug if both were in use _at the same
> time_. :-)

I think this is an approach worth investigating.  I'd like a way to
either opt out of implicit sync or have a way to check if a dma-buf
has an attached fence and detach it.  Actually, that could work really
well. Consider:

* Each dma_buf has a single fence "slot"
* on submission
   * the driver will extract the fence from the dma_buf and queue a wait on it.
   * the driver will replace that fence with its own completion
fence before the job submission ioctl returns.
* dma_buf will have two userspace ioctls:
   * DETACH: will return the fence as an FD to userspace and clear the
fence slot in the dma_buf
   * ATTACH: takes a fence FD from userspace and attaches it to the
dma_buf fence slot.  Returns an error if the fence slot is non-empty.

In the android case, we can do a detach after every submission and an
attach right before.

-Erik


[Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-06-07 Thread Tom Cooksey


> >>> The bigger issue is the previous point about how to deal
> >>> with cases where the CPU doesn't really need to get involved as an
> >>> intermediary.
> >>>
> >>> CPU fallback access to the buffer is the only legit case where we
> >>> need a standardized API to userspace (since CPU access isn't already
> >>> associated w/ some other kernel device file where some extra ioctl
> >>> can be added)
> >>
> >> The CPU case will still need to wait on an arbitrarily backed sync
> >> primitive.  It shouldn't need to know if it's backed by the gpu,
> >> camera, or dsp.
> >
> > Right, this is the one place we definitely need something.. some
> > userspace code would just get passed a dmabuf file descriptor and
> > want to mmap it and do something, without really knowing where it
> > came from.  I *guess* we'll have to add some ioctls to the dmabuf
> > fd.
> 
> I personally favor having sync primitives have their own anon inode
> vs. strictly coupling them with dma_buf.

I think this is really the crux of the matter - do we associate sync
objects with buffers or not. The approach ARM are suggesting _is_ to
associate the sync objects with the buffer and do this by adding
kds_resource* as a member of struct dma_buf. The main reason I want
to do this is because it doesn't require changes to existing
interfaces. Specifically, DRM/KMS & v4l2. These user/kernel interfaces
already allow userspace to specify the handle of a buffer the driver
should perform an operation on. What dma_buf has done is allowed those
driver-specific buffer handles to be exported from one driver and
imported into another. While new ioctls have been added to the v4l2 &
DRM interfaces for dma_buf, they have only been to allow the import &
export of driver-specific buffer objects. Once imported as a driver
specific buffer object, existing ioctls are re-used to perform
operations on those buffers (at least this is what PRIME does for DRM,
I'm not so sure about v4l2?). But my point is that no new "page flip
to this dma_buf fd" ioctl has been added to KMS, you use the existing
drm_mode_crtc_page_flip and specify an fb_id which has been imported
from a dma_buf.

If we associate sync objects with buffers, none of those device
specific ioctls which perform operations on buffer objects need to
be modified. It's just that internally, those drivers use kds or
something similar to make sure they don't tread on each other's
toes.

The alternate is to not associate sync objects with buffers and
have them be distinct entities, exposed to userspace. This gives
userspace more power and flexibility and might allow for use-cases
which an implicit synchronization mechanism can't satisfy - I'd
be curious to know any specifics here. However, every driver which
needs to participate in the synchronization mechanism will need
to have its interface with userspace modified to allow the sync
objects to be passed to the drivers. This seemed like a lot of
work to me, which is why I prefer the implicit approach. However
I don't actually know what work is needed and think it should be
explored. I.e. How much work is it to add explicit sync object
support to the DRM & v4l2 interfaces?

E.g. I believe DRM/GEM's job dispatch API is "in-order"
in which case it might be easy to just add "wait for this fence"
and "signal this fence" ioctls. Seems like vmwgfx already has
something similar to this already? Could this work over having
to specify a list of sync objects to wait on and another list
of sync objects to signal for every operation (exec buf/page
flip)? What about for v4l2?

I guess my other thought is that implicit vs explicit is not
mutually exclusive, though I'd guess there'd be interesting
deadlocks to have to debug if both were in use _at the same
time_. :-)


Cheers,

Tom






RE: [Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-06-07 Thread John Reitan
Hi All,

I'm the original designer of the KDS system that Tom posted 
while I was on paternity leave. Find my responses inline...

 -Original Message-
 From: linaro-mm-sig-boun...@lists.linaro.org [mailto:linaro-mm-sig-
 boun...@lists.linaro.org] On Behalf Of Rob Clark
 Sent: Monday, June 04, 2012 10:31 PM
 To: Tom Cooksey
 Cc: linaro-mm-...@lists.linaro.org; Pauli; dri-
 de...@lists.freedesktop.org
 Subject: Re: [Linaro-mm-sig] [RFC] Synchronizing access to buffers
 shared with dma-buf between drivers/devices
 
 Some comments inline.. at this stage mostly superficial issues about
 how the API works, etc..  not had a chance to dig too much into the
 implementation yet (although some of my comments about the API would
 change those anyways).

It was the API we really wanted input on.
The implementation still leaves a bit to be desired. 

 Anyways, thanks for getting the ball rolling on this, and I think I
 can volunteer linaro to pick up and run w/ this if needed.
 
 On Fri, May 25, 2012 at 7:08 PM, Tom Cooksey tom.cook...@arm.com wrote:
  Hi All,
 
  I realise it's been a while since this was last discussed, however I'd
  like to bring up kernel-side synchronization again. By kernel-side
  synchronization, I mean allowing multiple drivers/devices wanting to
  access the same buffer to do so without bouncing up to userspace to
  resolve dependencies such as the display controller can't start
  scanning out a buffer until the GPU has finished rendering into it.
  As such, this is really just an optimization which reduces latency
  between e.g. the GPU finishing a rendering job and that buffer being
  scanned out. I appreciate this particular example is already solved on
  desktop graphics cards as the display controller and 3D core are both
  controlled by the same driver, so no generic mechanism is needed.
  However on ARM SoCs, the 3D core (like an ARM Mali) and display
  controller tend to be driven by separate drivers, so some mechanism is
  needed to allow both drivers to synchronize their access to buffers.

  There are multiple ways synchronization can be achieved; fences/sync
  objects are one common approach, however we're presenting a different
  approach. Personally, I quite like fence sync objects, however we
  believe they require a lot of userspace interfaces to be changed to
  pass around sync object handles. Our hope is that the kds approach
  will require less effort to make use of, as no existing userspace
  interfaces need to be changed. E.g. to use explicit fences, struct
  drm_mode_crtc_page_flip would need new members to pass in the
  handle(s) of sync object(s) which the flip depends on (i.e. don't
  flip until these fences fire). The additional benefit of our approach
  is that it prevents userspace specifying dependency loops which can
  cause a deadlock (see kds.txt for an explanation of what I mean here).

  I have waited until now to bring this up again because I am now able
  to share the code I was trying (and failing, I think) to explain
  previously. The code has now been released under the GPLv2 from ARM
  Mali's developer portal, however I've attempted to turn that into a
  patch to allow it to be
 discussed
  on this list. Please find the patch inline below.
 
  While KDS defines a very generic mechanism, I am proposing that this
 code or
  at least the concepts be merged with the existing dma_buf code, so a
 the
  struct kds_resource members get moved to struct dma_buf, kds_*
 functions get
  renamed to dma_buf_* functions, etc. So I guess what I'm saying is
 please
  don't review the actual code just yet, only the concepts the code
 describes,
  where kds_resource == dma_duf.
 
 
  Cheers,
 
  Tom
 
 
 
  Author: Tom Cooksey tom.cook...@arm.com
  Date:   Fri May 25 10:45:27 2012 +0100
 
 Add new system to allow synchronizing access to resources
 
 See Documentation/kds.txt for details, however the general
 idea is that this kds framework synchronizes multiple drivers
 (clients) wanting to access the same resources, where a
 resource is typically a 2D image buffer being shared around
 using dma-buf.
 
  Note: This patch is created by extracting the sources from the
  tarball on http://www.malideveloper.com/open-source-mali-gpus-linux-kernel-device-drivers---dev-releases.php
  and putting them in roughly the right places.
 
  diff --git a/Documentation/kds.txt b/Documentation/kds.txt
 
 fwiw, I think the documentation could be made a bit more generic, but
 this and code style, etc shouldn't be too hard to fix
 
  new file mode 100644
  index 000..a96db21
  --- /dev/null
  +++ b/Documentation/kds.txt
  @@ -0,0 +1,113 @@
  +#
  +# (C) COPYRIGHT 2012 ARM Limited. All rights reserved.
  +#
  +# This program is free software and is provided to you under the terms of the GNU General Public License version 2
  +# as published by the Free Software Foundation, and any use by you of this program

Re: [Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-06-07 Thread Erik Gilling
On Wed, Jun 6, 2012 at 6:33 AM, John Reitan john.rei...@arm.com wrote:
 But maybe instead of inventing something new, we can just use 'struct
 kthread_work' instead of 'struct kds_callback' plus the two 'void *'s?
  If the user needs some extra args they can embed 'struct
 kthread_work' in their own struct and use container_of() magic in the
 cb.

 Plus this is a natural fit if you want to dispatch callbacks instead
 on a kthread_worker, which seems like it would simplify a few things
 when it comes to deadlock avoidance.. ie., not resource deadlock
 avoidance, but dispatching callbacks when some lock is held.

 That sounds like a better approach.
 Will make a cleaner API, will look into it.

When Tom visited us for android graphics camp in the fall he argued
that there were cases where we would want to avoid an extra schedule.
Consider the case where the GPU is waiting for a render buffer that
the display controller is using.  If that render can be kicked off w/o
acquiring locks, the display's vsync IRQ handler can call release,
which in turn calls the GPU callback, which in turn kicks off the
render very quickly w/o having to leave IRQ context.

One way around the locking issue with callbacks/async wait is to have
async wait return a value to indicate that the resource has been
acquired instead of calling the callback.  This is the approach I
chose in our sync framework.

-Erik
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


RE: [Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-06-07 Thread Tom Cooksey


  The bigger issue is the previous point about how to deal
  with cases where the CPU doesn't really need to get involved as an
  intermediary.
 
  CPU fallback access to the buffer is the only legit case where we
  need a standardized API to userspace (since CPU access isn't already
  associated w/ some other kernel device file where some extra ioctl
  can be added)
 
  The CPU case will still need to wait on an arbitrarily backed sync
  primitive.  It shouldn't need to know if it's backed by the gpu,
  camera, or dsp.
 
  Right, this is the one place we definitely need something.. some
  userspace code would just get passed a dmabuf file descriptor and
  want to mmap it and do something, without really knowing where it
  came from.  I *guess* we'll have to add some ioctl's to the dmabuf
  fd.
 
 I personally favor having sync primitives have their own anon inode
 vs. strictly coupling them with dma_buf.

I think this is really the crux of the matter - do we associate sync
objects with buffers or not. The approach ARM are suggesting _is_ to
associate the sync objects with the buffer and do this by adding
kds_resource* as a member of struct dma_buf. The main reason I want
to do this is that it doesn't require changes to existing
interfaces, specifically DRM/KMS & v4l2. These user/kernel interfaces
already allow userspace to specify the handle of a buffer the driver
should perform an operation on. What dma_buf has done is allowed those
driver-specific buffer handles to be exported from one driver and
imported into another. While new ioctls have been added to the v4l2 &
DRM interfaces for dma_buf, they have only been to allow the import &
export of driver-specific buffer objects. Once imported as a
driver-specific buffer object, existing ioctls are re-used to perform
operations on those buffers (at least this is what PRIME does for DRM,
I'm not so sure about v4l2?). But my point is that no new "page flip
to this dma_buf fd" ioctl has been added to KMS; you use the existing
drm_mode_crtc_page_flip and specify an fb_id which has been imported
from a dma_buf.

If we associate sync objects with buffers, none of those device
specific ioctls which perform operations on buffer objects need to
be modified. It's just that internally, those drivers use kds or
something similar to make sure they don't tread on each other's
toes.

The alternative is to not associate sync objects with buffers and
instead have them be distinct entities exposed to userspace. This gives
userspace more power and flexibility and might allow for use-cases
which an implicit synchronization mechanism can't satisfy - I'd
be curious to know any specifics here. However, every driver which
needs to participate in the synchronization mechanism will need
to have its interface with userspace modified to allow the sync
objects to be passed to the drivers. This seemed like a lot of
work to me, which is why I prefer the implicit approach. However
I don't actually know what work is needed and think it should be
explored. I.e. How much work is it to add explicit sync object
support to the DRM & v4l2 interfaces?

E.g. I believe DRM/GEM's job dispatch API is "in-order"
in which case it might be easy to just add "wait for this fence"
and "signal this fence" ioctls. Seems like vmwgfx already has
something similar to this? Could this work instead of having
to specify a list of sync objects to wait on and another list
of sync objects to signal for every operation (exec buf/page
flip)? What about for v4l2?

I guess my other thought is that implicit vs explicit is not
mutually exclusive, though I'd guess there'd be interesting
deadlocks to have to debug if both were in use _at the same
time_. :-)


Cheers,

Tom




___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

