Re: Introduce a new helper framework for buffer synchronization

2013-05-28 Thread Maarten Lankhorst
Hey,

On 28-05-13 04:49, Inki Dae wrote:

 -Original Message-
 From: Maarten Lankhorst [mailto:maarten.lankho...@canonical.com]
 Sent: Tuesday, May 28, 2013 12:23 AM
 To: Inki Dae
 Cc: 'Daniel Vetter'; 'Rob Clark'; 'linux-fbdev'; 'YoungJun Cho'; 'Kyungmin
 Park'; 'myungjoo.ham'; 'DRI mailing list'; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization

 Hey,

 On 27-05-13 12:38, Inki Dae wrote:
 Hi all,

 I have removed the previous branch and added a new one with more cleanup.
 This time, the fence helper no longer includes the user-side interfaces and
 cache-operation code, because we are not yet sure that coupling those two
 things in the kernel (synchronizing caches and buffer access between CPU and
 CPU, CPU and DMA, and DMA and DMA with fences) is a good idea, and the
 existing user-side code has problems with badly behaved or crashing
 userspace. So this could be discussed further later.

 The below is a new branch,

 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/?h=dma-f
 ence-helper

 And fence helper codes,

 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/commit/?
 h=dma-fence-helperid=adcbc0fe7e285ce866e5816e5e21443dcce01005

 And example codes for device driver,

 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/commit/?
 h=dma-fence-helperid=d2ce7af23835789602a99d0ccef1f53cdd5caaae

 I think the time is not yet ripe for an RFC posting: maybe the existing dma
 fence and reservation code needs more review and additional work. So I'd be
 glad for somebody to give other opinions and advice before the RFC posting.

 NAK.

 For examples of how to handle locking properly, see
 Documentation/ww-mutex-design.txt in my recent tree.
 I could list what I believe is wrong with your implementation, but the real
 problem is that the approach you're taking is wrong.
 I just removed the ticket stubs to show you my approach as simply as
 possible; I wanted to show that we could use a buffer synchronization
 mechanism without ticket stubs.
The tickets have been removed in favor of a ww_context. Moving it in as a base 
primitive
allows more locking abuse to be detected, and makes some other things easier 
too.

 Question: can ww-mutexes be used for all devices? I guess this has a
 dependence on x86 gpus: a gpu has VRAM, which means a different memory domain.
 And could you tell me why a shared fence should have only eight objects? I
 think we could need more than eight objects for read access. Anyway, I don't
 fully understand this yet, so I might be missing something.
Yes, ww mutexes are not limited in any way to x86; they're a locking mechanism.
When you have acquired the ww mutexes for all buffer objects, all that says is
that at that point in time you hold the locks of all bo's exclusively.

After locking everything you can read the fence pointers safely, queue waits, 
and set a
new fence pointer on all reservation_objects. You only need a single fence
on all those objects, so 8 is plenty. Nonetheless this was a limitation of my
earlier design, and I'll dynamically allocate fence_shared in the future.
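
In code, the flow looks roughly like this; reservation_object field names
follow my patch series at the time and may still change, and
queue_wait_on_fence()/install_fence() stand in for whatever the driver uses to
depend on the old fences and publish the new one (the -EDEADLK backoff is left
out for brevity):

#include <linux/reservation.h>	/* header name from the reservation series */

static int lock_and_install_fence(struct reservation_object **resvs, int count,
				  struct fence *new_fence,
				  struct ww_acquire_ctx *ctx)
{
	int i, j;

	/* exclusively acquire the ww_mutex of every reservation_object */
	for (i = 0; i < count; i++)
		ww_mutex_lock(&resvs[i]->lock, ctx);

	for (i = 0; i < count; i++) {
		/* the fence pointers can be read safely now: queue waits
		 * on whatever is still outstanding ... */
		if (resvs[i]->fence_excl)
			queue_wait_on_fence(resvs[i]->fence_excl);
		for (j = 0; j < resvs[i]->fence_shared_count; j++)
			queue_wait_on_fence(resvs[i]->fence_shared[j]);

		/* ... and publish the single new fence for this job */
		install_fence(resvs[i], new_fence);
	}

	for (i = 0; i < count; i++)
		ww_mutex_unlock(&resvs[i]->lock);

	return 0;
}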

~Maarten



Re: Introduce a new helper framework for buffer synchronization

2013-05-28 Thread Daniel Vetter
On Tue, May 28, 2013 at 12:56:57PM +0900, Inki Dae wrote:
 
 
  -Original Message-
  From: linux-fbdev-ow...@vger.kernel.org [mailto:linux-fbdev-
  ow...@vger.kernel.org] On Behalf Of Rob Clark
  Sent: Tuesday, May 28, 2013 12:48 AM
  To: Inki Dae
  Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho; Kyungmin
  Park; myungjoo.ham; DRI mailing list;
 linux-arm-ker...@lists.infradead.org;
  linux-media@vger.kernel.org
  Subject: Re: Introduce a new helper framework for buffer synchronization
  
  On Mon, May 27, 2013 at 6:38 AM, Inki Dae inki@samsung.com wrote:
   Hi all,
  
   I have been removed previous branch and added new one with more cleanup.
   This time, the fence helper doesn't include user side interfaces and
  cache
   operation relevant codes anymore because not only we are not sure that
   coupling those two things, synchronizing caches and buffer access
  between
   CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is
  a
   good idea yet but also existing codes for user side have problems with
  badly
   behaved or crashing userspace. So this could be more discussed later.
  
   The below is a new branch,
  
   https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
  exynos.git/?h=dma-f
   ence-helper
  
   And fence helper codes,
  
   https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
  exynos.git/commit/?
   h=dma-fence-helperid=adcbc0fe7e285ce866e5816e5e21443dcce01005
  
   And example codes for device driver,
  
   https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
  exynos.git/commit/?
   h=dma-fence-helperid=d2ce7af23835789602a99d0ccef1f53cdd5caaae
  
   I think the time is not yet ripe for RFC posting: maybe existing dma
  fence
   and reservation need more review and addition work. So I'd glad for
  somebody
   giving other opinions and advices in advance before RFC posting.
  
  thoughts from a *really* quick, pre-coffee, first look:
  * any sort of helper to simplify single-buffer sort of use-cases (v4l)
  probably wouldn't want to bake in assumption that seqno_fence is used.
  * I guess g2d is probably not actually a simple use case, since I
  expect you can submit blits involving multiple buffers :-P
 
 I don't think so. One or more buffers can be used: a seqno_fence also has
 only one buffer. Actually, we have already applied this approach to most
 devices: multimedia, gpu and display controller. And this approach shows
 better performance and reduced power consumption compared to the traditional
 way. The g2d example is just to show you how to apply my approach to a device
 driver.

Note that seqno_fence is an implementation pattern for a certain type of
direct hw->hw synchronization which uses a shared dma_buf to exchange the
sync cookie. The dma_buf attached to the seqno_fence has _nothing_ to do
with the dma_buf the fence actually coordinates access to.
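
For reference, a seqno_fence is roughly the following shape (field names as in
the fence patch series at the time, so they may differ); sync_buf is only the
transport for the sequence-number cookie, not the buffer being protected:

struct seqno_fence {
	struct fence base;
	const struct fence_ops *ops;	/* hw-specific enable_signaling etc. */
	struct dma_buf *sync_buf;	/* shared buffer the hw writes seqnos to */
	uint32_t seqno_ofs;		/* offset of the seqno word in sync_buf */
};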

I think that confusion is a large reason why Maarten and I don't
understand what you want to achieve with your fence helpers. Currently
they're using the seqno_fence, but totally not in the way the seqno_fence
was meant to be used.

Note that with the current code there is only a pointer from dma_bufs to
the fence. The fence itself has _no_ pointer to the dma_buf it syncs. This
shouldn't be a problem since the fence fastpath for already signalled
fences is completely barrier- and lock-free (it's just a load+bit-test), and
fences are meant to be embedded into whatever dma tracking structure you
already have, so no overhead there. The only ugly part is the fence
refcounting, but I don't think we can drop that.
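
For reference, the signalled fastpath really is just a load plus a bit test,
something like this (the flag name follows the fence patches and may differ):

#include <linux/fence.h>	/* header name from the fence patch series */

static inline bool fence_signaled_fast(struct fence *fence)
{
	/* no locks, no barriers: plain load and bit test */
	return test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags);
}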

Note that you completely reinvent this part of Maarten's fence patches by
adding new r/w_complete completions to the reservation object, which
completely replaces the fence stuff.

Also note that a list of reservation entries is again meant to be used
only when submitting the dma to the gpu. With your patches you seem to
hang onto that list until dma completes. This has the ugly side-effect
that you need to allocate these reservation entries with kmalloc, whereas
if you just use them in the execbuf ioctl to construct the dma you can
usually embed it into something else you need already anyway. At least
i915 and ttm based drivers can work that way.

Furthermore fences are specifically constructed as frankenstein-monsters
between completion/waitqueues and callbacks. All the different use-cases
need the different aspects:
- busy/idle checks and bo retiring need the completion semantics
- callbacks (in interrupt context) are used for hybrid hw -> irq handler -> hw
  sync approaches
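
In driver terms the two styles come out roughly as below; fence_wait() and
fence_add_callback() follow Maarten's series and may change, while struct
my_bo, struct my_engine and my_engine_kick() are made-up driver bits:

#include <linux/fence.h>	/* header name from the fence patch series */

struct my_bo {
	struct fence *last_fence;
};

struct my_engine {
	struct fence_cb cb;
	/* ... ring state ... */
};

void my_engine_kick(struct my_engine *engine);	/* made-up driver hook */

/* completion style: block until the bo's last fence has signalled
 * (busy/idle checks, bo retiring) */
static long my_bo_wait_idle(struct my_bo *bo)
{
	return fence_wait(bo->last_fence, true /* interruptible */);
}

/* callback style: kick the next job from the signalling path, which is
 * what the hybrid hw -> irq handler -> hw sync wants */
static void my_fence_cb(struct fence *fence, struct fence_cb *cb)
{
	struct my_engine *engine = container_of(cb, struct my_engine, cb);

	my_engine_kick(engine);
}

static int my_engine_depend_on(struct my_engine *engine, struct fence *fence)
{
	return fence_add_callback(fence, &engine->cb, my_fence_cb);
}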

 
  * otherwise, you probably don't want to depend on dmabuf, which is why
  reservation/fence is split out the way it is..  you want to be able to
  use a single reservation/fence mechanism within your driver without
  having to care about which buffers are exported to dmabuf's and which
  are not.  Creating a dmabuf for every GEM bo is too heavyweight.
 
 Right. But I think we should deal with this separately

Re: Introduce a new helper framework for buffer synchronization

2013-05-28 Thread Rob Clark
On Mon, May 27, 2013 at 11:56 PM, Inki Dae inki@samsung.com wrote:


 -Original Message-
 From: linux-fbdev-ow...@vger.kernel.org [mailto:linux-fbdev-
 ow...@vger.kernel.org] On Behalf Of Rob Clark
 Sent: Tuesday, May 28, 2013 12:48 AM
 To: Inki Dae
 Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho; Kyungmin
 Park; myungjoo.ham; DRI mailing list;
 linux-arm-ker...@lists.infradead.org;
 linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization

 On Mon, May 27, 2013 at 6:38 AM, Inki Dae inki@samsung.com wrote:
  Hi all,
 
  I have been removed previous branch and added new one with more cleanup.
  This time, the fence helper doesn't include user side interfaces and
 cache
  operation relevant codes anymore because not only we are not sure that
  coupling those two things, synchronizing caches and buffer access
 between
  CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is
 a
  good idea yet but also existing codes for user side have problems with
 badly
  behaved or crashing userspace. So this could be more discussed later.
 
  The below is a new branch,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/?h=dma-f
  ence-helper
 
  And fence helper codes,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/commit/?
  h=dma-fence-helperid=adcbc0fe7e285ce866e5816e5e21443dcce01005
 
  And example codes for device driver,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/commit/?
  h=dma-fence-helperid=d2ce7af23835789602a99d0ccef1f53cdd5caaae
 
  I think the time is not yet ripe for RFC posting: maybe existing dma
 fence
  and reservation need more review and addition work. So I'd glad for
 somebody
  giving other opinions and advices in advance before RFC posting.

 thoughts from a *really* quick, pre-coffee, first look:
 * any sort of helper to simplify single-buffer sort of use-cases (v4l)
 probably wouldn't want to bake in assumption that seqno_fence is used.
 * I guess g2d is probably not actually a simple use case, since I
 expect you can submit blits involving multiple buffers :-P

 I don't think so. One and more buffers can be used: seqno_fence also has
 only one buffer. Actually, we have already applied this approach to most
 devices; multimedia, gpu and display controller. And this approach shows
 more performance; reduced power consumption against traditional way. And g2d
 example is just to show you how to apply my approach to device driver.

no, you need the ww-mutex / reservation stuff any time you have
multiple independent devices (or rings/contexts for hw that can
support multiple contexts) which can do operations with multiple
buffers.  So you could conceivably hit this w/ gpu + g2d if multiple
buffers were shared between the two.  vram migration and such
'desktop stuff' might make the problem worse, but just because you
don't have vram doesn't mean you don't have a problem with multiple
buffers.

 * otherwise, you probably don't want to depend on dmabuf, which is why
 reservation/fence is split out the way it is..  you want to be able to
 use a single reservation/fence mechanism within your driver without
 having to care about which buffers are exported to dmabuf's and which
 are not.  Creating a dmabuf for every GEM bo is too heavyweight.

 Right. But I think we should deal with this separately. Actually, we are
 trying to use reservation for gpu pipeline synchronization, such as the sgx
 sync object, and this approach is used without dmabuf. In other words, some
 devices can use only reservation for such pipeline synchronization and, at the
 same time, the fence helper or something similar with dmabuf for buffer
 synchronization.

it is probably easier to approach from the reverse direction.. ie, get
reservation/synchronization right first, and then dmabuf.  (Well, that
isn't really a problem because Maarten's reservation/fence patches
support dmabuf from the beginning.)

BR,
-R


 I'm not entirely sure if reservation/fence could/should be made any
 simpler for multi-buffer users.  Probably the best thing to do is just
 get reservation/fence rolled out in a few drivers and see if some
 common patterns emerge.

 BR,
 -R

 
  Thanks,
  Inki Dae
 



RE: Introduce a new helper framework for buffer synchronization

2013-05-28 Thread Inki Dae

Hi Daniel,

Thank you so much, that was very useful. :) Sorry, but could you give me more
comments on my comments below? There are still things that confuse me. :(


 -Original Message-
 From: Daniel Vetter [mailto:daniel.vet...@ffwll.ch] On Behalf Of Daniel
 Vetter
 Sent: Tuesday, May 28, 2013 7:33 PM
 To: Inki Dae
 Cc: 'Rob Clark'; 'Maarten Lankhorst'; 'Daniel Vetter'; 'linux-fbdev';
 'YoungJun Cho'; 'Kyungmin Park'; 'myungjoo.ham'; 'DRI mailing list';
 linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 On Tue, May 28, 2013 at 12:56:57PM +0900, Inki Dae wrote:
 
 
   -Original Message-
   From: linux-fbdev-ow...@vger.kernel.org [mailto:linux-fbdev-
   ow...@vger.kernel.org] On Behalf Of Rob Clark
   Sent: Tuesday, May 28, 2013 12:48 AM
   To: Inki Dae
   Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho;
 Kyungmin
   Park; myungjoo.ham; DRI mailing list;
  linux-arm-ker...@lists.infradead.org;
   linux-media@vger.kernel.org
   Subject: Re: Introduce a new helper framework for buffer
 synchronization
  
   On Mon, May 27, 2013 at 6:38 AM, Inki Dae inki@samsung.com
wrote:
Hi all,
   
I have been removed previous branch and added new one with more
 cleanup.
This time, the fence helper doesn't include user side interfaces and
   cache
operation relevant codes anymore because not only we are not sure
 that
coupling those two things, synchronizing caches and buffer access
   between
CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel
 side is
   a
good idea yet but also existing codes for user side have problems
 with
   badly
behaved or crashing userspace. So this could be more discussed
later.
   
The below is a new branch,
   
https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
   exynos.git/?h=dma-f
ence-helper
   
And fence helper codes,
   
https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
   exynos.git/commit/?
h=dma-fence-helperid=adcbc0fe7e285ce866e5816e5e21443dcce01005
   
And example codes for device driver,
   
https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
   exynos.git/commit/?
h=dma-fence-helperid=d2ce7af23835789602a99d0ccef1f53cdd5caaae
   
I think the time is not yet ripe for RFC posting: maybe existing dma
   fence
and reservation need more review and addition work. So I'd glad for
   somebody
giving other opinions and advices in advance before RFC posting.
  
   thoughts from a *really* quick, pre-coffee, first look:
   * any sort of helper to simplify single-buffer sort of use-cases (v4l)
   probably wouldn't want to bake in assumption that seqno_fence is used.
   * I guess g2d is probably not actually a simple use case, since I
   expect you can submit blits involving multiple buffers :-P
 
  I don't think so. One and more buffers can be used: seqno_fence also has
  only one buffer. Actually, we have already applied this approach to most
  devices; multimedia, gpu and display controller. And this approach shows
  more performance; reduced power consumption against traditional way. And
 g2d
  example is just to show you how to apply my approach to device driver.
 
 Note that seqno_fence is an implementation pattern for a certain type of
 direct hw->hw synchronization which uses a shared dma_buf to exchange the
 sync cookie.

I'm afraid I don't understand hw->hw synchronization. Does hw->hw
synchronization mean that the device has a hardware feature which supports
buffer synchronization internally in hardware? And what is the sync cookie?

 The dma_buf attached to the seqno_fence has _nothing_ to do
 with the dma_buf the fence actually coordinates access to.
 
 I think that confusion is a large reason why Maarten and I don't
 understand what you want to achieve with your fence helpers. Currently
 they're using the seqno_fence, but totally not in a way the seqno_fence
 was meant to be used.
 
 Note that with the current code there is only a pointer from dma_bufs to
 the fence. The fence itself has _no_ pointer to the dma_buf it syncs. This
 shouldn't be a problem since the fence fastpath for already signalled
 fences is completely barrier- and lock-free (it's just a load+bit-test), and
 fences are meant to be embedded into whatever dma tracking structure you
 already have, so no overhead there. The only ugly part is the fence
 refcounting, but I don't think we can drop that.

The below is the proposed way:
a dma device has to create a fence before accessing a shared buffer, and then
check whether other dma is currently accessing the shared buffer; if so, the
dma device should be blocked, and then it sets the fence on the reservation
object of the shared buffer. Then the dma begins accessing the shared buffer.
Finally, the dma signals its own fence so that other blocked users can be
woken up. However, if there was another dma blocked before signaling

RE: Introduce a new helper framework for buffer synchronization

2013-05-28 Thread Inki Dae


 -Original Message-
 From: linux-fbdev-ow...@vger.kernel.org [mailto:linux-fbdev-
 ow...@vger.kernel.org] On Behalf Of Rob Clark
 Sent: Tuesday, May 28, 2013 10:49 PM
 To: Inki Dae
 Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho; Kyungmin
 Park; myungjoo.ham; DRI mailing list;
linux-arm-ker...@lists.infradead.org;
 linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 On Mon, May 27, 2013 at 11:56 PM, Inki Dae inki@samsung.com wrote:
 
 
  -Original Message-
  From: linux-fbdev-ow...@vger.kernel.org [mailto:linux-fbdev-
  ow...@vger.kernel.org] On Behalf Of Rob Clark
  Sent: Tuesday, May 28, 2013 12:48 AM
  To: Inki Dae
  Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho;
 Kyungmin
  Park; myungjoo.ham; DRI mailing list;
  linux-arm-ker...@lists.infradead.org;
  linux-media@vger.kernel.org
  Subject: Re: Introduce a new helper framework for buffer
 synchronization
 
  On Mon, May 27, 2013 at 6:38 AM, Inki Dae inki@samsung.com wrote:
   Hi all,
  
   I have been removed previous branch and added new one with more
 cleanup.
   This time, the fence helper doesn't include user side interfaces and
  cache
   operation relevant codes anymore because not only we are not sure
 that
   coupling those two things, synchronizing caches and buffer access
  between
   CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side
 is
  a
   good idea yet but also existing codes for user side have problems
 with
  badly
   behaved or crashing userspace. So this could be more discussed later.
  
   The below is a new branch,
  
   https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
  exynos.git/?h=dma-f
   ence-helper
  
   And fence helper codes,
  
   https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
  exynos.git/commit/?
   h=dma-fence-helperid=adcbc0fe7e285ce866e5816e5e21443dcce01005
  
   And example codes for device driver,
  
   https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
  exynos.git/commit/?
   h=dma-fence-helperid=d2ce7af23835789602a99d0ccef1f53cdd5caaae
  
   I think the time is not yet ripe for RFC posting: maybe existing dma
  fence
   and reservation need more review and addition work. So I'd glad for
  somebody
   giving other opinions and advices in advance before RFC posting.
 
  thoughts from a *really* quick, pre-coffee, first look:
  * any sort of helper to simplify single-buffer sort of use-cases (v4l)
  probably wouldn't want to bake in assumption that seqno_fence is used.
  * I guess g2d is probably not actually a simple use case, since I
  expect you can submit blits involving multiple buffers :-P
 
  I don't think so. One and more buffers can be used: seqno_fence also has
  only one buffer. Actually, we have already applied this approach to most
  devices; multimedia, gpu and display controller. And this approach shows
  more performance; reduced power consumption against traditional way. And
 g2d
  example is just to show you how to apply my approach to device driver.
 
 no, you need the ww-mutex / reservation stuff any time you have
 multiple independent devices (or rings/contexts for hw that can
 support multiple contexts) which can do operations with multiple
 buffers.

I think I already use the reservation stuff in that way, except for the
ww-mutex. And I'm not sure that an embedded system really needs ww-mutexes. If
there is any such case, could you tell me about it? I really need more advice
and understanding :)

Thanks,
Inki Dae

  So you could conceivably hit this w/ gpu + g2d if multiple
 buffers where shared between the two.  vram migration and such
 'desktop stuff' might make the problem worse, but just because you
 don't have vram doesn't mean you don't have a problem with multiple
 buffers.
 
  * otherwise, you probably don't want to depend on dmabuf, which is why
  reservation/fence is split out the way it is..  you want to be able to
  use a single reservation/fence mechanism within your driver without
  having to care about which buffers are exported to dmabuf's and which
  are not.  Creating a dmabuf for every GEM bo is too heavyweight.
 
  Right. But I think we should dealt with this separately. Actually, we
 are
  trying to use reservation for gpu pipe line synchronization such as sgx
 sync
  object and this approach is used without dmabuf. In order words, some
 device
  can use only reservation for such pipe line synchronization and at the
 same
  time, fence helper or similar thing with dmabuf for buffer
 synchronization.
 
 it is probably easier to approach from the reverse direction.. ie, get
 reservation/synchronization right first, and then dmabuf.  (Well, that
 isn't really a problem because Maarten's reservation/fence patches
 support dmabuf from the beginning.)
 
 BR,
 -R
 
 
  I'm not entirely sure if reservation/fence could/should be made any
  simpler for multi-buffer users.  Probably the best thing to do is just
  get reservation/fence

Re: Introduce a new helper framework for buffer synchronization

2013-05-28 Thread Daniel Vetter
On Tue, May 28, 2013 at 4:50 PM, Inki Dae inki@samsung.com wrote:
 I think I already used reservation stuff any time in that way except
 ww-mutex. And I'm not sure that embedded system really needs ww-mutex. If
 there is any case,
 could you tell me the case? I really need more advice and understanding :)

If you have only one driver, you can get away without ww_mutex.
drm/i915 does it, all buffer state is protected by dev->struct_mutex.
But as soon as you have multiple drivers sharing buffers with dma_buf
things will blow up.

Yep, current prime is broken and can lead to deadlocks.

In practice it doesn't (yet) matter since only the X server does the
sharing dance, and that one's single-threaded. Now you can claim that
since you have all buffers pinned in embedded gfx anyway, you don't
care. But both in desktop gfx and embedded gfx the real fun starts
once you put fences into the mix and link them up with buffers, then
every command submission risks that deadlock. Furthermore you can get
unlucky and construct a circle of fences waiting on each other (this only
happens though if the fence signalling fires off the next batchbuffer
asynchronously).

To prevent such deadlocks you _absolutely_ need to lock _all_ buffers
that take part in a command submission at once. To do that you either
need a global lock (ugh) or ww_mutexes.
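
The locking pattern itself is the one documented there; adapted to an array of
buffer objects it looks roughly like this (struct my_bo and my_bo_ww_class are
placeholders, and the header location was still settling at the time):

#include <linux/ww_mutex.h>

static DEFINE_WW_CLASS(my_bo_ww_class);

struct my_bo {
	struct ww_mutex lock;
	/* ... the rest of the driver's buffer object ... */
};

/* lock every bo taking part in one submission, backing off and retrying
 * on -EDEADLK so that no deadlock can happen */
static int my_lock_all(struct my_bo **bos, int count,
		       struct ww_acquire_ctx *ctx)
{
	struct my_bo *contended = NULL;
	int i, ret;

	ww_acquire_init(ctx, &my_bo_ww_class);
retry:
	for (i = 0; i < count; i++) {
		if (bos[i] == contended) {
			contended = NULL;	/* already locked via lock_slow */
			continue;
		}
		ret = ww_mutex_lock(&bos[i]->lock, ctx);
		if (ret) {
			struct my_bo *busy = bos[i];

			/* unwind everything taken in this pass ... */
			while (i--)
				ww_mutex_unlock(&bos[i]->lock);
			/* ... plus a slow-locked bo we hadn't reached yet */
			if (contended)
				ww_mutex_unlock(&contended->lock);

			if (ret != -EDEADLK) {
				/* only with the _interruptible variant */
				ww_acquire_fini(ctx);
				return ret;
			}
			/* we lost the wait/wound race: sleep on the contended
			 * lock, then retry the whole set */
			ww_mutex_lock_slow(&busy->lock, ctx);
			contended = busy;
			goto retry;
		}
	}
	ww_acquire_done(ctx);
	return 0;
}

After the job is queued (or the dma has completed) each lock is dropped again
with ww_mutex_unlock() and the context released with ww_acquire_fini().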

So ww_mutexes are the fundamental ingredient of all this, if you don't
see why you need them then everything piled on top is broken. I think
until you've understood why exactly we need ww_mutexes there's not
much point in discussing the finer issues of fences, reservation
objects and how to integrate it with dma_bufs exactly.

I'll try to clarify the motivating example in the ww_mutex
documentation a bit, but I dunno how else I could explain this ...

Yours, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


RE: Introduce a new helper framework for buffer synchronization

2013-05-28 Thread Inki Dae


 -Original Message-
 From: daniel.vet...@ffwll.ch [mailto:daniel.vet...@ffwll.ch] On Behalf Of
 Daniel Vetter
 Sent: Wednesday, May 29, 2013 1:50 AM
 To: Inki Dae
 Cc: Rob Clark; Maarten Lankhorst; linux-fbdev; YoungJun Cho; Kyungmin
Park;
 myungjoo.ham; DRI mailing list; linux-arm-ker...@lists.infradead.org;
 linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 On Tue, May 28, 2013 at 4:50 PM, Inki Dae inki@samsung.com wrote:
  I think I already used reservation stuff any time in that way except
  ww-mutex. And I'm not sure that embedded system really needs ww-mutex.
 If
  there is any case,
  could you tell me the case? I really need more advice and
 understanding :)
 
 If you have only one driver, you can get away without ww_mutex.
 drm/i915 does it, all buffer state is protected by dev->struct_mutex.
 But as soon as you have multiple drivers sharing buffers with dma_buf
 things will blow up.
 
 Yep, current prime is broken and can lead to deadlocks.
 
 In practice it doesn't (yet) matter since only the X server does the
 sharing dance, and that one's single-threaded. Now you can claim that
 since you have all buffers pinned in embedded gfx anyway, you don't
 care. But both in desktop gfx and embedded gfx the real fun starts
 once you put fences into the mix and link them up with buffers, then
 every command submission risks that deadlock. Furthermore you can get
 unlucky and construct a circle of fences waiting on each another (only
 though if the fence signalling fires off the next batchbuffer
 asynchronously).

In our case, we haven't ever experienced a deadlock yet, but it is still
possible to hit one when a process shares two buffers with another process,
like below:
Process A commits buffer A and waits for buffer B,
Process B commits buffer B and waits for buffer A.

That is a deadlock, and it seems you are saying that we can resolve the
deadlock issue with ww-mutexes. It also seems that we can replace our
block/wake-up mechanism with mutex locking for better performance.

 
 To prevent such deadlocks you _absolutely_ need to lock _all_ buffers
 that take part in a command submission at once. To do that you either
 need a global lock (ugh) or ww_mutexes.
 
 So ww_mutexes are the fundamental ingredient of all this, if you don't
 see why you need them then everything piled on top is broken. I think
 until you've understood why exactly we need ww_mutexes there's not
 much point in discussing the finer issues of fences, reservation
 objects and how to integrate it with dma_bufs exactly.
 
 I'll try to clarify the motivating example in the ww_mutex
 documentation a bit, but I dunno how else I could explain this ...
 

I really don't want you to waste your time on me. I will try to apply
ww-mutexes (v4) to the proposed framework to understand them better.

Thanks for your advice. :)
Inki Dae

 Yours, Daniel
 --
 Daniel Vetter
 Software Engineer, Intel Corporation
 +41 (0) 79 365 57 48 - http://blog.ffwll.ch



RE: Introduce a new helper framework for buffer synchronization

2013-05-27 Thread Inki Dae
Hi all,

I have removed the previous branch and added a new one with more cleanup.
This time, the fence helper no longer includes the user-side interfaces and
cache-operation code, because we are not yet sure that coupling those two
things in the kernel (synchronizing caches and buffer access between CPU and
CPU, CPU and DMA, and DMA and DMA with fences) is a good idea, and the
existing user-side code has problems with badly behaved or crashing userspace.
So this could be discussed further later.

The new branch is below:

https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/?h=dma-fence-helper

And the fence helper code:

https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?h=dma-fence-helper&id=adcbc0fe7e285ce866e5816e5e21443dcce01005

And example code for a device driver:

https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?h=dma-fence-helper&id=d2ce7af23835789602a99d0ccef1f53cdd5caaae

I think the time is not yet ripe for an RFC posting: maybe the existing dma
fence and reservation code needs more review and additional work. So I'd be
glad for somebody to give other opinions and advice before the RFC posting.

Thanks,
Inki Dae



Re: Introduce a new helper framework for buffer synchronization

2013-05-27 Thread Maarten Lankhorst
Hey,

On 27-05-13 12:38, Inki Dae wrote:
 Hi all,

 I have been removed previous branch and added new one with more cleanup.
 This time, the fence helper doesn't include user side interfaces and cache
 operation relevant codes anymore because not only we are not sure that
 coupling those two things, synchronizing caches and buffer access between
 CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is a
 good idea yet but also existing codes for user side have problems with badly
 behaved or crashing userspace. So this could be more discussed later.

 The below is a new branch,
   
 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/?h=dma-f
 ence-helper

 And fence helper codes,
   
 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
 h=dma-fence-helperid=adcbc0fe7e285ce866e5816e5e21443dcce01005

 And example codes for device driver,
   
 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
 h=dma-fence-helperid=d2ce7af23835789602a99d0ccef1f53cdd5caaae

 I think the time is not yet ripe for RFC posting: maybe existing dma fence
 and reservation need more review and addition work. So I'd glad for somebody
 giving other opinions and advices in advance before RFC posting.

NAK.

For examples of how to handle locking properly, see
Documentation/ww-mutex-design.txt in my recent tree.
I could list what I believe is wrong with your implementation, but the real
problem is that the approach you're taking is wrong.

~Maarten


Re: Introduce a new helper framework for buffer synchronization

2013-05-27 Thread Rob Clark
On Mon, May 27, 2013 at 6:38 AM, Inki Dae inki@samsung.com wrote:
 Hi all,

 I have been removed previous branch and added new one with more cleanup.
 This time, the fence helper doesn't include user side interfaces and cache
 operation relevant codes anymore because not only we are not sure that
 coupling those two things, synchronizing caches and buffer access between
 CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is a
 good idea yet but also existing codes for user side have problems with badly
 behaved or crashing userspace. So this could be more discussed later.

 The below is a new branch,

 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/?h=dma-f
 ence-helper

 And fence helper codes,

 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
 h=dma-fence-helperid=adcbc0fe7e285ce866e5816e5e21443dcce01005

 And example codes for device driver,

 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
 h=dma-fence-helperid=d2ce7af23835789602a99d0ccef1f53cdd5caaae

 I think the time is not yet ripe for RFC posting: maybe existing dma fence
 and reservation need more review and addition work. So I'd glad for somebody
 giving other opinions and advices in advance before RFC posting.

thoughts from a *really* quick, pre-coffee, first look:
* any sort of helper to simplify single-buffer sorts of use-cases (v4l)
probably wouldn't want to bake in the assumption that seqno_fence is used.
* I guess g2d is probably not actually a simple use case, since I
expect you can submit blits involving multiple buffers :-P
* otherwise, you probably don't want to depend on dmabuf, which is why
reservation/fence is split out the way it is..  you want to be able to
use a single reservation/fence mechanism within your driver without
having to care about which buffers are exported to dmabuf's and which
are not.  Creating a dmabuf for every GEM bo is too heavyweight.

I'm not entirely sure if reservation/fence could/should be made any
simpler for multi-buffer users.  Probably the best thing to do is just
get reservation/fence rolled out in a few drivers and see if some
common patterns emerge.

BR,
-R


 Thanks,
 Inki Dae



Re: Introduce a new helper framework for buffer synchronization

2013-05-27 Thread Daniel Vetter
On Mon, May 27, 2013 at 5:47 PM, Rob Clark robdcl...@gmail.com wrote:
 On Mon, May 27, 2013 at 6:38 AM, Inki Dae inki@samsung.com wrote:
 Hi all,

 I have been removed previous branch and added new one with more cleanup.
 This time, the fence helper doesn't include user side interfaces and cache
 operation relevant codes anymore because not only we are not sure that
 coupling those two things, synchronizing caches and buffer access between
 CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is a
 good idea yet but also existing codes for user side have problems with badly
 behaved or crashing userspace. So this could be more discussed later.

 The below is a new branch,

 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/?h=dma-f
 ence-helper

 And fence helper codes,

 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
 h=dma-fence-helperid=adcbc0fe7e285ce866e5816e5e21443dcce01005

 And example codes for device driver,

 https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/commit/?
 h=dma-fence-helperid=d2ce7af23835789602a99d0ccef1f53cdd5caaae

 I think the time is not yet ripe for RFC posting: maybe existing dma fence
 and reservation need more review and addition work. So I'd glad for somebody
 giving other opinions and advices in advance before RFC posting.

 thoughts from a *really* quick, pre-coffee, first look:
 * any sort of helper to simplify single-buffer sort of use-cases (v4l)
 probably wouldn't want to bake in assumption that seqno_fence is used.

Yeah, which is why Maarten and I already discussed ideas for what needs to
be improved in the current dma-buf interface code to make this Just
Work. At least as long as a driver doesn't want to add new fences,
which would be especially useful for all kinds of gpu access.

 * I guess g2d is probably not actually a simple use case, since I
 expect you can submit blits involving multiple buffers :-P

Yeah, on a quick read the current fence helper code seems to be a bit
limited in scope.

 * otherwise, you probably don't want to depend on dmabuf, which is why
 reservation/fence is split out the way it is..  you want to be able to
 use a single reservation/fence mechanism within your driver without
 having to care about which buffers are exported to dmabuf's and which
 are not.  Creating a dmabuf for every GEM bo is too heavyweight.

That's pretty much the reason that reservations are free-standing from
dma_bufs. The idea is to embed them into the gem/ttm/v4l buffer
object. Maarten also has some helpers to keep track of multi-buffer
ww_mutex locking and fence attaching in his reservation helpers, but I
think we should wait with those until we have drivers using them.
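
I.e. the embedding would look something like this (struct my_gem_bo is only
illustration; the reservation.h header name follows Maarten's series):

#include <drm/drmP.h>
#include <linux/reservation.h>

struct my_gem_bo {
	struct drm_gem_object base;
	struct reservation_object resv;	/* ww_mutex + fence pointers */
	/* ... driver-specific state ... */
};

Locking and fencing always operate on &bo->resv, whether or not the bo ever
gets exported; an exporter only ever hands out a pointer to it.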

For now I think the priority should be to get the basic stuff in and
ttm as the first user established. Then we can go nuts later on.

 I'm not entirely sure if reservation/fence could/should be made any
 simpler for multi-buffer users.  Probably the best thing to do is just
 get reservation/fence rolled out in a few drivers and see if some
 common patterns emerge.

I think we can make the 1-buffer-per-dma-op case (i.e. a 1:1
dma_buf->reservation : fence mapping) work fairly simply in dma_buf
with maybe dma_buf_attachment_start_dma/end_dma helpers. But there's
also still the open issue that currently there's no way to flush cpu caches
for dma access without unmapping the attachment (or resorting to
tricks which might not work), so we have a few gaping holes in the
interface already anyway.
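
The start_dma/end_dma idea would amount to something like the following,
purely hypothetical, pair; neither helper exists, this is just the shape of
the idea:

/* take the attachment's reservation ww_mutex, wait for (or queue behind) the
 * fences already installed, and publish @fence as the new one for this op */
int dma_buf_attachment_start_dma(struct dma_buf_attachment *attach,
				 struct fence *fence,
				 enum dma_data_direction dir);

/* counterpart, called once the dma has completed and @fence is signalled */
void dma_buf_attachment_end_dma(struct dma_buf_attachment *attach,
				struct fence *fence);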
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


RE: Introduce a new helper framework for buffer synchronization

2013-05-27 Thread Inki Dae


 -Original Message-
 From: Maarten Lankhorst [mailto:maarten.lankho...@canonical.com]
 Sent: Tuesday, May 28, 2013 12:23 AM
 To: Inki Dae
 Cc: 'Daniel Vetter'; 'Rob Clark'; 'linux-fbdev'; 'YoungJun Cho'; 'Kyungmin
 Park'; 'myungjoo.ham'; 'DRI mailing list'; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 Hey,
 
 On 27-05-13 12:38, Inki Dae wrote:
  Hi all,
 
  I have been removed previous branch and added new one with more cleanup.
  This time, the fence helper doesn't include user side interfaces and
 cache
  operation relevant codes anymore because not only we are not sure that
  coupling those two things, synchronizing caches and buffer access
 between
  CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is
 a
  good idea yet but also existing codes for user side have problems with
 badly
  behaved or crashing userspace. So this could be more discussed later.
 
  The below is a new branch,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/?h=dma-f
  ence-helper
 
  And fence helper codes,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/commit/?
  h=dma-fence-helperid=adcbc0fe7e285ce866e5816e5e21443dcce01005
 
  And example codes for device driver,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/commit/?
  h=dma-fence-helperid=d2ce7af23835789602a99d0ccef1f53cdd5caaae
 
  I think the time is not yet ripe for RFC posting: maybe existing dma
 fence
  and reservation need more review and addition work. So I'd glad for
 somebody
  giving other opinions and advices in advance before RFC posting.
 
 NAK.
 
 For examples for how to handle locking properly, see Documentation/ww-
 mutex-design.txt in my recent tree.
 I could list what I believe is wrong with your implementation, but real
 problem is that the approach you're taking is wrong.

I just removed the ticket stubs to show you my approach as simply as
possible; I wanted to show that we could use a buffer synchronization
mechanism without ticket stubs.

Question: can ww-mutexes be used for all devices? I guess this has a
dependence on x86 gpus: a gpu has VRAM, which means a different memory domain.
And could you tell me why a shared fence should have only eight objects? I
think we could need more than eight objects for read access. Anyway, I don't
fully understand this yet, so I might be missing something.

Thanks,
Inki Dae

 

 ~Maarten



RE: Introduce a new helper framework for buffer synchronization

2013-05-27 Thread Inki Dae


 -Original Message-
 From: linux-fbdev-ow...@vger.kernel.org [mailto:linux-fbdev-
 ow...@vger.kernel.org] On Behalf Of Rob Clark
 Sent: Tuesday, May 28, 2013 12:48 AM
 To: Inki Dae
 Cc: Maarten Lankhorst; Daniel Vetter; linux-fbdev; YoungJun Cho; Kyungmin
 Park; myungjoo.ham; DRI mailing list;
linux-arm-ker...@lists.infradead.org;
 linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 On Mon, May 27, 2013 at 6:38 AM, Inki Dae inki@samsung.com wrote:
  Hi all,
 
  I have been removed previous branch and added new one with more cleanup.
  This time, the fence helper doesn't include user side interfaces and
 cache
  operation relevant codes anymore because not only we are not sure that
  coupling those two things, synchronizing caches and buffer access
 between
  CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side is
 a
  good idea yet but also existing codes for user side have problems with
 badly
  behaved or crashing userspace. So this could be more discussed later.
 
  The below is a new branch,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/?h=dma-f
  ence-helper
 
  And fence helper codes,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/commit/?
  h=dma-fence-helperid=adcbc0fe7e285ce866e5816e5e21443dcce01005
 
  And example codes for device driver,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/commit/?
  h=dma-fence-helperid=d2ce7af23835789602a99d0ccef1f53cdd5caaae
 
  I think the time is not yet ripe for RFC posting: maybe existing dma
 fence
  and reservation need more review and addition work. So I'd glad for
 somebody
  giving other opinions and advices in advance before RFC posting.
 
 thoughts from a *really* quick, pre-coffee, first look:
 * any sort of helper to simplify single-buffer sort of use-cases (v4l)
 probably wouldn't want to bake in assumption that seqno_fence is used.
 * I guess g2d is probably not actually a simple use case, since I
 expect you can submit blits involving multiple buffers :-P

I don't think so. One or more buffers can be used: a seqno_fence also has
only one buffer. Actually, we have already applied this approach to most
devices: multimedia, gpu and display controller. And this approach shows
better performance and reduced power consumption compared to the traditional
way. The g2d example is just to show you how to apply my approach to a device
driver.

 * otherwise, you probably don't want to depend on dmabuf, which is why
 reservation/fence is split out the way it is..  you want to be able to
 use a single reservation/fence mechanism within your driver without
 having to care about which buffers are exported to dmabuf's and which
 are not.  Creating a dmabuf for every GEM bo is too heavyweight.

Right. But I think we should deal with this separately. Actually, we are
trying to use reservation for gpu pipeline synchronization, such as the sgx
sync object, and this approach is used without dmabuf. In other words, some
devices can use only reservation for such pipeline synchronization and, at the
same time, the fence helper or something similar with dmabuf for buffer
synchronization.

 
 I'm not entirely sure if reservation/fence could/should be made any
 simpler for multi-buffer users.  Probably the best thing to do is just
 get reservation/fence rolled out in a few drivers and see if some
 common patterns emerge.
 
 BR,
 -R
 
 
  Thanks,
  Inki Dae
 



RE: Introduce a new helper framework for buffer synchronization

2013-05-27 Thread Inki Dae


 -Original Message-
 From: daniel.vet...@ffwll.ch [mailto:daniel.vet...@ffwll.ch] On Behalf Of
 Daniel Vetter
 Sent: Tuesday, May 28, 2013 1:02 AM
 To: Rob Clark
 Cc: Inki Dae; Maarten Lankhorst; linux-fbdev; YoungJun Cho; Kyungmin Park;
 myungjoo.ham; DRI mailing list; linux-arm-ker...@lists.infradead.org;
 linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 On Mon, May 27, 2013 at 5:47 PM, Rob Clark robdcl...@gmail.com wrote:
  On Mon, May 27, 2013 at 6:38 AM, Inki Dae inki@samsung.com wrote:
  Hi all,
 
  I have been removed previous branch and added new one with more
cleanup.
  This time, the fence helper doesn't include user side interfaces and
 cache
  operation relevant codes anymore because not only we are not sure that
  coupling those two things, synchronizing caches and buffer access
 between
  CPU and CPU, CPU and DMA, and DMA and DMA with fences, in kernel side
 is a
  good idea yet but also existing codes for user side have problems with
 badly
  behaved or crashing userspace. So this could be more discussed later.
 
  The below is a new branch,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/?h=dma-f
  ence-helper
 
  And fence helper codes,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/commit/?
  h=dma-fence-helperid=adcbc0fe7e285ce866e5816e5e21443dcce01005
 
  And example codes for device driver,
 
  https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-
 exynos.git/commit/?
  h=dma-fence-helperid=d2ce7af23835789602a99d0ccef1f53cdd5caaae
 
  I think the time is not yet ripe for RFC posting: maybe existing dma
 fence
  and reservation need more review and addition work. So I'd glad for
 somebody
  giving other opinions and advices in advance before RFC posting.
 
  thoughts from a *really* quick, pre-coffee, first look:
  * any sort of helper to simplify single-buffer sort of use-cases (v4l)
  probably wouldn't want to bake in assumption that seqno_fence is used.
 
 Yeah, which is why MaartenI discussed ideas already for what needs to
 be improved in the current dma-buf interface code to make this Just
 Work. At least as long as a driver doesn't want to add new fences,
 which would be especially useful for all kinds of gpu access.
 
  * I guess g2d is probably not actually a simple use case, since I
  expect you can submit blits involving multiple buffers :-P
 
 Yeah, on a quick read the current fence helper code seems to be a bit
 limited in scope.
 
  * otherwise, you probably don't want to depend on dmabuf, which is why
  reservation/fence is split out the way it is..  you want to be able to
  use a single reservation/fence mechanism within your driver without
  having to care about which buffers are exported to dmabuf's and which
  are not.  Creating a dmabuf for every GEM bo is too heavyweight.
 
 That's pretty much the reason that reservations are free-standing from
 dma_bufs. The idea is to embed them into the gem/ttm/v4l buffer
 object. Maarten also has some helpers to keep track of multi-buffer
 ww_mutex locking and fence attaching in his reservation helpers, but I
 think we should wait with those until we have drivers using them.
 
 For now I think the priority should be to get the basic stuff in and
 ttm as the first user established. Then we can go nuts later on.
 
  I'm not entirely sure if reservation/fence could/should be made any
  simpler for multi-buffer users.  Probably the best thing to do is just
  get reservation/fence rolled out in a few drivers and see if some
  common patterns emerge.
 
 I think we can make the 1 buffer per dma op (i.e. 1:1
 dma_buf-reservation : fence mapping) work fairly simple in dma_buf
 with maybe a dma_buf_attachment_start_dma/end_dma helpers. But there's
 also still the open that currently there's no way to flush cpu caches
 for dma access without unmapping the attachement (or resorting to


That is what I tried to do by adding user interfaces to dmabuf: coupling
cache synchronization and buffer access between CPU and CPU, CPU and DMA, and
DMA and DMA with fences on the kernel side. We need something to do this
between mapping and unmapping the attachment.

 trick which might not work), so we have a few gaping holes in the
 interface already anyway.
 -Daniel
 --
 Daniel Vetter
 Software Engineer, Intel Corporation
 +41 (0) 79 365 57 48 - http://blog.ffwll.ch



Re: Introduce a new helper framework for buffer synchronization

2013-05-23 Thread Daniel Vetter
On Tue, May 21, 2013 at 11:22 AM, Inki Dae inki@samsung.com wrote:
 -Original Message-
 From: Daniel Vetter [mailto:daniel.vet...@ffwll.ch] On Behalf Of Daniel
 Vetter
 Sent: Tuesday, May 21, 2013 4:45 PM
 To: Inki Dae
 Cc: 'Daniel Vetter'; 'Rob Clark'; 'linux-fbdev'; 'DRI mailing list';
 'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization

 On Tue, May 21, 2013 at 04:03:06PM +0900, Inki Dae wrote:
   - Integration of fence syncing into dma_buf. Imo we should have a
 per-attachment mode which decides whether map/unmap (and the new
 sync)
 should wait for fences or whether the driver takes care of syncing
 through the new fence framework itself (for direct hw sync).
 
 I think it's a good idea to have a per-attachment mode for buffer sync. But
 I'd like us to first make sure what the purpose of map
 (dma_buf_map_attachment) is. This interface is used to get an sgt containing
 the pages of the physical memory region, and to map that region through the
 device's iommu table. The important thing is that this interface is called
 just once, when the user wants to share an allocated buffer with dma. But the
 cpu will try to access the buffer any time it wants. Therefore, we need cache
 control every time the cpu and dma access the shared buffer: a cache clean
 when ownership moves from cpu to dma and a cache invalidate when it moves
 from dma to cpu. That is why I added the new interfaces, DMA_BUF_GET_FENCE
 and DMA_BUF_PUT_FENCE, to the dma-buf framework. Of course, those are ugly,
 so we need a better way: I just wanted to show that in such a way we can
 synchronize the shared buffer between cpu and dma. By any chance, is there
 any way the kernel can be aware of when the cpu accesses the shared buffer,
 or is there some point I didn't catch?

 Well dma_buf_map/unmap_attachment should also flush/invalidate any caches,
 and with the current dma_buf spec those two functions are the _only_ means

 I know that the dma-buf exporter should make sure that cache clean/invalidate
 are done when dma_buf_map/unmap_attachment is called. We already do so.
 However, this function is called when dma-buf import is requested by the user
 to map a dmabuf fd with dma. This means that dma_buf_map_attachment() is
 called ONCE, when the user wants to share the dmabuf fd with dma. Actually,
 in the case of exynos drm, dma_map_sg_attrs(), which performs the cache
 operations internally, is called when dmabuf import is requested by the user.

 you have to do so. Which strictly means that if you interleave device dma
 and cpu acccess you need to unmap/map every time.

 Which is obviously horribly inefficient, but a known gap in the dma_buf

 Right, and this also has a big overhead.

 interface. Hence why I proposed to add dma_buf_sync_attachment similar to
 dma_sync_single_for_device:

 /**
  * dma_buf_sync_sg_attachment - sync caches for dma access
  * @attach: dma-buf attachment to sync
  * @sgt: the sg table to sync (returned by dma_buf_map_attachement)
  * @direction: dma direction to sync for
  *
  * This function synchronizes caches for device dma through the given
  * dma-buf attachment when interleaving dma from different devices and the
  * cpu. Other device dma needs to be synced also by calls to this
  * function (or a pair of dma_buf_map/unmap_attachment calls), cpu access
  * needs to be synced with dma_buf_begin/end_cpu_access.
  */
 void dma_buf_sync_sg_attachment(struct dma_buf_attachment *attach,
   struct sg_table *sgt,
   enum dma_data_direction direction)

 Note that sync here only means to synchronize caches, not wait for any
 outstanding fences. This is simply to be consistent with the established
 lingo of the dma api. How the dma-buf fences fit into this is imo a
 different topic, but my idea is that all the cache coherency barriers
 (i.e. dma_buf_map/unmap_attachment, dma_buf_sync_sg_attachment and
 dma_buf_begin/end_cpu_access) would automatically block for any attached
 fences (i.e. block for write fences when doing read-only access, block for
 all fences otherwise).
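
 Put together, interleaved device and cpu access would then look roughly like
 the sketch below; dma_buf_sync_sg_attachment is only the proposal above, and
 the begin/end_cpu_access arguments follow the dma-buf API of the time:

 #include <linux/dma-buf.h>

 static void example_interleaved_access(struct dma_buf *dmabuf,
					struct dma_buf_attachment *attach,
					size_t size)
 {
	struct sg_table *sgt;

	/* map once, at import/attach time */
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt))
		return;

	/* before each device dma: sync caches for the device */
	dma_buf_sync_sg_attachment(attach, sgt, DMA_TO_DEVICE);
	/* ... kick the hardware, wait for completion ... */

	/* cpu access in between is bracketed the usual way */
	dma_buf_begin_cpu_access(dmabuf, 0, size, DMA_FROM_DEVICE);
	/* ... cpu reads/writes ... */
	dma_buf_end_cpu_access(dmabuf, 0, size, DMA_FROM_DEVICE);

	/* unmap only when the attachment is torn down */
	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
 }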

 As I mentioned already, the kernel can't be aware of when the cpu accesses a
 shared buffer: the user can access a shared buffer anytime after mmap, and the
 shared buffer should be synchronized between cpu and dma. Therefore, the above
 cache coherency barriers should be called every time the cpu or dma tries to
 access a shared buffer, before and after each cpu and dma access. And that is
 exactly what the proposed way does. For this, you can refer to the link below:

 http://www.mail-archive.com/linux-media@vger.kernel.org/msg62124.html

 My point is: how can the kernel be aware of when those cache coherency
 barriers should be called to synchronize caches and buffer access between cpu
 and dma?


 Then we could do a new dma_buf_attach_flags interface for special cases
 (might also

RE: Introduce a new helper framework for buffer synchronization

2013-05-23 Thread Inki Dae
 -Original Message-
 From: daniel.vet...@ffwll.ch [mailto:daniel.vet...@ffwll.ch] On Behalf Of
 Daniel Vetter
 Sent: Thursday, May 23, 2013 8:56 PM
 To: Inki Dae
 Cc: Rob Clark; linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham;
 YoungJun Cho; linux-arm-ker...@lists.infradead.org; linux-
 me...@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 On Tue, May 21, 2013 at 11:22 AM, Inki Dae inki@samsung.com wrote:
  -Original Message-
  From: Daniel Vetter [mailto:daniel.vet...@ffwll.ch] On Behalf Of Daniel
  Vetter
  Sent: Tuesday, May 21, 2013 4:45 PM
  To: Inki Dae
  Cc: 'Daniel Vetter'; 'Rob Clark'; 'linux-fbdev'; 'DRI mailing list';
  'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
  ker...@lists.infradead.org; linux-media@vger.kernel.org
  Subject: Re: Introduce a new helper framework for buffer
 synchronization
 
  On Tue, May 21, 2013 at 04:03:06PM +0900, Inki Dae wrote:
- Integration of fence syncing into dma_buf. Imo we should have a
  per-attachment mode which decides whether map/unmap (and the new
  sync)
  should wait for fences or whether the driver takes care of
 syncing
  through the new fence framework itself (for direct hw sync).
  
   I think it's a good idea to have per-attachment mode for buffer sync.
  But
   I'd like to say we make sure what is the purpose of map
   (dma_buf_map_attachment)first. This interface is used to get a sgt;
   containing pages to physical memory region, and map that region with
   device's iommu table. The important thing is that this interface is
  called
   just one time when user wants to share an allocated buffer with dma.
 But
  cpu
   will try to access the buffer every time as long as it wants.
 Therefore,
  we
   need cache control every time cpu and dma access the shared buffer:
  cache
   clean when cpu goes to dma and cache invalidate when dma goes to cpu.
  That
   is why I added new interfaces, DMA_BUF_GET_FENCE and
 DMA_BUF_PUT_FENCE,
  to
   dma buf framework. Of course, Those are ugly so we need a better way:
 I
  just
   wanted to show you that in such way, we can synchronize the shared
  buffer
   between cpu and dma. By any chance, is there any way that kernel can
 be
   aware of when cpu accesses the shared buffer or is there any point I
  didn't
   catch up?
 
  Well dma_buf_map/unmap_attachment should also flush/invalidate any
 caches,
  and with the current dma_buf spec those two functions are the _only_
 means
 
  I know that dma buf exporter should make sure that cache
 clean/invalidate
  are done when dma_buf_map/unmap_attachement is called. For this, already
 we
  do so. However, this function is called when dma buf import is requested
 by
  user to map a dmabuf fd with dma. This means that
 dma_buf_map_attachement()
  is called ONCE when user wants to share the dmabuf fd with dma.
Actually,
 in
  case of exynos drm, dma_map_sg_attrs(), performing cache operation
  internally, is called when dmabuf import is requested by user.
 
  you have to do so. Which strictly means that if you interleave device
 dma
  and cpu acccess you need to unmap/map every time.
 
  Which is obviously horribly inefficient, but a known gap in the dma_buf
 
  Right, and also this has big overhead.
 
  interface. Hence why I proposed to add dma_buf_sync_attachment similar
 to
  dma_sync_single_for_device:
 
  /**
   * dma_buf_sync_sg_attachment - sync caches for dma access
   * @attach: dma-buf attachment to sync
   * @sgt: the sg table to sync (returned by dma_buf_map_attachement)
   * @direction: dma direction to sync for
   *
   * This function synchronizes caches for device dma through the given
   * dma-buf attachment when interleaving dma from different devices and
 the
   * cpu. Other device dma needs to be synced also by calls to this
   * function (or a pair of dma_buf_map/unmap_attachment calls), cpu
 access
   * needs to be synced with dma_buf_begin/end_cpu_access.
   */
  void dma_buf_sync_sg_attachment(struct dma_buf_attachment *attach,
struct sg_table *sgt,
enum dma_data_direction direction)
 
  Note that sync here only means to synchronize caches, not wait for
 any
  outstanding fences. This is simply to be consistent with the
 established
  lingo of the dma api. How the dma-buf fences fit into this is imo a
  different topic, but my idea is that all the cache coherency barriers
  (i.e. dma_buf_map/unmap_attachment, dma_buf_sync_sg_attachment and
  dma_buf_begin/end_cpu_access) would automatically block for any
 attached
  fences (i.e. block for write fences when doing read-only access, block
 for
  all fences otherwise).
 
  As I mentioned already, kernel can't aware of when cpu accesses a shared
  buffer: user can access a shared buffer after mmap anytime and the
 shared
  buffer should be synchronized between cpu and dma. Therefore, the above
  cache coherency barriers should

RE: Introduce a new helper framework for buffer synchronization

2013-05-21 Thread Inki Dae


 -Original Message-
 From: Daniel Vetter [mailto:daniel.vet...@ffwll.ch] On Behalf Of Daniel
 Vetter
 Sent: Tuesday, May 21, 2013 6:31 AM
 To: Inki Dae
 Cc: Rob Clark; linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham;
 YoungJun Cho; linux-arm-ker...@lists.infradead.org; linux-
 me...@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 On Mon, May 20, 2013 at 11:13:04PM +0200, Daniel Vetter wrote:
  On Sat, May 18, 2013 at 03:47:43PM +0900, Inki Dae wrote:
   2013/5/15 Rob Clark robdcl...@gmail.com
  
On Wed, May 15, 2013 at 1:19 AM, Inki Dae inki@samsung.com
 wrote:


 -Original Message-
 From: Rob Clark [mailto:robdcl...@gmail.com]
 Sent: Tuesday, May 14, 2013 10:39 PM
 To: Inki Dae
 Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham;
 YoungJun
 Cho; linux-arm-ker...@lists.infradead.org; linux-
 me...@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer
 synchronization

 On Mon, May 13, 2013 at 10:52 PM, Inki Dae inki@samsung.com
wrote:
  well, for cache management, I think it is a better idea.. I
 didn't
  really catch that this was the motivation from the initial
 patch, but
  maybe I read it too quickly.  But cache can be decoupled from
  synchronization, because CPU access is not asynchronous.  For
  userspace/CPU access to buffer, you should:
 
1) wait for buffer
2) prepare-access
3)  ... do whatever cpu access to buffer ...
4) finish-access
5) submit buffer for new dma-operation
 
 
 
  For data flow from CPU to DMA device,
  1) wait for buffer
  2) prepare-access (dma_buf_begin_cpu_access)
  3) cpu access to buffer
 
 
  For data flow from DMA device to CPU
  1) wait for buffer

 Right, but CPU access isn't asynchronous (from the point of view
 of
 the CPU), so there isn't really any wait step at this point.  And
 if
 you do want the CPU to be able to signal a fence from userspace
 for
 some reason, you probably what something file/fd based so the
 refcnting/cleanup when process dies doesn't leave some pending
 DMA
 action wedged.  But I don't really see the point of that
 complexity
 when the CPU access isn't asynchronous in the first place.


 There was my missing comments, please see the below sequence.

 For data flow from CPU to DMA device and then from DMA device to
 CPU,
 1) wait for buffer - at user side - ioctl(fd,
 DMA_BUF_GET_FENCE, ...)
 - including prepare-access (dma_buf_begin_cpu_access)
 2) cpu access to buffer
 3) wait for buffer - at device driver
 - but CPU is already accessing the buffer so blocked.
 4) signal - at user side - ioctl(fd, DMA_BUF_PUT_FENCE, ...)
 5) the thread, blocked at 3), is waked up by 4).
 - and then finish-access (dma_buf_end_cpu_access)
   
right, I understand you can have background threads, etc, in
userspace.  But there are already plenty of synchronization
 primitives
that can be used for cpu-cpu synchronization, either within the
 same
process or between multiple processes.  For cpu access, even if it
 is
handled by background threads/processes, I think it is better to use
the traditional pthreads or unix synchronization primitives.  They
have existed forever, they are well tested, and they work.
   
So while it seems nice and orthogonal/clean to couple cache and
synchronization and handle dma-cpu and cpu-cpu and cpu-dma in the
same generic way, but I think in practice we have to make things
 more
complex than they otherwise need to be to do this.  Otherwise I
 think
we'll be having problems with badly behaved or crashing userspace.
   
   
   Right, this is very important. I think it's not esay to solve this
   issue. Aand at least for ARM based embedded system, such feature is
 useful
   to us; coupling cache operation and buffer synchronization. I'd like
 to
   collect other opinions and advices to solve this issue.
 
  Maybe we have a bit a misunderstanding here. The kernel really should
 take
  care of sync and cache coherency, and I agree that with the current
  dma_buf code (and the proposed fences) that part is still a bit hazy.
 But
  the kernel should not allow userspace to block access to a buffer until
  userspace is done. It should only sync with any oustanding fences and
  flush buffers before that userspace access happens.
 
  Then the kernel would again flush caches on the next dma access (which
  hopefully is only started once userspace completed access). Atm this
 isn't
  possible in an efficient way since the dma_buf api only exposes
 map/unmap
  attachment and not a function to just sync an existing mapping like
  dma_sync_single_for_device. I guess we should add a
  dma_buf_sync_attachment interface

Re: Introduce a new helper framework for buffer synchronization

2013-05-21 Thread Daniel Vetter
On Tue, May 21, 2013 at 04:03:06PM +0900, Inki Dae wrote:
  - Integration of fence syncing into dma_buf. Imo we should have a
per-attachment mode which decides whether map/unmap (and the new sync)
should wait for fences or whether the driver takes care of syncing
through the new fence framework itself (for direct hw sync).
 
 I think it's a good idea to have a per-attachment mode for buffer sync. But
 first I'd like to make sure what the purpose of map (dma_buf_map_attachment)
 is. This interface is used to get an sgt containing the pages of a physical
 memory region, and to map that region through the device's iommu table. The
 important thing is that this interface is called just one time, when the user
 wants to share an allocated buffer with dma, but the cpu will try to access
 the buffer every time it wants to. Therefore, we need cache control every
 time the cpu or dma accesses the shared buffer: cache clean when going from
 cpu to dma and cache invalidate when going from dma to cpu. That is why I
 added new interfaces, DMA_BUF_GET_FENCE and DMA_BUF_PUT_FENCE, to the dma-buf
 framework. Of course, those are ugly so we need a better way: I just wanted
 to show you that in such a way we can synchronize the shared buffer between
 cpu and dma. By any chance, is there any way that the kernel can be aware of
 when the cpu accesses the shared buffer, or is there any point I didn't
 catch?

Well dma_buf_map/unmap_attachment should also flush/invalidate any caches,
and with the current dma_buf spec those two functions are the _only_ means
you have to do so. Which strictly means that if you interleave device dma
and cpu access you need to unmap/map every time.

Which is obviously horribly inefficient, but a known gap in the dma_buf
interface. Hence why I proposed to add dma_buf_sync_attachment similar to
dma_sync_single_for_device:

/**
 * dma_buf_sync_sg_attachment - sync caches for dma access
 * @attach: dma-buf attachment to sync
 * @sgt: the sg table to sync (returned by dma_buf_map_attachment)
 * @direction: dma direction to sync for
 *
 * This function synchronizes caches for device dma through the given
 * dma-buf attachment when interleaving dma from different devices and the
 * cpu. Other device dma needs to be synced also by calls to this
 * function (or a pair of dma_buf_map/unmap_attachment calls), cpu access
 * needs to be synced with dma_buf_begin/end_cpu_access.
 */
void dma_buf_sync_sg_attachment(struct dma_buf_attachment *attach,
struct sg_table *sgt,
enum dma_data_direction direction)

Note that sync here only means to synchronize caches, not wait for any
outstanding fences. This is simply to be consistent with the established
lingo of the dma api. How the dma-buf fences fit into this is imo a
different topic, but my idea is that all the cache coherency barriers
(i.e. dma_buf_map/unmap_attachment, dma_buf_sync_sg_attachment and
dma_buf_begin/end_cpu_access) would automatically block for any attached
fences (i.e. block for write fences when doing read-only access, block for
all fences otherwise).

Then we could do a new dma_buf_attach_flags interface for special cases
(might also be useful for other things, similar to the recently added
flags in the dma api for wc/no-kernel-mapping/...). A new flag like
DMA_BUF_ATTACH_NO_AUTOFENCE would then mean that the driver takes care of all
fencing for this attachment and the dma-buf functions should not do the
automagic fence blocking.
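
To make the calling pattern concrete, a rough sketch (dma_buf_sync_sg_attachment
and the attach flag are only proposed in this thread, not existing dma-buf
interfaces; dma_buf_map/unmap_attachment are the existing calls):

    struct sg_table *sgt;

    /* once, at attach/import time */
    sgt = dma_buf_map_attachment(attach, DMA_TO_DEVICE);
    if (IS_ERR(sgt))
            return PTR_ERR(sgt);

    /* per dma operation, instead of a full unmap/map cycle */
    dma_buf_sync_sg_attachment(attach, sgt, DMA_TO_DEVICE);
    /* ... kick off device dma that reads from the buffer ... */

    /* once, at detach time */
    dma_buf_unmap_attachment(attach, sgt, DMA_TO_DEVICE);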

 In the Linux kernel, especially for embedded systems, we have tried to
 implement generic interfaces for buffer management: how to allocate a buffer
 and how to share a buffer. As a result, we now have CMA (Contiguous Memory
 Allocator) for buffer allocation, and we have DMA-BUF for buffer sharing
 between cpu and dma. Then how do we synchronize a buffer between cpu and dma?
 I think now is the best time to discuss a buffer synchronization mechanism,
 and that is very natural.

I think it's important to differentiate between the two meanings of sync:
- synchronize caches (i.e. cpu cache flushing in most cases)
- and synchronize in-flight dma with fences.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


RE: Introduce a new helper framework for buffer synchronization

2013-05-21 Thread Inki Dae


 -Original Message-
 From: Daniel Vetter [mailto:daniel.vet...@ffwll.ch] On Behalf Of Daniel
 Vetter
 Sent: Tuesday, May 21, 2013 4:45 PM
 To: Inki Dae
 Cc: 'Daniel Vetter'; 'Rob Clark'; 'linux-fbdev'; 'DRI mailing list';
 'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 On Tue, May 21, 2013 at 04:03:06PM +0900, Inki Dae wrote:
   - Integration of fence syncing into dma_buf. Imo we should have a
 per-attachment mode which decides whether map/unmap (and the new
 sync)
 should wait for fences or whether the driver takes care of syncing
 through the new fence framework itself (for direct hw sync).
 
  I think it's a good idea to have per-attachment mode for buffer sync.
 But
  I'd like to say we make sure what is the purpose of map
  (dma_buf_map_attachment)first. This interface is used to get a sgt;
  containing pages to physical memory region, and map that region with
  device's iommu table. The important thing is that this interface is
 called
  just one time when user wants to share an allocated buffer with dma. But
 cpu
  will try to access the buffer every time as long as it wants. Therefore,
 we
  need cache control every time cpu and dma access the shared buffer:
 cache
  clean when cpu goes to dma and cache invalidate when dma goes to cpu.
 That
  is why I added new interfaces, DMA_BUF_GET_FENCE and DMA_BUF_PUT_FENCE,
 to
  dma buf framework. Of course, Those are ugly so we need a better way: I
 just
  wanted to show you that in such way, we can synchronize the shared
 buffer
  between cpu and dma. By any chance, is there any way that kernel can be
  aware of when cpu accesses the shared buffer or is there any point I
 didn't
  catch up?
 
 Well dma_buf_map/unmap_attachment should also flush/invalidate any caches,
 and with the current dma_buf spec those two functions are the _only_ means

I know that the dma-buf exporter should make sure that cache clean/invalidate
are done when dma_buf_map/unmap_attachment is called, and we already do so.
However, this function is called when a dma-buf import is requested by the
user to map a dmabuf fd for dma. This means that dma_buf_map_attachment() is
called ONCE, when the user wants to share the dmabuf fd with dma. Actually, in
the case of exynos drm, dma_map_sg_attrs(), which performs the cache operation
internally, is called when the dmabuf import is requested by the user.
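
For reference, the per-access cache maintenance being discussed maps onto the
existing streaming dma api roughly like this (just a sketch; dev and sgt are
whatever the exporter used for the mapping, nothing exynos specific):

    /* cpu -> dma handoff: clean cpu caches before the device reads */
    dma_sync_sg_for_device(dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
    /* ... device dma ... */

    /* dma -> cpu handoff: invalidate cpu caches after the device has written */
    dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->nents, DMA_FROM_DEVICE);
    /* ... cpu reads the buffer ... */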

 you have to do so. Which strictly means that if you interleave device dma
 and cpu access you need to unmap/map every time.
 
 Which is obviously horribly inefficient, but a known gap in the dma_buf

Right, and this also has a big overhead.

 interface. Hence why I proposed to add dma_buf_sync_attachment similar to
 dma_sync_single_for_device:
 
 /**
  * dma_buf_sync_sg_attachment - sync caches for dma access
  * @attach: dma-buf attachment to sync
  * @sgt: the sg table to sync (returned by dma_buf_map_attachment)
  * @direction: dma direction to sync for
  *
  * This function synchronizes caches for device dma through the given
  * dma-buf attachment when interleaving dma from different devices and the
  * cpu. Other device dma needs to be synced also by calls to this
  * function (or a pair of dma_buf_map/unmap_attachment calls), cpu access
  * needs to be synced with dma_buf_begin/end_cpu_access.
  */
 void dma_buf_sync_sg_attachment(struct dma_buf_attachment *attach,
   struct sg_table *sgt,
   enum dma_data_direction direction)
 
 Note that sync here only means to synchronize caches, not wait for any
 outstanding fences. This is simply to be consistent with the established
 lingo of the dma api. How the dma-buf fences fit into this is imo a
 different topic, but my idea is that all the cache coherency barriers
 (i.e. dma_buf_map/unmap_attachment, dma_buf_sync_sg_attachment and
 dma_buf_begin/end_cpu_access) would automatically block for any attached
 fences (i.e. block for write fences when doing read-only access, block for
 all fences otherwise).

As I mentioned already, the kernel can't be aware of when the cpu accesses a
shared buffer: user space can access a shared buffer any time after mmap, and
the shared buffer should be synchronized between cpu and dma. Therefore, the
above cache coherency barriers should be called every time the cpu or dma
tries to access a shared buffer, bracketing each cpu and dma access. And that
is exactly what the proposed way does. For this, you can refer to the
below link,

http://www.mail-archive.com/linux-media@vger.kernel.org/msg62124.html

My point is: how can the kernel be aware of when those cache coherency
barriers should be called to synchronize caches and buffer access between cpu
and dma?

 
 Then we could do a new dma_buf_attach_flags interface for special cases
 (might also be useful for other things, similar to the recently added
 flags in the dma

Re: Introduce a new helper framework for buffer synchronization

2013-05-20 Thread Daniel Vetter
On Sat, May 18, 2013 at 03:47:43PM +0900, Inki Dae wrote:
 2013/5/15 Rob Clark robdcl...@gmail.com
 
  On Wed, May 15, 2013 at 1:19 AM, Inki Dae inki@samsung.com wrote:
  
  
   -Original Message-
   From: Rob Clark [mailto:robdcl...@gmail.com]
   Sent: Tuesday, May 14, 2013 10:39 PM
   To: Inki Dae
   Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
   Cho; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org
   Subject: Re: Introduce a new helper framework for buffer synchronization
  
   On Mon, May 13, 2013 at 10:52 PM, Inki Dae inki@samsung.com
  wrote:
well, for cache management, I think it is a better idea.. I didn't
really catch that this was the motivation from the initial patch, but
maybe I read it too quickly.  But cache can be decoupled from
synchronization, because CPU access is not asynchronous.  For
userspace/CPU access to buffer, you should:
   
  1) wait for buffer
  2) prepare-access
  3)  ... do whatever cpu access to buffer ...
  4) finish-access
  5) submit buffer for new dma-operation
   
   
   
For data flow from CPU to DMA device,
1) wait for buffer
2) prepare-access (dma_buf_begin_cpu_access)
3) cpu access to buffer
   
   
For data flow from DMA device to CPU
1) wait for buffer
  
   Right, but CPU access isn't asynchronous (from the point of view of
   the CPU), so there isn't really any wait step at this point.  And if
   you do want the CPU to be able to signal a fence from userspace for
   some reason, you probably what something file/fd based so the
   refcnting/cleanup when process dies doesn't leave some pending DMA
   action wedged.  But I don't really see the point of that complexity
   when the CPU access isn't asynchronous in the first place.
  
  
   There was my missing comments, please see the below sequence.
  
   For data flow from CPU to DMA device and then from DMA device to CPU,
   1) wait for buffer - at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
   - including prepare-access (dma_buf_begin_cpu_access)
   2) cpu access to buffer
   3) wait for buffer - at device driver
   - but CPU is already accessing the buffer so blocked.
   4) signal - at user side - ioctl(fd, DMA_BUF_PUT_FENCE, ...)
   5) the thread, blocked at 3), is waked up by 4).
   - and then finish-access (dma_buf_end_cpu_access)
 
  right, I understand you can have background threads, etc, in
  userspace.  But there are already plenty of synchronization primitives
  that can be used for cpu-cpu synchronization, either within the same
  process or between multiple processes.  For cpu access, even if it is
  handled by background threads/processes, I think it is better to use
  the traditional pthreads or unix synchronization primitives.  They
  have existed forever, they are well tested, and they work.
 
  So while it seems nice and orthogonal/clean to couple cache and
  synchronization and handle dma-cpu and cpu-cpu and cpu-dma in the
  same generic way, but I think in practice we have to make things more
  complex than they otherwise need to be to do this.  Otherwise I think
  we'll be having problems with badly behaved or crashing userspace.
 
 
 Right, this is very important. I think it's not easy to solve this issue.
 And at least for ARM based embedded systems, such a feature is useful to us:
 coupling cache operations and buffer synchronization. I'd like to collect
 other opinions and advice to solve this issue.

Maybe we have a bit a misunderstanding here. The kernel really should take
care of sync and cache coherency, and I agree that with the current
dma_buf code (and the proposed fences) that part is still a bit hazy. But
the kernel should not allow userspace to block access to a buffer until
userspace is done. It should only sync with any outstanding fences and
flush buffers before that userspace access happens.

Then the kernel would again flush caches on the next dma access (which
hopefully is only started once userspace completed access). Atm this isn't
possible in an efficient way since the dma_buf api only exposes map/unmap
attachment and not a function to just sync an existing mapping like
dma_sync_single_for_device. I guess we should add a
dma_buf_sync_attachment interface for that - we already have range-based
cpu sync support with the begin/end_cpu_access interfaces.
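
For the cpu side the bracket already looks roughly like this (a sketch against
the current interfaces, which still take a start/len range; dma_buf_vmap is
assumed for the kernel mapping and error handling is trimmed):

    #include <linux/dma-buf.h>
    #include <linux/errno.h>
    #include <linux/string.h>

    static int cpu_read_shared_buffer(struct dma_buf *dmabuf, void *dst, size_t len)
    {
            void *vaddr;
            int ret;

            /* let the exporter prepare the range for cpu reads (cache invalidate etc.) */
            ret = dma_buf_begin_cpu_access(dmabuf, 0, len, DMA_FROM_DEVICE);
            if (ret)
                    return ret;

            vaddr = dma_buf_vmap(dmabuf);
            if (vaddr) {
                    memcpy(dst, vaddr, len);
                    dma_buf_vunmap(dmabuf, vaddr);
            }

            /* let the exporter clean up before the next device access */
            dma_buf_end_cpu_access(dmabuf, 0, len, DMA_FROM_DEVICE);

            return vaddr ? 0 : -ENOMEM;
    }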

Yours, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: Introduce a new helper framework for buffer synchronization

2013-05-20 Thread Daniel Vetter
On Mon, May 20, 2013 at 11:13:04PM +0200, Daniel Vetter wrote:
 On Sat, May 18, 2013 at 03:47:43PM +0900, Inki Dae wrote:
  2013/5/15 Rob Clark robdcl...@gmail.com
  
   On Wed, May 15, 2013 at 1:19 AM, Inki Dae inki@samsung.com wrote:
   
   
-Original Message-
From: Rob Clark [mailto:robdcl...@gmail.com]
Sent: Tuesday, May 14, 2013 10:39 PM
To: Inki Dae
Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; 
YoungJun
Cho; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org
Subject: Re: Introduce a new helper framework for buffer 
synchronization
   
On Mon, May 13, 2013 at 10:52 PM, Inki Dae inki@samsung.com
   wrote:
 well, for cache management, I think it is a better idea.. I didn't
 really catch that this was the motivation from the initial patch, 
 but
 maybe I read it too quickly.  But cache can be decoupled from
 synchronization, because CPU access is not asynchronous.  For
 userspace/CPU access to buffer, you should:

   1) wait for buffer
   2) prepare-access
   3)  ... do whatever cpu access to buffer ...
   4) finish-access
   5) submit buffer for new dma-operation



 For data flow from CPU to DMA device,
 1) wait for buffer
 2) prepare-access (dma_buf_begin_cpu_access)
 3) cpu access to buffer


 For data flow from DMA device to CPU
 1) wait for buffer
   
Right, but CPU access isn't asynchronous (from the point of view of
the CPU), so there isn't really any wait step at this point.  And if
you do want the CPU to be able to signal a fence from userspace for
some reason, you probably what something file/fd based so the
refcnting/cleanup when process dies doesn't leave some pending DMA
action wedged.  But I don't really see the point of that complexity
when the CPU access isn't asynchronous in the first place.
   
   
There was my missing comments, please see the below sequence.
   
For data flow from CPU to DMA device and then from DMA device to CPU,
1) wait for buffer - at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
- including prepare-access (dma_buf_begin_cpu_access)
2) cpu access to buffer
3) wait for buffer - at device driver
- but CPU is already accessing the buffer so blocked.
4) signal - at user side - ioctl(fd, DMA_BUF_PUT_FENCE, ...)
5) the thread, blocked at 3), is waked up by 4).
- and then finish-access (dma_buf_end_cpu_access)
  
   right, I understand you can have background threads, etc, in
   userspace.  But there are already plenty of synchronization primitives
   that can be used for cpu-cpu synchronization, either within the same
   process or between multiple processes.  For cpu access, even if it is
   handled by background threads/processes, I think it is better to use
   the traditional pthreads or unix synchronization primitives.  They
   have existed forever, they are well tested, and they work.
  
   So while it seems nice and orthogonal/clean to couple cache and
   synchronization and handle dma-cpu and cpu-cpu and cpu-dma in the
   same generic way, but I think in practice we have to make things more
   complex than they otherwise need to be to do this.  Otherwise I think
   we'll be having problems with badly behaved or crashing userspace.
  
  
  Right, this is very important. I think it's not easy to solve this issue.
  And at least for ARM based embedded systems, such a feature is useful to us:
  coupling cache operations and buffer synchronization. I'd like to collect
  other opinions and advice to solve this issue.
 
 Maybe we have a bit a misunderstanding here. The kernel really should take
 care of sync and cache coherency, and I agree that with the current
 dma_buf code (and the proposed fences) that part is still a bit hazy. But
 the kernel should not allow userspace to block access to a buffer until
 userspace is done. It should only sync with any outstanding fences and
 flush buffers before that userspace access happens.
 
 Then the kernel would again flush caches on the next dma access (which
 hopefully is only started once userspace completed access). Atm this isn't
 possible in an efficient way since the dma_buf api only exposes map/unmap
 attachment and not a function to just sync an existing mapping like
 dma_sync_single_for_device. I guess we should add a
 dma_buf_sync_attachment interface for that - we already have range-based
 cpu sync support with the begin/end_cpu_access interfaces.

I think my mail here was a bit unclear, so let me try to rephrase.
Re-reading through part of this discussion I think you raise some good
shortcomings of the current dma_buf interfaces and the proposed fence
patches. But I think we should address the individual pieces bit-by-bit.
On a quick survey I see a few parts to what you're trying to solve:

- More efficient cache coherency management. I think we

Re: Introduce a new helper framework for buffer synchronization

2013-05-16 Thread Daniel Vetter
On Wed, May 15, 2013 at 4:06 PM, Rob Clark robdcl...@gmail.com wrote:
 So while it seems nice and orthogonal/clean to couple cache and
 synchronization and handle dma-cpu and cpu-cpu and cpu-dma in the
 same generic way, but I think in practice we have to make things more
 complex than they otherwise need to be to do this.  Otherwise I think
 we'll be having problems with badly behaved or crashing userspace.

I haven't read through the entire thread carefully, but imo this is very
important. If we add a fence interface which allows userspace to block
dma, this is a no-go. The only thing we need is to sync up with all
outstanding dma operations and flush caches for cpu access. If broken
userspace starts to issue new dma (or multiple threads stomp onto one
another), that's not a problem dma fences/syncpoints should try to
solve. This way we can concentrate on solving the (already
challenging) device-to-device sync issues without the additional
complexity which cpu-cpu sync would impose.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: Introduce a new helper framework for buffer synchronization

2013-05-15 Thread Rob Clark
On Wed, May 15, 2013 at 1:19 AM, Inki Dae inki@samsung.com wrote:


 -Original Message-
 From: Rob Clark [mailto:robdcl...@gmail.com]
 Sent: Tuesday, May 14, 2013 10:39 PM
 To: Inki Dae
 Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
 Cho; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization

 On Mon, May 13, 2013 at 10:52 PM, Inki Dae inki@samsung.com wrote:
  well, for cache management, I think it is a better idea.. I didn't
  really catch that this was the motivation from the initial patch, but
  maybe I read it too quickly.  But cache can be decoupled from
  synchronization, because CPU access is not asynchronous.  For
  userspace/CPU access to buffer, you should:
 
1) wait for buffer
2) prepare-access
3)  ... do whatever cpu access to buffer ...
4) finish-access
5) submit buffer for new dma-operation
 
 
 
  For data flow from CPU to DMA device,
  1) wait for buffer
  2) prepare-access (dma_buf_begin_cpu_access)
  3) cpu access to buffer
 
 
  For data flow from DMA device to CPU
  1) wait for buffer

 Right, but CPU access isn't asynchronous (from the point of view of
 the CPU), so there isn't really any wait step at this point.  And if
 you do want the CPU to be able to signal a fence from userspace for
 some reason, you probably want something file/fd based so the
 refcnting/cleanup when process dies doesn't leave some pending DMA
 action wedged.  But I don't really see the point of that complexity
 when the CPU access isn't asynchronous in the first place.


 There was my missing comments, please see the below sequence.

 For data flow from CPU to DMA device and then from DMA device to CPU,
 1) wait for buffer - at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
 - including prepare-access (dma_buf_begin_cpu_access)
 2) cpu access to buffer
 3) wait for buffer - at device driver
 - but CPU is already accessing the buffer so blocked.
 4) signal - at user side - ioctl(fd, DMA_BUF_PUT_FENCE, ...)
 5) the thread, blocked at 3), is waked up by 4).
 - and then finish-access (dma_buf_end_cpu_access)

right, I understand you can have background threads, etc, in
userspace.  But there are already plenty of synchronization primitives
that can be used for cpu-cpu synchronization, either within the same
process or between multiple processes.  For cpu access, even if it is
handled by background threads/processes, I think it is better to use
the traditional pthreads or unix synchronization primitives.  They
have existed forever, they are well tested, and they work.

So while it seems nice and orthogonal/clean to couple cache and
synchronization and handle dma-cpu and cpu-cpu and cpu-dma in the
same generic way, I think in practice we have to make things more
complex than they otherwise need to be to do this.  Otherwise I think
we'll be having problems with badly behaved or crashing userspace.

BR,
-R

 6) dma access to buffer
 7) wait for buffer - at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
 - but DMA is already accessing the buffer so blocked.
 8) signal - at device driver
 9) the thread, blocked at 7), is waked up by 8)
 - and then prepare-access (dma_buf_begin_cpu_access)
 10 cpu access to buffer

 Basically, 'wait for buffer' includes buffer synchronization, commit
 processing, and cache operation. Buffer synchronization means that the
 current thread should wait for other threads accessing a shared buffer until
 their access has completed. And commit processing means that the current
 thread possesses the shared buffer, so any attempt by another thread to
 access the shared buffer makes that thread block. However, as I already
 mentioned, these user interfaces still seem ugly. So we need a better way.

 Give me more comments if there is my missing point :)

 Thanks,
 Inki Dae

 BR,
 -R


  2) finish-access (dma_buf_end_cpu_access)
  3) dma access to buffer
 
  1) and 2) are coupled with one function: we have implemented
  fence_helper_commit_reserve() for it.
 
  Cache control(cache clean or cache invalidate) is performed properly
  checking previous access type and current access type.
  And the below is actual codes for it,



Re: Introduce a new helper framework for buffer synchronization

2013-05-14 Thread Rob Clark
On Mon, May 13, 2013 at 10:52 PM, Inki Dae inki@samsung.com wrote:
 well, for cache management, I think it is a better idea.. I didn't
 really catch that this was the motivation from the initial patch, but
 maybe I read it too quickly.  But cache can be decoupled from
 synchronization, because CPU access is not asynchronous.  For
 userspace/CPU access to buffer, you should:

   1) wait for buffer
   2) prepare-access
   3)  ... do whatever cpu access to buffer ...
   4) finish-access
   5) submit buffer for new dma-operation



 For data flow from CPU to DMA device,
 1) wait for buffer
 2) prepare-access (dma_buf_begin_cpu_access)
 3) cpu access to buffer


 For data flow from DMA device to CPU
 1) wait for buffer

Right, but CPU access isn't asynchronous (from the point of view of
the CPU), so there isn't really any wait step at this point.  And if
you do want the CPU to be able to signal a fence from userspace for
some reason, you probably want something file/fd based so the
refcnting/cleanup when process dies doesn't leave some pending DMA
action wedged.  But I don't really see the point of that complexity
when the CPU access isn't asynchronous in the first place.

BR,
-R


 2) finish-access (dma_buf_end_cpu_access)
 3) dma access to buffer

 1) and 2) are coupled with one function: we have implemented
 fence_helper_commit_reserve() for it.

 Cache control (cache clean or cache invalidate) is performed properly by
 checking the previous access type and the current access type.
 And below is the actual code for it,


RE: Introduce a new helper framework for buffer synchronization

2013-05-14 Thread Inki Dae


 -Original Message-
 From: Rob Clark [mailto:robdcl...@gmail.com]
 Sent: Tuesday, May 14, 2013 10:39 PM
 To: Inki Dae
 Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
 Cho; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 On Mon, May 13, 2013 at 10:52 PM, Inki Dae inki@samsung.com wrote:
  well, for cache management, I think it is a better idea.. I didn't
  really catch that this was the motivation from the initial patch, but
  maybe I read it too quickly.  But cache can be decoupled from
  synchronization, because CPU access is not asynchronous.  For
  userspace/CPU access to buffer, you should:
 
1) wait for buffer
2) prepare-access
3)  ... do whatever cpu access to buffer ...
4) finish-access
5) submit buffer for new dma-operation
 
 
 
  For data flow from CPU to DMA device,
  1) wait for buffer
  2) prepare-access (dma_buf_begin_cpu_access)
  3) cpu access to buffer
 
 
  For data flow from DMA device to CPU
  1) wait for buffer
 
 Right, but CPU access isn't asynchronous (from the point of view of
 the CPU), so there isn't really any wait step at this point.  And if
 you do want the CPU to be able to signal a fence from userspace for
 some reason, you probably want something file/fd based so the
 refcnting/cleanup when process dies doesn't leave some pending DMA
 action wedged.  But I don't really see the point of that complexity
 when the CPU access isn't asynchronous in the first place.


There was my missing comments, please see the below sequence.

For data flow from CPU to DMA device and then from DMA device to CPU,
1) wait for buffer - at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
- including prepare-access (dma_buf_begin_cpu_access)
2) cpu access to buffer
3) wait for buffer - at device driver
- but CPU is already accessing the buffer, so it is blocked.
4) signal - at user side - ioctl(fd, DMA_BUF_PUT_FENCE, ...)
5) the thread, blocked at 3), is woken up by 4).
- and then finish-access (dma_buf_end_cpu_access)
6) dma access to buffer
7) wait for buffer - at user side - ioctl(fd, DMA_BUF_GET_FENCE, ...)
- but DMA is already accessing the buffer, so it is blocked.
8) signal - at device driver
9) the thread, blocked at 7), is woken up by 8).
- and then prepare-access (dma_buf_begin_cpu_access)
10) cpu access to buffer

Basically, 'wait for buffer' includes buffer synchronization, commit
processing, and cache operation. Buffer synchronization means that the
current thread should wait for other threads accessing a shared buffer until
their access has completed. And commit processing means that the current
thread possesses the shared buffer, so any attempt by another thread to
access the shared buffer makes that thread block. However, as I already
mentioned, these user interfaces still seem ugly. So we need a better way.
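
In user space, steps 1) to 5) above would look something like the sketch below
(struct dma_buf_fence, DMA_BUF_ACCESS_WRITE and the GET/PUT_FENCE ioctls are
the proposed interfaces from this RFC and would come from its header; they are
not a mainline API):

    #include <string.h>
    #include <sys/ioctl.h>

    static int cpu_fill_shared_buffer(int fd, void *vaddr, size_t size)
    {
            struct dma_buf_fence df = { .type = DMA_BUF_ACCESS_WRITE };

            /* 1) wait for the buffer and take the fence,
             *    including prepare-access (dma_buf_begin_cpu_access) */
            if (ioctl(fd, DMA_BUF_GET_FENCE, &df) < 0)
                    return -1;

            /* 2) cpu access to the buffer */
            memset(vaddr, 0, size);

            /* 4) signal, so a driver thread blocked at 3) is woken up at 5)
             *    and finish-access (dma_buf_end_cpu_access) is performed */
            return ioctl(fd, DMA_BUF_PUT_FENCE, &df);
    }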

Give me more comments if there is my missing point :)

Thanks,
Inki Dae

 BR,
 -R
 
 
  2) finish-access (dma_buf_end_cpu_access)
  3) dma access to buffer
 
  1) and 2) are coupled with one function: we have implemented
  fence_helper_commit_reserve() for it.
 
  Cache control(cache clean or cache invalidate) is performed properly
  checking previous access type and current access type.
  And the below is actual codes for it,



Re: Introduce a new helper framework for buffer synchronization

2013-05-13 Thread Maarten Lankhorst
Op 09-05-13 09:33, Inki Dae schreef:
 Hi all,

 This post introduces a new helper framework based on dma fence. And the
 purpose of this post is to collect other opinions and advices before RFC
 posting.

 First of all, this helper framework, called fence helper, is in progress
 yet so might not have enough comments in codes and also might need to be
 more cleaned up. Moreover, we might be missing some parts of the dma fence.
 However, I'd like to say that all things mentioned below have been tested
 on a Linux platform and worked well.

 

 And tutorial for user process.
 just before cpu access
 struct dma_buf_fence *df;

 df->type = DMA_BUF_ACCESS_READ or DMA_BUF_ACCESS_WRITE;
 ioctl(fd, DMA_BUF_GET_FENCE, df);

 after memset or memcpy
 ioctl(fd, DMA_BUF_PUT_FENCE, df);
NAK.

Userspace doesn't need to trigger fences. It can do a buffer idle wait, and 
postpone submitting new commands until after it's done using the buffer.
Kernel space doesn't need the root hole you created by dereferencing a
pointer passed from userspace.
Your next exercise should be to write a security exploit from the api you 
created here. It's the only way to learn how to write safe code. Hint: df.ctx = 
mmap(..);

~Maarten


RE: Introduce a new helper framework for buffer synchronization

2013-05-13 Thread Inki Dae


 -Original Message-
 From: Maarten Lankhorst [mailto:maarten.lankho...@canonical.com]
 Sent: Monday, May 13, 2013 5:01 PM
 To: Inki Dae
 Cc: Rob Clark; Daniel Vetter; DRI mailing list; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org; linux-fbdev;
 Kyungmin Park; myungjoo.ham; YoungJun Cho
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 Op 09-05-13 09:33, Inki Dae schreef:
  Hi all,
 
  This post introduces a new helper framework based on dma fence. And the
  purpose of this post is to collect other opinions and advices before RFC
  posting.
 
  First of all, this helper framework, called fence helper, is in progress
  yet so might not have enough comments in codes and also might need to be
  more cleaned up. Moreover, we might be missing some parts of the dma
 fence.
  However, I'd like to say that all things mentioned below has been tested
  with Linux platform and worked well.
 
  
 
  And tutorial for user process.
  just before cpu access
  struct dma_buf_fence *df;
 
  df->type = DMA_BUF_ACCESS_READ or DMA_BUF_ACCESS_WRITE;
  ioctl(fd, DMA_BUF_GET_FENCE, df);
 
  after memset or memcpy
  ioctl(fd, DMA_BUF_PUT_FENCE, df);
 NAK.
 
 Userspace doesn't need to trigger fences. It can do a buffer idle wait,
 and postpone submitting new commands until after it's done using the
 buffer.

Hi Maarten,

It seems that you are saying the user should wait for a buffer like KDS does:
KDS uses select() to postpone submitting new commands. But I think this way
assumes that all data flows from a DMA device to the CPU. For example, the CPU
has to keep polling for the completion of a buffer access by a DMA device.
This means that this way doesn't consider data flow in the opposite case: CPU
to DMA device.

 Kernel space doesn't need the root hole you created by giving a
 dereferencing a pointer passed from userspace.
 Your next exercise should be to write a security exploit from the api you
 created here. It's the only way to learn how to write safe code. Hint:
 df.ctx = mmap(..);
 

Also I'm not sure about our way yet, and that is why I posted. As you
mentioned, it seems that using mmap() is safer. But there is one issue that
confuses me. Regarding your hint, df.ctx = mmap(..), the issue is that dmabuf
mmap can be used to map a dmabuf into user space. And the dmabuf means a
physical memory region allocated by some allocator such as drm gem or ion.

There might be a point I am missing, so could you please give me more
comments?

Thanks,
Inki Dae



 ~Maarten



Re: Introduce a new helper framework for buffer synchronization

2013-05-13 Thread Maarten Lankhorst
Op 13-05-13 11:21, Inki Dae schreef:

 -Original Message-
 From: Maarten Lankhorst [mailto:maarten.lankho...@canonical.com]
 Sent: Monday, May 13, 2013 5:01 PM
 To: Inki Dae
 Cc: Rob Clark; Daniel Vetter; DRI mailing list; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org; linux-fbdev;
 Kyungmin Park; myungjoo.ham; YoungJun Cho
 Subject: Re: Introduce a new helper framework for buffer synchronization

 Op 09-05-13 09:33, Inki Dae schreef:
 Hi all,

 This post introduces a new helper framework based on dma fence. And the
 purpose of this post is to collect other opinions and advices before RFC
 posting.

 First of all, this helper framework, called fence helper, is in progress
 yet so might not have enough comments in codes and also might need to be
 more cleaned up. Moreover, we might be missing some parts of the dma
 fence.
 However, I'd like to say that all things mentioned below has been tested
 with Linux platform and worked well.
 

 And tutorial for user process.
 just before cpu access
 struct dma_buf_fence *df;

 df->type = DMA_BUF_ACCESS_READ or DMA_BUF_ACCESS_WRITE;
 ioctl(fd, DMA_BUF_GET_FENCE, df);

 after memset or memcpy
 ioctl(fd, DMA_BUF_PUT_FENCE, df);
 NAK.

 Userspace doesn't need to trigger fences. It can do a buffer idle wait,
 and postpone submitting new commands until after it's done using the
 buffer.
 Hi Maarten,

 It seems that you say user should wait for a buffer like KDS does: KDS uses
 select() to postpone submitting new commands. But I think this way assumes
 that every data flows a DMA device to a CPU. For example, a CPU should keep
 polling for the completion of a buffer access by a DMA device. This means
 that the this way isn't considered for data flow to opposite case; CPU to
 DMA device.
Not really. You do both things the same way. You first wait for the bo to be
idle; this could be implemented by adding poll support to the dma-buf fd.
Then you either do your read or write. Since userspace is supposed to be the
one controlling the bo, it should stay idle at that point. If you have another
thread queueing the buffer again before your thread is done, that's a bug in
the application, and it can be solved with userspace locking primitives. No
need for the kernel to get involved.
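
A sketch of that flow, assuming poll() support on the dma-buf fd were added (it
does not exist at this point, so this part is hypothetical):

    #include <poll.h>
    #include <string.h>

    static int cpu_write_when_idle(int dmabuf_fd, void *vaddr, size_t size)
    {
            /* hypothetical: wait until all outstanding device access to the
             * bo has finished, via poll() on the dma-buf fd */
            struct pollfd pfd = { .fd = dmabuf_fd, .events = POLLOUT };

            if (poll(&pfd, 1, -1) < 0)
                    return -1;

            /* plain cpu access; cpu<->cpu exclusion stays in userspace
             * (pthreads etc.), the kernel is not involved */
            memset(vaddr, 0, size);

            /* only now submit new work that makes a device use the buffer */
            return 0;
    }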

 Kernel space doesn't need the root hole you created by giving a
 dereferencing a pointer passed from userspace.
 Your next exercise should be to write a security exploit from the api you
 created here. It's the only way to learn how to write safe code. Hint:
 df.ctx = mmap(..);

 Also I'm not clear to use our way yet and that is why I posted. As you
 mentioned, it seems like that using mmap() is more safe. But there is one
 issue it makes me confusing. For your hint, df.ctx = mmap(..), the issue is
 that dmabuf mmap can be used to map a dmabuf with user space. And the dmabuf
 means a physical memory region allocated by some allocator such as drm gem
 or ion.

 There might be my missing point so could you please give me more comments?

My point was that userspace could change df.ctx to some mmap'd memory, forcing 
the kernel to execute some code prepared by userspace.


RE: Introduce a new helper framework for buffer synchronization

2013-05-13 Thread Inki Dae


 -Original Message-
 From: Maarten Lankhorst [mailto:maarten.lankho...@canonical.com]
 Sent: Monday, May 13, 2013 6:52 PM
 To: Inki Dae
 Cc: 'Rob Clark'; 'Daniel Vetter'; 'DRI mailing list'; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org; 'linux-fbdev';
 'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 Op 13-05-13 11:21, Inki Dae schreef:
 
  -Original Message-
  From: Maarten Lankhorst [mailto:maarten.lankho...@canonical.com]
  Sent: Monday, May 13, 2013 5:01 PM
  To: Inki Dae
  Cc: Rob Clark; Daniel Vetter; DRI mailing list; linux-arm-
  ker...@lists.infradead.org; linux-media@vger.kernel.org; linux-fbdev;
  Kyungmin Park; myungjoo.ham; YoungJun Cho
  Subject: Re: Introduce a new helper framework for buffer
 synchronization
 
  Op 09-05-13 09:33, Inki Dae schreef:
  Hi all,
 
  This post introduces a new helper framework based on dma fence. And
 the
  purpose of this post is to collect other opinions and advices before
 RFC
  posting.
 
  First of all, this helper framework, called fence helper, is in
 progress
  yet so might not have enough comments in codes and also might need to
 be
  more cleaned up. Moreover, we might be missing some parts of the dma
  fence.
  However, I'd like to say that all things mentioned below has been
 tested
  with Linux platform and worked well.
  
 
  And tutorial for user process.
  just before cpu access
  struct dma_buf_fence *df;
 
  df-type = DMA_BUF_ACCESS_READ or
DMA_BUF_ACCESS_WRITE;
  ioctl(fd, DMA_BUF_GET_FENCE, df);
 
  after memset or memcpy
  ioctl(fd, DMA_BUF_PUT_FENCE, df);
  NAK.
 
  Userspace doesn't need to trigger fences. It can do a buffer idle wait,
  and postpone submitting new commands until after it's done using the
  buffer.
  Hi Maarten,
 
  It seems that you say user should wait for a buffer like KDS does: KDS
 uses
  select() to postpone submitting new commands. But I think this way
 assumes
  that every data flows a DMA device to a CPU. For example, a CPU should
 keep
  polling for the completion of a buffer access by a DMA device. This
 means
  that the this way isn't considered for data flow to opposite case; CPU
 to
  DMA device.
 Not really. You do both things the same way. You first wait for the bo to
 be idle, this could be implemented by adding poll support to the dma-buf
 fd.
 Then you either do your read or write. Since userspace is supposed to be
 the one controlling the bo it should stay idle at that point. If you have
 another thread queueing
 the buffer againbefore your thread is done that's a bug in the
application,
 and can be solved with userspace locking primitives. No need for the
 kernel to get involved.
 

Yes, that is how we have synchronized buffers between the CPU and DMA devices
until now, without a buffer synchronization mechanism. I thought it best to
make the user not have to consider anything: the user can access a buffer
regardless of what any DMA device is doing, and the buffer synchronization is
performed at the kernel level. Moreover, I think we could optimize graphics
and multimedia hardware performance because the hardware can do more work: one
thread accesses a shared buffer while another controls a DMA device using the
shared buffer in parallel. Thus, we could avoid sequential processing, and
that is my intention. Don't you think we could improve hardware utilization
this way, or in some other way? Of course, there could be a better way.

  Kernel space doesn't need the root hole you created by giving a
  dereferencing a pointer passed from userspace.
  Your next exercise should be to write a security exploit from the api
 you
  created here. It's the only way to learn how to write safe code. Hint:
  df.ctx = mmap(..);
 
  Also I'm not clear to use our way yet and that is why I posted. As you
  mentioned, it seems like that using mmap() is more safe. But there is
 one
  issue it makes me confusing. For your hint, df.ctx = mmap(..), the issue
 is
  that dmabuf mmap can be used to map a dmabuf with user space. And the
 dmabuf
  means a physical memory region allocated by some allocator such as drm
 gem
  or ion.
 
  There might be my missing point so could you please give me more
 comments?
 
 My point was that userspace could change df.ctx to some mmap'd memory,
 forcing the kernel to execute some code prepared by userspace.

Understood. I have to find a better way. And for this, I'd like to listen
attentively to more opinions and advice.

Thanks for comments,
Inki Dae



Re: Introduce a new helper framework for buffer synchronization

2013-05-13 Thread Maarten Lankhorst
Op 13-05-13 13:24, Inki Dae schreef:
 and can be solved with userspace locking primitives. No need for the
 kernel to get involved.

 Yes, that is how we have synchronized buffer between CPU and DMA device
 until now without buffer synchronization mechanism. I thought that it's best
 to make user not considering anything: user can access a buffer regardless
 of any DMA device controlling and the buffer synchronization is performed in
 kernel level. Moreover, I think we could optimize graphics and multimedia
 hardware performance because hardware can do more works: one thread accesses
 a shared buffer and the other controls DMA device with the shared buffer in
 parallel. Thus, we could avoid sequential processing and that is my
 intention. Aren't you think about that we could improve hardware utilization
 with such way or other? of course, there could be better way.

If you don't want to block the hardware, the only option is to allocate a
temp bo and blit to/from it using the hardware.
OpenGL already does that when you want to read back, because otherwise the
entire pipeline can get stalled.
The overhead of command submission for that shouldn't be high; if it is, you
should really try to optimize that first before coming up with this crazy
scheme.

In that case you still wouldn't give userspace control over the fences. I
don't see any way that can end well.
What if userspace never signals? What if userspace gets killed by the oom
killer? Who keeps track of that?

~Maarten


RE: Introduce a new helper framework for buffer synchronization

2013-05-13 Thread Inki Dae


 -Original Message-
 From: linux-fbdev-ow...@vger.kernel.org [mailto:linux-fbdev-
 ow...@vger.kernel.org] On Behalf Of Maarten Lankhorst
 Sent: Monday, May 13, 2013 8:41 PM
 To: Inki Dae
 Cc: 'Rob Clark'; 'Daniel Vetter'; 'DRI mailing list'; linux-arm-
 ker...@lists.infradead.org; linux-media@vger.kernel.org; 'linux-fbdev';
 'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 Op 13-05-13 13:24, Inki Dae schreef:
  and can be solved with userspace locking primitives. No need for the
  kernel to get involved.
 
  Yes, that is how we have synchronized buffer between CPU and DMA device
  until now without buffer synchronization mechanism. I thought that it's
 best
  to make user not considering anything: user can access a buffer
 regardless
  of any DMA device controlling and the buffer synchronization is
 performed in
  kernel level. Moreover, I think we could optimize graphics and
 multimedia
  hardware performance because hardware can do more works: one thread
 accesses
  a shared buffer and the other controls DMA device with the shared buffer
 in
  parallel. Thus, we could avoid sequential processing and that is my
  intention. Aren't you think about that we could improve hardware
 utilization
  with such way or other? of course, there could be better way.
 
 If you don't want to block the hardware the only option is to allocate a
 temp bo and blit to/from it using the hardware.
 OpenGL already does that when you want to read back, because otherwise the
 entire pipeline can get stalled.
 The overhead of command submission for that shouldn't be high, if it is
 you should really try to optimize that first
 before coming up with this crazy scheme.
 

I have considered all devices sharing buffers with the CPU: multimedia,
display controller and graphics devices (including the GPU). And we could
provide easy-to-use user interfaces for buffer synchronization. Of course, the
proposed user interfaces may still be ugly, but at least I think we need
something for this.

 In that case you still wouldn't give userspace control over the fences. I
 don't see any way that can end well.
 What if userspace never signals? What if userspace gets killed by oom
 killer. Who keeps track of that?
 

In all cases, all kernel resources tied to the user fence will be released by
the kernel once the fence times out: never signaling, or the process being
killed by the oom killer, makes the fence time out. And if we use the mmap
mechanism you mentioned before, I think the user resources could also be freed
properly.

Thanks,
Inki Dae

 ~Maarten


Re: Introduce a new helper framework for buffer synchronization

2013-05-13 Thread Rob Clark
On Mon, May 13, 2013 at 8:21 AM, Inki Dae inki@samsung.com wrote:

 In that case you still wouldn't give userspace control over the fences. I
 don't see any way that can end well.
 What if userspace never signals? What if userspace gets killed by oom
 killer. Who keeps track of that?


 In all cases, all kernel resources to user fence will be released by kernel
 once the fence is timed out: never signaling and process killing by oom
 killer makes the fence timed out. And if we use mmap mechanism you mentioned
 before, I think user resource could also be freed properly.


I tend to agree w/ Maarten here.. there is no good reason for
userspace to be *signaling* fences.  The exception might be some blob
gpu drivers which don't have enough knowledge in the kernel to figure
out what to do.  (In which case you can add driver private ioctls for
that.. still not the right thing to do but at least you don't make a
public API out of it.)

BR,
-R


Re: Introduce a new helper framework for buffer synchronization

2013-05-13 Thread Rob Clark
On Mon, May 13, 2013 at 1:18 PM, Inki Dae inki@samsung.com wrote:


 2013/5/13 Rob Clark robdcl...@gmail.com

 On Mon, May 13, 2013 at 8:21 AM, Inki Dae inki@samsung.com wrote:
 
  In that case you still wouldn't give userspace control over the fences.
  I
  don't see any way that can end well.
  What if userspace never signals? What if userspace gets killed by oom
  killer. Who keeps track of that?
 
 
  In all cases, all kernel resources to user fence will be released by
  kernel
  once the fence is timed out: never signaling and process killing by oom
  killer makes the fence timed out. And if we use mmap mechanism you
  mentioned
  before, I think user resource could also be freed properly.


 I tend to agree w/ Maarten here.. there is no good reason for
 userspace to be *signaling* fences.  The exception might be some blob
 gpu drivers which don't have enough knowledge in the kernel to figure
 out what to do.  (In which case you can add driver private ioctls for
 that.. still not the right thing to do but at least you don't make a
 public API out of it.)


 Please set aside for a moment whether these interfaces are generic or not,
 and consider the following three things. First, cache operations. As you
 know, an ARM SoC has an ACP (Accelerator Coherency Port) that can be
 connected to a DMA engine or similar device, and this port is used for cache
 coherency between the CPU cache and the DMA device. However, most devices on
 ARM based embedded systems don't use the ACP port, so they need proper cache
 operations before and after DMA or CPU access when a cacheable mapping is
 used. Actually, I see many Linux based platforms call cache control
 interfaces directly for that. I think the reason they do so is that the
 kernel isn't aware of when and how the CPU accessed the memory.

I think we had kicked around the idea of giving dmabuf's a
prepare/finish ioctl quite some time back.  This is probably something
that should be at least a bit decoupled from fences.  (Possibly
'prepare' waits for dma access to complete, but not the other way
around.)
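
(A rough userspace sketch of the kind of prepare/finish pair being discussed.
DMA_BUF_IOC_PREPARE_ACCESS, DMA_BUF_IOC_FINISH_ACCESS and struct
dma_buf_access are hypothetical names used only for illustration; no such
dma-buf ioctls are defined by this thread.)

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/ioctl.h>

    /* Hypothetical ioctl interface, for illustration only. */
    struct dma_buf_access {
        uint32_t flags;                 /* e.g. read and/or write */
    };
    #define DMA_BUF_IOC_PREPARE_ACCESS _IOW('Z', 0, struct dma_buf_access)
    #define DMA_BUF_IOC_FINISH_ACCESS  _IOW('Z', 1, struct dma_buf_access)

    static int cpu_fill(int dmabuf_fd, void *map, size_t len)
    {
        struct dma_buf_access acc = { .flags = 0x2 /* write */ };

        /* "prepare": wait for pending DMA, do any cache invalidation */
        if (ioctl(dmabuf_fd, DMA_BUF_IOC_PREPARE_ACCESS, &acc) < 0)
            return -1;

        memset(map, 0, len);            /* plain CPU access to the mapping */

        /* "finish": flush CPU caches so a later DMA sees the data */
        return ioctl(dmabuf_fd, DMA_BUF_IOC_FINISH_ACCESS, &acc);
    }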

And I did implement in omapdrm support for simulating coherency via
page fault-in / shoot-down..  It is one option that makes it
completely transparent to userspace, although there is some
performance cost, so I suppose it depends a bit on your use-case.

 And second, the user process has to do many things when using memory shared
 with a DMA device. The user process has to understand how the DMA device is
 operated and when the interfaces for controlling the DMA device are called.
 Such things make the user application quite complicated.

 And third, it's a performance optimization for multimedia and graphics
 devices. As I mentioned already, we should consider the sequential processing
 imposed by buffer sharing between the CPU and a DMA device: the CPU has to
 stay idle until the DMA device has completed, and vice versa.

 That is why I proposed such user interfaces. Of course, these interfaces
 might still be ugly: Maarten already pointed this out and I agree with him.
 But there must be a better way. Don't you think we need something similar?
 With such interfaces, cache control and buffer synchronization can be
 performed at the kernel level. Moreover, the user application doesn't need
 to consider DMA device control anymore. Therefore, one thread can access a
 shared buffer while the other controls the DMA device with the shared buffer
 in parallel. We can really make the best use of CPU and DMA idle time. In
 other words, we can really make the best use of a multitasking OS, Linux.

 So could you please tell me whether there is any reason not to use a public
 API for this? I think we can add and use a public API if NECESSARY.

well, for cache management, I think it is a better idea.. I didn't
really catch that this was the motivation from the initial patch, but
maybe I read it too quickly.  But cache can be decoupled from
synchronization, because CPU access is not asynchronous.  For
userspace/CPU access to buffer, you should:

  1) wait for buffer
  2) prepare-access
  3)  ... do whatever cpu access to buffer ...
  4) finish-access
  5) submit buffer for new dma-operation

I suppose you could combine the syscall for #1 and #2.. not sure if
that is a good idea or not.  But you don't need to.  And there is
never really any need for userspace to signal a fence.
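
(Put together, the five steps above for a CPU write might look like the
sketch below. It assumes the dmabuf fd supports poll() for step 1, and reuses
the hypothetical prepare/finish ioctls from the earlier sketch for steps 2
and 4; step 5 goes through the driver's own submit ioctl and is not shown.)

    #include <poll.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/ioctl.h>

    /* Same hypothetical ioctls as in the earlier sketch. */
    struct dma_buf_access { uint32_t flags; };
    #define DMA_BUF_IOC_PREPARE_ACCESS _IOW('Z', 0, struct dma_buf_access)
    #define DMA_BUF_IOC_FINISH_ACCESS  _IOW('Z', 1, struct dma_buf_access)

    static int cpu_write_buffer(int dmabuf_fd, void *map, size_t len)
    {
        struct pollfd pfd = { .fd = dmabuf_fd, .events = POLLOUT };
        struct dma_buf_access acc = { .flags = 0x2 /* write */ };

        /* 1) wait for buffer: block until no DMA is still using it */
        if (poll(&pfd, 1, -1) < 0)
            return -1;

        /* 2) prepare-access (hypothetical ioctl) */
        if (ioctl(dmabuf_fd, DMA_BUF_IOC_PREPARE_ACCESS, &acc) < 0)
            return -1;

        /* 3) plain CPU access to the mapping */
        memset(map, 0xff, len);

        /* 4) finish-access (hypothetical ioctl) */
        if (ioctl(dmabuf_fd, DMA_BUF_IOC_FINISH_ACCESS, &acc) < 0)
            return -1;

        /* 5) submit the buffer for a new DMA operation through the
         *    driver's own command-submission ioctl (driver specific,
         *    not shown here). */
        return 0;
    }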

BR,
-R

 Thanks,
 Inki Dae


 BR,
 -R




Re: Introduce a new helper framework for buffer synchronization

2013-05-13 Thread Tomasz Figa
Hi,

On Monday 13 of May 2013 20:24:01 Inki Dae wrote:
  -Original Message-
  From: Maarten Lankhorst [mailto:maarten.lankho...@canonical.com]
  Sent: Monday, May 13, 2013 6:52 PM
  To: Inki Dae
  Cc: 'Rob Clark'; 'Daniel Vetter'; 'DRI mailing list'; linux-arm-
  ker...@lists.infradead.org; linux-media@vger.kernel.org;
  'linux-fbdev';
  'Kyungmin Park'; 'myungjoo.ham'; 'YoungJun Cho'
  Subject: Re: Introduce a new helper framework for buffer
  synchronization 
   On 13-05-13 11:21, Inki Dae wrote:
   -Original Message-
   From: Maarten Lankhorst [mailto:maarten.lankho...@canonical.com]
   Sent: Monday, May 13, 2013 5:01 PM
   To: Inki Dae
   Cc: Rob Clark; Daniel Vetter; DRI mailing list; linux-arm-
   ker...@lists.infradead.org; linux-media@vger.kernel.org;
   linux-fbdev;
   Kyungmin Park; myungjoo.ham; YoungJun Cho
   Subject: Re: Introduce a new helper framework for buffer
  
  synchronization
  
   On 09-05-13 09:33, Inki Dae wrote:
    Hi all,
    
    This post introduces a new helper framework based on dma fence. And the
    purpose of this post is to collect other opinions and advice before an
    RFC posting.
    
    First of all, this helper framework, called fence helper, is still in
    progress, so it might not have enough comments in the code and might
    need more cleanup. Moreover, we might be missing some parts of the dma
    fence.
    
    However, I'd like to say that everything mentioned below has been tested
    on a Linux platform and worked well.
   
   
    And a tutorial for the user process:
    
    just before cpu access
    
    struct dma_buf_fence *df;
    df->type = DMA_BUF_ACCESS_READ or DMA_BUF_ACCESS_WRITE;
    ioctl(fd, DMA_BUF_GET_FENCE, df);
    
    after memset or memcpy
    
    ioctl(fd, DMA_BUF_PUT_FENCE, df);
   
   NAK.
   
    Userspace doesn't need to trigger fences. It can do a buffer idle wait,
    and postpone submitting new commands until after it's done using the
    buffer.
   
   Hi Maarten,
   
    It seems that you are saying the user should wait for a buffer like KDS
    does: KDS uses select() to postpone submitting new commands. But I think
    this approach assumes that all data flows from a DMA device to the CPU.
    For example, the CPU has to keep polling for the completion of a buffer
    access by the DMA device. This means that this approach doesn't cover
    data flow in the opposite direction: from the CPU to a DMA device.
  
   Not really. You do both things the same way. You first wait for the bo
   to be idle; this could be implemented by adding poll support to the
   dma-buf fd. Then you either do your read or write. Since userspace is
   supposed to be the one controlling the bo, it should stay idle at that
   point. If you have another thread queueing the buffer again before your
   thread is done, that's a bug in the application, and can be solved with
   userspace locking primitives. No need for the kernel to get involved.
 
  Yes, that is how we have synchronized buffers between the CPU and a DMA
  device until now, without a buffer synchronization mechanism. I thought it
  best that the user not have to consider anything: the user can access a
  buffer regardless of which DMA device is controlling it, and the buffer
  synchronization is performed at the kernel level. Moreover, I think we
  could improve graphics and multimedia hardware performance because the
  hardware can do more work: one thread accesses a shared buffer and the
  other controls a DMA device with the shared buffer in parallel.

Could you explain this point? I thought that if there is a shared buffer
accessible to the user and a DMA device, only one of them can use it at a
time, i.e. the buffer is useless for the reading user (or reading DMA)
until the (writing) DMA (or writing user) finishes writing to it. Is that
incorrect? Or is this not the point here?

I'm not an expert here and I'm just trying to understand the idea, so 
correct me if I'm wrong. Thanks in advance.

Best regards,
Tomasz

 Thus, we could avoid
 sequential processing, and that is my intention. Don't you think we could
 improve hardware utilization this way, or in some other way? Of course,
 there could be a better way.
 
   Kernel space doesn't need the root hole you created by dereferencing a
   pointer passed from userspace.
   Your next exercise should be to write a security exploit from the api
   you created here. It's the only way to learn how to write safe code.
   Hint:
   df.ctx = mmap(..);
   
   Also, I'm not sure about our approach yet, and that is why I posted. As
   you mentioned, it seems that using mmap() is safer. But there is one
   issue that confuses me. For your hint, df.ctx = mmap(..), the issue is
   that dmabuf mmap can be used to map a dmabuf with user space

RE: Introduce a new helper framework for buffer synchronization

2013-05-13 Thread Inki Dae


 -Original Message-
 From: Rob Clark [mailto:robdcl...@gmail.com]
 Sent: Tuesday, May 14, 2013 2:58 AM
 To: Inki Dae
 Cc: linux-fbdev; DRI mailing list; Kyungmin Park; myungjoo.ham; YoungJun
 Cho; linux-arm-ker...@lists.infradead.org; linux-media@vger.kernel.org
 Subject: Re: Introduce a new helper framework for buffer synchronization
 
 On Mon, May 13, 2013 at 1:18 PM, Inki Dae inki@samsung.com wrote:
 
 
  2013/5/13 Rob Clark robdcl...@gmail.com
 
  On Mon, May 13, 2013 at 8:21 AM, Inki Dae inki@samsung.com wrote:
  
   In that case you still wouldn't give userspace control over the
 fences.
   I
   don't see any way that can end well.
   What if userspace never signals? What if userspace gets killed by
 oom
   killer. Who keeps track of that?
  
  
   In all cases, all kernel resources to user fence will be released by
   kernel
   once the fence is timed out: never signaling and process killing by
 oom
   killer makes the fence timed out. And if we use mmap mechanism you
   mentioned
   before, I think user resource could also be freed properly.
 
 
  I tend to agree w/ Maarten here.. there is no good reason for
  userspace to be *signaling* fences.  The exception might be some blob
  gpu drivers which don't have enough knowledge in the kernel to figure
  out what to do.  (In which case you can add driver private ioctls for
  that.. still not the right thing to do but at least you don't make a
  public API out of it.)
 
 
  Please do not care whether those are generic or not. Let's see the
 following
  three things. First, it's cache operation. As you know, ARM SoC has ACP
  (Accelerator Coherency Port) and can be connected to DMA engine or
 similar
  devices. And this port is used for cache coherency between CPU cache and
 DMA
  device. However, most devices on ARM based embedded systems don't use
 the
  ACP port. So they need proper cache operation before and after of DMA or
 CPU
  access in case of using cacheable mapping. Actually, I see many Linux
 based
  platforms call cache control interfaces directly for that. I think the
  reason, they do so, is that kernel isn't aware of when and how CPU
 accessed
  memory.
 
 I think we had kicked around the idea of giving dmabuf's a
 prepare/finish ioctl quite some time back.  This is probably something
 that should be at least a bit decoupled from fences.  (Possibly
 'prepare' waits for dma access to complete, but not the other way
 around.)
 
 And I did implement in omapdrm support for simulating coherency via
 page fault-in / shoot-down..  It is one option that makes it
 completely transparent to userspace, although there is some
 performance cost, so I suppose it depends a bit on your use-case.
 
  And second, user process has to do so many things in case of using
 shared
  memory with DMA device. User process should understand how DMA device is
  operated and when interfaces for controlling the DMA device are called.
 Such
  things would make user application so complicated.
 
  And third, it's performance optimization to multimedia and graphics
 devices.
  As I mentioned already, we should consider sequential processing for
 buffer
  sharing between CPU and DMA device. This means that CPU should stay with
  idle until DMA device is completed and vice versa.
 
  That is why I proposed such user interfaces. Of course, these interfaces
  might be so ugly yet: for this, Maarten pointed already out and I agree
 with
  him. But there must be another better way. Aren't you think we need
 similar
  thing? With such interfaces, cache control and buffer synchronization
 can be
  performed in kernel level. Moreover, user application doesn't need to
  consider DMA device controlling anymore. Therefore, one thread can
 access a
  shared buffer and the other can control DMA device with the shared
 buffer in
  parallel. We can really make the best use of CPU and DMA idle time. In
 other
  words, we can really make the best use of multi tasking OS, Linux.
 
  So could you please tell me about that there is any reason we don't use
  public API for it? I think we can add and use public API if NECESSARY.
 
 well, for cache management, I think it is a better idea.. I didn't
 really catch that this was the motivation from the initial patch, but
 maybe I read it too quickly.  But cache can be decoupled from
 synchronization, because CPU access is not asynchronous.  For
 userspace/CPU access to buffer, you should:
 
   1) wait for buffer
   2) prepare-access
   3)  ... do whatever cpu access to buffer ...
   4) finish-access
   5) submit buffer for new dma-operation
 


For data flow from the CPU to a DMA device:
1) wait for buffer
2) prepare-access (dma_buf_begin_cpu_access)
3) cpu access to buffer


For data flow from a DMA device to the CPU:
1) wait for buffer
2) finish-access (dma_buf_end_cpu_access)
3) dma access to buffer

1) and 2) are coupled with one function: we have implemented
fence_helper_commit_reserve() for it.

Cache control(cache clean or cache