Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.

2014-05-07 Thread Gupta, Sourab
On Tue, 2014-05-06 at 13:12 +, Chris Wilson wrote:
 On Tue, May 06, 2014 at 12:59:37PM +, Gupta, Sourab wrote:
  On Tue, 2014-05-06 at 11:34 +, Chris Wilson wrote:
   On Tue, May 06, 2014 at 04:40:58PM +0530, sourab.gu...@intel.com wrote:
From: Sourab Gupta sourab.gu...@intel.com

This patch is in continuation of and is dependent on earlier patch
series to 'reduce the time for which device mutex is kept locked'.
(http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)

This patch aims to reduce the allocation time of pages from shmem
by using the blitter engine to clear freshly allocated pages.
This is used only for the fresh pages allocated in the
shmem_preallocate routines in the execbuffer and page-fault paths.

Even though the CPU memset routine is optimized, clearing the pages
of a large buffer can still take on the order of milliseconds.
We intend to make this operation asynchronous by using the blitter
engine, so that the execbuffer ioctl processing time is not affected
by the size of the buffer to be cleared. Using the blitter engine
will reduce the overall execution time of the 'execbuffer' ioctl for
large buffers.

There may be power implications of using the blitter engine, and
we have to evaluate this. As a next step, we can selectively enable
this HW-based memset only for large buffers, where the overhead of
adding commands to the blitter ring (which would otherwise be idle)
and of cross-ring synchronization will be negligible compared to
using the CPU to clear the buffer.
   
   You leave a lot of holes by which you leak the uncleared pages to
   userspace.
   -Chris
   
  Hi Chris,
  
  Are you ok with the overall design as such, and the
  shmem_read_mapping_page_gfp_noclear interface?
  Is the leak of uncleared pages happening due to implementation issues?
  If so, we'll try to mitigate them.
 
 Actually, along similar lines there is an even more fundamental issue.
 You should only clear the objects if the pages have not been
 prepopulated.
 -Chris
 
Hi Chris,
This patch is in continuation of the shmem preallocate patch sent by
Akash earlier.
(http://lists.freedesktop.org/archives/intel-gfx/2014-May/044597.html)
We employ this method only in the preallocate routine, which is called
on the first page fault of the object and results in a fresh allocation
of pages.
This is controlled by a flag, 'require_clear', which is set in the
preallocate routine (and so only comes into play for a fresh
allocation). If the pages are already populated for the object, this
path is not taken.
Also, we'll try to fix any leak of uncleared pages caused by
implementation issues.
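
In other words, the intended flow is roughly the following (a simplified
sketch using the names from the patch, not the exact code):

    /* shmem_preallocate: runs only when the object's pages are freshly
     * allocated, so only then is a clear requested.
     */
    obj->require_clear = 1;
    page = shmem_read_mapping_page_gfp_noclear(mapping, i, gfp);

    /* Later, in the fault/execbuffer path: */
    if (obj->require_clear) {
            i915_gem_memset_obj_hw(obj);    /* queue the clear on the blitter ring */
            obj->require_clear = false;
    }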

Regards,
Sourab



Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.

2014-05-07 Thread Gupta, Sourab
On Tue, 2014-05-06 at 17:56 +, Eric Anholt wrote:
 sourab.gu...@intel.com writes:
 
  From: Sourab Gupta sourab.gu...@intel.com
 
  This patch is in continuation of and is dependent on earlier patch
  series to 'reduce the time for which device mutex is kept locked'.
  (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)
 
 One of userspace's assumptions is that when you allocate a new BO, you
 can map it and start writing data into it without needing to wait on the
 GPU.  I expect this patch to mostly hurt performance on apps (and I note
 that the patch doesn't come with any actual performance data) that get
 more stalls as a result.
 
Hi Eric,
Yes, it may hurt performance for apps in the case of small buffers, and
when the blitter engine is busy, since there is a synchronous wait for
rendering in the gem_fault handler. If that is the case, we can drop this
from the gem_fault routine and employ it only in the do_execbuffer routine.
It's useful there because no synchronous wait is required in software,
thanks to cross-ring synchronization.
We'll gather numbers to quantify the performance benefit of using the
blitter engine in this way for different buffer sizes.
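
Roughly, the execbuffer-only variant would look like this (just a sketch
of the idea; i915_gem_object_sync() here stands in for whatever
cross-ring synchronization the final code ends up using):

    /* Sketch: the clear is queued on the blitter ring (BCS) while the
     * execbuffer ioctl is processed, with no CPU-side wait.
     */
    if (obj->require_clear) {
            i915_gem_memset_obj_hw(obj);    /* queue the clear on BCS */
            obj->require_clear = false;
    }

    /* The existing ring-to-ring synchronization then makes the batch's
     * ring wait for the blit before the batch can touch the pages.
     */
    ret = i915_gem_object_sync(obj, ring);  /* ring = the ring running the batch */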

 More importantly, though, it breaks existing userspace that relies on
 buffers being idle on allocation, for the unsynchronized maps used in
 intel_bufferobj_subdata() and
 intel_bufferobj_map_range(GL_INVALIDATE_BUFFER_BIT |
 GL_UNSYNCHRONIZED_BIT)

Sorry, I may be missing your point here. This may not break that
assumption, because we employ this method only in the preallocate
routine, which is called on the first page fault of the object
(gem_fault handler) and results in a fresh allocation of pages.

So, in the case of unsynchronized maps, there may be a wait involved in
the first page fault. That wait time may also be less than the time
required for a CPU memset (resulting in no performance hit).
There won't be any subsequent waits for that buffer object.

We will, however, take a performance hit when the blitter engine is
already busy and is not available to immediately start the memset of
freshly allocated mmapped buffers.

Am I missing something here? Does the userspace requirement for
unsynchronized mapped objects demand that the object be completely idle
on the GPU even when it page-faults for the first time?

Regards,
Sourab



Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.

2014-05-07 Thread Eric Anholt
Gupta, Sourab sourab.gu...@intel.com writes:

 On Tue, 2014-05-06 at 17:56 +, Eric Anholt wrote:
 sourab.gu...@intel.com writes:
 
  From: Sourab Gupta sourab.gu...@intel.com
 
  This patch is in continuation of and is dependent on earlier patch
  series to 'reduce the time for which device mutex is kept locked'.
  (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)
 
 One of userspace's assumptions is that when you allocate a new BO, you
 can map it and start writing data into it without needing to wait on the
 GPU.  I expect this patch to mostly hurt performance on apps (and I note
 that the patch doesn't come with any actual performance data) that get
 more stalls as a result.
 
 Hi Eric,
 Yes, it may hurt performance for apps in the case of small buffers, and
 when the blitter engine is busy, since there is a synchronous wait for
 rendering in the gem_fault handler. If that is the case, we can drop this
 from the gem_fault routine and employ it only in the do_execbuffer routine.
 It's useful there because no synchronous wait is required in software,
 thanks to cross-ring synchronization.
 We'll gather numbers to quantify the performance benefit of using the
 blitter engine in this way for different buffer sizes.

 More importantly, though, it breaks existing userspace that relies on
 buffers being idle on allocation, for the unsynchronized maps used in
 intel_bufferobj_subdata() and
 intel_bufferobj_map_range(GL_INVALIDATE_BUFFER_BIT |
 GL_UNSYNCHRONIZED_BIT)

 Sorry, I may be missing your point here. This may not break that
 assumption, because we employ this method only in the preallocate
 routine, which is called on the first page fault of the object
 (gem_fault handler) and results in a fresh allocation of pages.

 So, in the case of unsynchronized maps, there may be a wait involved in
 the first page fault. That wait time may also be less than the time
 required for a CPU memset (resulting in no performance hit).
 There won't be any subsequent waits for that buffer object.

 We will, however, take a performance hit when the blitter engine is
 already busy and is not available to immediately start the memset of
 freshly allocated mmapped buffers.

 Am I missing something here? Does the userspace requirement for
 unsynchronized mapped objects demand that the object be completely idle
 on the GPU even when it page-faults for the first time?

Oh, I missed how this works.  So at pagefault time, you're firing off the
blit, then immediately stalling on it?  This sounds even less like a
possible performance win than I was initially thinking.
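
(The sequence in question, from the patch's fault path, is roughly the
following sketch; the stall comes from i915_gem_object_set_to_gtt_domain(),
which waits for outstanding rendering on the object before the fault
returns to userspace:)

    if (obj->require_clear) {
            i915_gem_object_flush_cpu_write_domain(obj, false);
            i915_gem_memset_obj_hw(obj);            /* fire off the blit */
            obj->require_clear = false;
    }
    ret = i915_gem_object_set_to_gtt_domain(obj, write);   /* stalls on the blit */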





[Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.

2014-05-06 Thread sourab . gupta
From: Sourab Gupta sourab.gu...@intel.com

This patch is in continuation of and is dependent on earlier patch
series to 'reduce the time for which device mutex is kept locked'.
(http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)

This patch aims to reduce the allocation time of pages from shmem
by using the blitter engine to clear freshly allocated pages.
This is used only for the fresh pages allocated in the
shmem_preallocate routines in the execbuffer and page-fault paths.

Even though the CPU memset routine is optimized, clearing the pages
of a large buffer can still take on the order of milliseconds.
We intend to make this operation asynchronous by using the blitter
engine, so that the execbuffer ioctl processing time is not affected
by the size of the buffer to be cleared. Using the blitter engine
will reduce the overall execution time of the 'execbuffer' ioctl for
large buffers.

There may be power implications of using the blitter engine, and
we have to evaluate this. As a next step, we can selectively enable
this HW-based memset only for large buffers, where the overhead of
adding commands to the blitter ring (which would otherwise be idle)
and of cross-ring synchronization will be negligible compared to
using the CPU to clear the buffer.

Signed-off-by: Sourab Gupta sourab.gu...@intel.com
Signed-off-by: Akash Goel akash.g...@intel.com
---
 drivers/gpu/drm/i915/i915_drv.h|  6 +++
 drivers/gpu/drm/i915/i915_gem.c| 80 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 43 
 include/linux/shmem_fs.h   |  2 +
 mm/shmem.c | 44 +++-
 5 files changed, 173 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6dc579a..c3844da 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1596,6 +1596,10 @@ struct drm_i915_gem_object {
unsigned int has_aliasing_ppgtt_mapping:1;
unsigned int has_global_gtt_mapping:1;
unsigned int has_dma_mapping:1;
+   /*
+    * Do the pages of the object need to be cleared after shmem allocation
+    */
+   unsigned int require_clear:1;
 
struct sg_table *pages;
int pages_pin_count;
@@ -2120,6 +2124,8 @@ int i915_gem_mmap_gtt(struct drm_file *file_priv, struct drm_device *dev,
  uint32_t handle, uint64_t *offset);
 void
 i915_gem_object_shmem_preallocate(struct drm_i915_gem_object *obj);
+int i915_add_clear_obj_cmd(struct drm_i915_gem_object *obj);
+int i915_gem_memset_obj_hw(struct drm_i915_gem_object *obj);
 
 /**
  * Returns true if seq1 is later than seq2.
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 867da2d..972695a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1376,6 +1376,7 @@ i915_gem_mmap_ioctl(struct drm_device *dev, void *data,
return 0;
 }
 
+
 /**
  * i915_gem_fault - fault a page into the GTT
  * vma: VMA in question
@@ -1436,6 +1437,12 @@ int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret)
goto unlock;
 
+   if (obj->require_clear) {
+           i915_gem_object_flush_cpu_write_domain(obj, false);
+           i915_gem_memset_obj_hw(obj);
+           obj->require_clear = false;
+   }
+
ret = i915_gem_object_set_to_gtt_domain(obj, write);
if (ret)
goto unpin;
@@ -1927,12 +1934,13 @@ i915_gem_object_shmem_preallocate(struct drm_i915_gem_object *obj)
/* Get the list of pages out of our struct file
 * Fail silently without starting the shrinker
 */
+   obj->require_clear = 1;
mapping = file_inode(obj-base.filp)-i_mapping;
gfp = mapping_gfp_mask(mapping);
gfp |= __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD;
gfp &= ~(__GFP_IO | __GFP_WAIT);
for (i = 0; i < page_count; i++) {
-   page = shmem_read_mapping_page_gfp(mapping, i, gfp);
+   page = shmem_read_mapping_page_gfp_noclear(mapping, i, gfp);
if (IS_ERR(page)) {
DRM_DEBUG_DRIVER("Failure for obj(%p), size(%x) at page(%d)\n",
                 obj, obj->base.size, i);
@@ -2173,6 +2181,76 @@ i915_gem_object_move_to_inactive(struct drm_i915_gem_object *obj)
WARN_ON(i915_verify_lists(dev));
 }
 
+int i915_add_clear_obj_cmd(struct drm_i915_gem_object *obj)
+{
+   struct drm_i915_private *dev_priv = obj->base.dev->dev_private;
+   struct intel_ring_buffer *ring = &dev_priv->ring[BCS];
+   u32 offset = i915_gem_obj_ggtt_offset(obj);
+   int ret;
+
+   ret = intel_ring_begin(ring, 6);
+   if (ret)
+           return ret;
+
+   intel_ring_emit(ring, (0x2 << 29) | (0x40 << 22) |
+ 
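
For illustration, a fill of this kind is typically emitted as a six-dword
COLOR_BLT sequence along the following lines (a rough sketch only; the
macro names, ROP, pitch and fill values are assumptions, not the remainder
of the patch, though six dwords matches the intel_ring_begin(ring, 6) above):

    intel_ring_emit(ring, COLOR_BLT_CMD | BLT_WRITE_ALPHA | BLT_WRITE_RGB);
    intel_ring_emit(ring, BLT_DEPTH_32 | (0xF0 << 16) | 4096);          /* BR13: ROP, dest pitch */
    intel_ring_emit(ring, (obj->base.size >> PAGE_SHIFT) << 16 | 4096); /* rows x bytes per row */
    intel_ring_emit(ring, offset);                                      /* GGTT offset of the object */
    intel_ring_emit(ring, 0);                                           /* fill value: zero */
    intel_ring_emit(ring, MI_NOOP);
    intel_ring_advance(ring);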

Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.

2014-05-06 Thread Chris Wilson
On Tue, May 06, 2014 at 04:40:58PM +0530, sourab.gu...@intel.com wrote:
 From: Sourab Gupta sourab.gu...@intel.com
 
 This patch is in continuation of and is dependent on earlier patch
 series to 'reduce the time for which device mutex is kept locked'.
 (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)
 
 This patch aims to reduce the allocation time of pages from shmem
 by using the blitter engine to clear freshly allocated pages.
 This is used only for the fresh pages allocated in the
 shmem_preallocate routines in the execbuffer and page-fault paths.

 Even though the CPU memset routine is optimized, clearing the pages
 of a large buffer can still take on the order of milliseconds.
 We intend to make this operation asynchronous by using the blitter
 engine, so that the execbuffer ioctl processing time is not affected
 by the size of the buffer to be cleared. Using the blitter engine
 will reduce the overall execution time of the 'execbuffer' ioctl for
 large buffers.

 There may be power implications of using the blitter engine, and
 we have to evaluate this. As a next step, we can selectively enable
 this HW-based memset only for large buffers, where the overhead of
 adding commands to the blitter ring (which would otherwise be idle)
 and of cross-ring synchronization will be negligible compared to
 using the CPU to clear the buffer.

You leave a lot of holes by which you leak the uncleared pages to
userspace.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.

2014-05-06 Thread Gupta, Sourab
On Tue, 2014-05-06 at 11:34 +, Chris Wilson wrote:
 On Tue, May 06, 2014 at 04:40:58PM +0530, sourab.gu...@intel.com wrote:
  From: Sourab Gupta sourab.gu...@intel.com
  
  This patch is in continuation of and is dependent on earlier patch
  series to 'reduce the time for which device mutex is kept locked'.
  (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)
  
  This patch aims to reduce the allocation time of pages from shmem
  by using the blitter engine to clear freshly allocated pages.
  This is used only for the fresh pages allocated in the
  shmem_preallocate routines in the execbuffer and page-fault paths.

  Even though the CPU memset routine is optimized, clearing the pages
  of a large buffer can still take on the order of milliseconds.
  We intend to make this operation asynchronous by using the blitter
  engine, so that the execbuffer ioctl processing time is not affected
  by the size of the buffer to be cleared. Using the blitter engine
  will reduce the overall execution time of the 'execbuffer' ioctl for
  large buffers.

  There may be power implications of using the blitter engine, and
  we have to evaluate this. As a next step, we can selectively enable
  this HW-based memset only for large buffers, where the overhead of
  adding commands to the blitter ring (which would otherwise be idle)
  and of cross-ring synchronization will be negligible compared to
  using the CPU to clear the buffer.
 
 You leave a lot of holes by which you leak the uncleared pages to
 userspace.
 -Chris
 
Hi Chris,

Are you ok with the overall design as such, and the
shmem_read_mapping_page_gfp_noclear interface?
Is the leak of uncleared pages happening due to implementation issues?
If so, we'll try to mitigate them.

Regards,
Sourab



Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.

2014-05-06 Thread Chris Wilson
On Tue, May 06, 2014 at 12:59:37PM +, Gupta, Sourab wrote:
 On Tue, 2014-05-06 at 11:34 +, Chris Wilson wrote:
  On Tue, May 06, 2014 at 04:40:58PM +0530, sourab.gu...@intel.com wrote:
   From: Sourab Gupta sourab.gu...@intel.com
   
   This patch is in continuation of and is dependent on earlier patch
   series to 'reduce the time for which device mutex is kept locked'.
   (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)
   
   This patch aims to reduce the allocation time of pages from shmem
   by using the blitter engine to clear freshly allocated pages.
   This is used only for the fresh pages allocated in the
   shmem_preallocate routines in the execbuffer and page-fault paths.

   Even though the CPU memset routine is optimized, clearing the pages
   of a large buffer can still take on the order of milliseconds.
   We intend to make this operation asynchronous by using the blitter
   engine, so that the execbuffer ioctl processing time is not affected
   by the size of the buffer to be cleared. Using the blitter engine
   will reduce the overall execution time of the 'execbuffer' ioctl for
   large buffers.

   There may be power implications of using the blitter engine, and
   we have to evaluate this. As a next step, we can selectively enable
   this HW-based memset only for large buffers, where the overhead of
   adding commands to the blitter ring (which would otherwise be idle)
   and of cross-ring synchronization will be negligible compared to
   using the CPU to clear the buffer.
  
  You leave a lot of holes by which you leak the uncleared pages to
  userspace.
  -Chris
  
 Hi Chris,
 
 Are you ok with the overall design as such, and the
 shmem_read_mapping_page_gfp_noclear interface?
 Is the leak of uncleared pages happening due to implementation issues?
 If so, we'll try to mitigate them.

Actually, along similar lines there is an even more fundamental issue.
You should only clear the objects if the pages have not been
prepopulated.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.

2014-05-06 Thread Eric Anholt
sourab.gu...@intel.com writes:

 From: Sourab Gupta sourab.gu...@intel.com

 This patch is in continuation of and is dependent on earlier patch
 series to 'reduce the time for which device mutex is kept locked'.
 (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)

One of userspace's assumptions is that when you allocate a new BO, you
can map it and start writing data into it without needing to wait on the
GPU.  I expect this patch to mostly hurt performance on apps (and I note
that the patch doesn't come with any actual performance data) that get
more stalls as a result.

More importantly, though, it breaks existing userspace that relies on
buffers being idle on allocation, for the unsynchronized maps used in
intel_bufferobj_subdata() and
intel_bufferobj_map_range(GL_INVALIDATE_BUFFER_BIT |
GL_UNSYNCHRONIZED_BIT)
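
For context, the unsynchronized-map pattern referred to above looks
roughly like this on the application side (an illustrative sketch that
assumes a current GL context; the buffer, size and data here are made up,
not taken from this thread):

    GLuint buf;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_ARRAY_BUFFER, buf);

    /* Fresh storage: the driver allocates a new, never-used BO here. */
    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_DYNAMIC_DRAW);

    /* Unsynchronized map: the app writes immediately and expects no stall,
     * precisely because a freshly allocated BO has never been used by the GPU.
     */
    void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                                 GL_MAP_WRITE_BIT |
                                 GL_MAP_INVALIDATE_BUFFER_BIT |
                                 GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(ptr, data, size);
    glUnmapBuffer(GL_ARRAY_BUFFER);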

