Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.
On Tue, 2014-05-06 at 13:12 +0000, Chris Wilson wrote:
> On Tue, May 06, 2014 at 12:59:37PM +0000, Gupta, Sourab wrote:
> > On Tue, 2014-05-06 at 11:34 +0000, Chris Wilson wrote:
> > > On Tue, May 06, 2014 at 04:40:58PM +0530, sourab.gu...@intel.com wrote:
> > > > From: Sourab Gupta sourab.gu...@intel.com
> > > >
> > > > This patch is in continuation of, and is dependent on, the earlier
> > > > patch series to 'reduce the time for which device mutex is kept
> > > > locked'.
> > > > (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)
> > > >
> > > > This patch aims to reduce the allocation time of pages from shmem by
> > > > using the blitter engine to clear freshly allocated pages. This is
> > > > done only for fresh pages allocated in the shmem_preallocate
> > > > routines, in the execbuffer path and the page-fault path.
> > > >
> > > > Even though the CPU memset routine is optimized, the time consumed
> > > > in clearing the pages of a large buffer can still be on the order of
> > > > milliseconds. We intend to make this operation asynchronous by using
> > > > the blitter engine, so that irrespective of the size of the buffer
> > > > to be cleared, the execbuffer ioctl processing time is not affected.
> > > > Use of the blitter engine makes the overall execution time of the
> > > > 'execbuffer' ioctl smaller for large buffers.
> > > >
> > > > There may be power implications in using the blitter engine here,
> > > > which we have to evaluate. As a next step, we can selectively enable
> > > > this HW-based memset only for large buffers, where the overhead of
> > > > adding commands to a blitter ring (which would otherwise be idle)
> > > > and of cross-ring synchronization is negligible compared to using
> > > > the CPU to clear out the buffer.
> > >
> > > You leave a lot of holes by which you leak the uncleared pages to
> > > userspace.
> > > -Chris
> >
> > Hi Chris,
> > Are you ok with the overall design as such, and the
> > shmem_read_mapping_page_gfp_noclear interface?
> > Is the leak of uncleared pages happening due to implementation issues?
> > If so, we'll try to mitigate them.
>
> Actually, along similar lines there is an even more fundamental issue.
> You should only clear the objects if the pages have not been
> prepopulated.
> -Chris

Hi Chris,
This patch is in continuation of the shmem preallocate patch sent by
Akash earlier.
(http://lists.freedesktop.org/archives/intel-gfx/2014-May/044597.html)
We employ this method only in the preallocate routine, which is called
on the first page fault of the object, resulting in fresh allocation of
pages. This is controlled by means of a flag, 'require_clear', which is
set in the preallocate routine (i.e. only in the case of a fresh
allocation). If pages are already populated for the object, this does
not come into the picture.
Also, we'll try to fix any leak of uncleared pages due to
implementation issues.

Regards,
Sourab
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.
On Tue, 2014-05-06 at 17:56 +0000, Eric Anholt wrote:
> sourab.gu...@intel.com writes:
> > From: Sourab Gupta sourab.gu...@intel.com
> >
> > This patch is in continuation of and is dependent on earlier patch
> > series to 'reduce the time for which device mutex is kept locked'.
> > (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)
>
> One of userspace's assumptions is that when you allocate a new BO, you
> can map it and start writing data into it without needing to wait on
> the GPU. I expect this patch to mostly hurt performance on apps (and I
> note that the patch doesn't come with any actual performance data) that
> get more stalls as a result.

Hi Eric,
Yes, it may hurt the performance of apps in the case of small buffers,
and when the blitter engine is busy, because there is a synchronous wait
for rendering in the gem_fault handler. If that is the case, we can drop
this from the gem_fault routine and employ it only in the do_execbuffer
routine. It's useful there because no synchronous wait is required in
SW, thanks to cross-ring synchronization.
We'll gather the numbers to quantify the performance benefit we get from
using the blitter engine in this way for different buffer sizes.

> More importantly, though, it breaks existing userspace that relies on
> buffers being idle on allocation, for the unsynchronized maps used in
> intel_bufferobj_subdata() and
> intel_bufferobj_map_range(GL_INVALIDATE_BUFFER_BIT |
> GL_UNSYNCHRONIZED_BIT)

Sorry, I think I'm missing your point here. This should not break that
assumption, because we employ this method only in the preallocate
routine, which is called on the first page fault of the object (in the
gem_fault handler), resulting in fresh allocation of pages. So, in the
case of unsynchronized maps, there may be a wait involved on the first
page fault only. Also, that wait time may be less than the time required
for a CPU memset (resulting in no performance hit). There won't be any
subsequent waits afterwards for that buffer object.

Though, we'll have a performance hit in the case where the blitter
engine is already busy and not available to immediately start the memset
of freshly allocated mmapped buffers.

Am I missing something here? Does the userspace requirement for
unsynchronized mapped objects require complete idleness of the object on
the GPU even when the object page-faults for the first time?

Regards,
Sourab
Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.
Gupta, Sourab sourab.gu...@intel.com writes:
> On Tue, 2014-05-06 at 17:56 +0000, Eric Anholt wrote:
> > One of userspace's assumptions is that when you allocate a new BO,
> > you can map it and start writing data into it without needing to
> > wait on the GPU. [...]
>
> [...] We employ this method only in the preallocate routine, which is
> called on the first page fault of the object (in the gem_fault
> handler), resulting in fresh allocation of pages. So, in the case of
> unsynchronized maps, there may be a wait involved on the first page
> fault only. [...]
>
> Am I missing something here? Does the userspace requirement for
> unsynchronized mapped objects require complete idleness of the object
> on the GPU even when the object page-faults for the first time?

Oh, I missed how this works. So at pagefault time, you're firing off the
blit, then immediately stalling on it?

This sounds even less like a possible performance win than I was
initially thinking.
[Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.
From: Sourab Gupta sourab.gu...@intel.com

This patch is in continuation of, and is dependent on, the earlier patch
series to 'reduce the time for which device mutex is kept locked'.
(http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)

This patch aims to reduce the allocation time of pages from shmem by
using the blitter engine to clear freshly allocated pages. This is done
only for fresh pages allocated in the shmem_preallocate routines, in the
execbuffer path and the page-fault path.

Even though the CPU memset routine is optimized, the time consumed in
clearing the pages of a large buffer can still be on the order of
milliseconds. We intend to make this operation asynchronous by using the
blitter engine, so that irrespective of the size of the buffer to be
cleared, the execbuffer ioctl processing time is not affected. Use of
the blitter engine makes the overall execution time of the 'execbuffer'
ioctl smaller for large buffers.

There may be power implications in using the blitter engine here, which
we have to evaluate. As a next step, we can selectively enable this
HW-based memset only for large buffers, where the overhead of adding
commands to a blitter ring (which would otherwise be idle) and of
cross-ring synchronization is negligible compared to using the CPU to
clear out the buffer.
Signed-off-by: Sourab Gupta sourab.gu...@intel.com
Signed-off-by: Akash Goel akash.g...@intel.com
---
 drivers/gpu/drm/i915/i915_drv.h            |  6 +++
 drivers/gpu/drm/i915/i915_gem.c            | 80 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 43 
 include/linux/shmem_fs.h                   |  2 +
 mm/shmem.c                                 | 44 +++-
 5 files changed, 173 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6dc579a..c3844da 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1596,6 +1596,10 @@ struct drm_i915_gem_object {
 	unsigned int has_aliasing_ppgtt_mapping:1;
 	unsigned int has_global_gtt_mapping:1;
 	unsigned int has_dma_mapping:1;
+	/*
+	 * Do the pages of object need to be cleared after shmem allocation
+	 */
+	unsigned int require_clear:1;
 
 	struct sg_table *pages;
 	int pages_pin_count;
@@ -2120,6 +2124,8 @@ int i915_gem_mmap_gtt(struct drm_file *file_priv,
 		      struct drm_device *dev,
 		      uint32_t handle, uint64_t *offset);
 void i915_gem_object_shmem_preallocate(struct drm_i915_gem_object *obj);
+int i915_add_clear_obj_cmd(struct drm_i915_gem_object *obj);
+int i915_gem_memset_obj_hw(struct drm_i915_gem_object *obj);
 
 /**
  * Returns true if seq1 is later than seq2.
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 867da2d..972695a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1376,6 +1376,7 @@ i915_gem_mmap_ioctl(struct drm_device *dev, void *data,
 	return 0;
 }
 
+
 /**
  * i915_gem_fault - fault a page into the GTT
  * vma: VMA in question
@@ -1436,6 +1437,12 @@ int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (ret)
 		goto unlock;
 
+	if (obj->require_clear) {
+		i915_gem_object_flush_cpu_write_domain(obj, false);
+		i915_gem_memset_obj_hw(obj);
+		obj->require_clear = false;
+	}
+
 	ret = i915_gem_object_set_to_gtt_domain(obj, write);
 	if (ret)
 		goto unpin;
@@ -1927,12 +1934,13 @@ i915_gem_object_shmem_preallocate(struct drm_i915_gem_object *obj)
 	/* Get the list of pages out of our struct file
 	 * Fail silently without starting the shrinker */
+	obj->require_clear = 1;
 	mapping = file_inode(obj->base.filp)->i_mapping;
 	gfp = mapping_gfp_mask(mapping);
 	gfp |= __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD;
 	gfp &= ~(__GFP_IO | __GFP_WAIT);
 	for (i = 0; i < page_count; i++) {
-		page = shmem_read_mapping_page_gfp(mapping, i, gfp);
+		page = shmem_read_mapping_page_gfp_noclear(mapping, i, gfp);
 		if (IS_ERR(page)) {
 			DRM_DEBUG_DRIVER("Failure for obj(%p), size(%x) at page(%d)\n",
 					 obj, obj->base.size, i);
@@ -2173,6 +2181,76 @@ i915_gem_object_move_to_inactive(struct drm_i915_gem_object *obj)
 	WARN_ON(i915_verify_lists(dev));
 }
 
+int i915_add_clear_obj_cmd(struct drm_i915_gem_object *obj)
+{
+	struct drm_i915_private *dev_priv = obj->base.dev->dev_private;
+	struct intel_ring_buffer *ring = dev_priv->ring[BCS];
+	u32 offset = i915_gem_obj_ggtt_offset(obj);
+	int ret;
+
+	ret = intel_ring_begin(ring, 6);
+	if (ret)
+		return ret;
+
+	intel_ring_emit(ring, (0x2 << 29) | (0x40 << 22) |
Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.
On Tue, May 06, 2014 at 04:40:58PM +0530, sourab.gu...@intel.com wrote:
> From: Sourab Gupta sourab.gu...@intel.com
>
> This patch is in continuation of and is dependent on earlier patch
> series to 'reduce the time for which device mutex is kept locked'.
> (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)
>
> This patch aims to reduce the allocation time of pages from shmem by
> using blitter engines for clearing the pages freshly alloced. This is
> utilized in case of fresh pages allocated in shmem_preallocate
> routines in execbuffer path and page_fault path only. [...]

You leave a lot of holes by which you leak the uncleared pages to
userspace.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre
Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.
On Tue, 2014-05-06 at 11:34 +0000, Chris Wilson wrote:
> On Tue, May 06, 2014 at 04:40:58PM +0530, sourab.gu...@intel.com wrote:
> > [...]
>
> You leave a lot of holes by which you leak the uncleared pages to
> userspace.
> -Chris

Hi Chris,
Are you ok with the overall design as such, and the
shmem_read_mapping_page_gfp_noclear interface?
Is the leak of uncleared pages happening due to implementation issues?
If so, we'll try to mitigate them.

Regards,
Sourab
Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.
On Tue, May 06, 2014 at 12:59:37PM +0000, Gupta, Sourab wrote:
> On Tue, 2014-05-06 at 11:34 +0000, Chris Wilson wrote:
> > [...]
> > You leave a lot of holes by which you leak the uncleared pages to
> > userspace.
>
> Hi Chris,
> Are you ok with the overall design as such, and the
> shmem_read_mapping_page_gfp_noclear interface?
> Is the leak of uncleared pages happening due to implementation issues?
> If so, we'll try to mitigate them.

Actually, along similar lines there is an even more fundamental issue.
You should only clear the objects if the pages have not been
prepopulated.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre
Re: [Intel-gfx] [RFC] drm/i915 : Reduce the shmem page allocation time by using blitter engines for clearing pages.
sourab.gu...@intel.com writes:
> From: Sourab Gupta sourab.gu...@intel.com
>
> This patch is in continuation of and is dependent on earlier patch
> series to 'reduce the time for which device mutex is kept locked'.
> (http://lists.freedesktop.org/archives/intel-gfx/2014-May/044596.html)

One of userspace's assumptions is that when you allocate a new BO, you
can map it and start writing data into it without needing to wait on the
GPU. I expect this patch to mostly hurt performance on apps (and I note
that the patch doesn't come with any actual performance data) that get
more stalls as a result.

More importantly, though, it breaks existing userspace that relies on
buffers being idle on allocation, for the unsynchronized maps used in
intel_bufferobj_subdata() and
intel_bufferobj_map_range(GL_INVALIDATE_BUFFER_BIT |
GL_UNSYNCHRONIZED_BIT)