[Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path
If we end up calling the shrinker, which in turn may invoke the OOM killer, we may end up waiting forever for the chosen process to die. The case that this patch prevents occurs in execbuf. The forked variants of gem_evict_everything are a good way to hit it. This is exacerbated by Daniel's recent patch to give OOM precedence to the GEM tests. It's a twisted form of a deadlock. What occurs is the following (assume just 2 procs):

1. proc A gets to execbuf while out of memory, takes struct_mutex.
2. The OOM killer comes in and chooses proc B.
3. proc B closes its fds, which requires struct_mutex, and blocks.
4. The OOM killer waits for B to die before killing another process (this part is speculative).

Cc: Daniel Vetter <daniel.vet...@ffwll.ch>
Cc: Chris Wilson <ch...@chris-wilson.co.uk>
Signed-off-by: Ben Widawsky <b...@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fb2d548..a60894d 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1842,12 +1842,12 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
 	BUG_ON(obj->base.read_domains & I915_GEM_GPU_DOMAINS);
 	BUG_ON(obj->base.write_domain & I915_GEM_GPU_DOMAINS);
 
-	st = kmalloc(sizeof(*st), GFP_KERNEL);
+	st = kmalloc(sizeof(*st), GFP_NOWAIT);
 	if (st == NULL)
 		return -ENOMEM;
 
 	page_count = obj->base.size / PAGE_SIZE;
-	if (sg_alloc_table(st, page_count, GFP_KERNEL)) {
+	if (sg_alloc_table(st, page_count, GFP_NOWAIT)) {
 		kfree(st);
 		return -ENOMEM;
 	}
-- 
1.8.4.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path
On Tue, Nov 26, 2013 at 04:55:50PM -0800, Ben Widawsky wrote:
> If we end up calling the shrinker, which in turn may invoke the OOM
> killer, we may end up waiting forever for the chosen process to die.
> The case that this patch prevents occurs in execbuf. The forked
> variants of gem_evict_everything are a good way to hit it. This is
> exacerbated by Daniel's recent patch to give OOM precedence to the
> GEM tests. It's a twisted form of a deadlock. What occurs is the
> following (assume just 2 procs):
>
> 1. proc A gets to execbuf while out of memory, takes struct_mutex.
> 2. The OOM killer comes in and chooses proc B.
> 3. proc B closes its fds, which requires struct_mutex, and blocks.
> 4. The OOM killer waits for B to die before killing another process
>    (this part is speculative).

It appears that by itself, this patch is insufficient to prevent the
failure when run in piglit. Before I go on a wild goose chase of
finding all the allocations, maybe I'll give people a chance to chime
in. The symptom is always the same: OOM, and procs can't die because
struct_mutex is held.

> [quoted patch trimmed]

-- 
Ben Widawsky, Intel Open Source Technology Center
Re: [Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path
On Tue, Nov 26, 2013 at 05:10:38PM -0800, Ben Widawsky wrote:
> On Tue, Nov 26, 2013 at 04:55:50PM -0800, Ben Widawsky wrote:
> > [commit message trimmed]
>
> It appears that by itself, this patch is insufficient to prevent the
> failure when run in piglit. Before I go on a wild goose chase of
> finding all the allocations, maybe I'll give people a chance to
> chime in. The symptom is always the same: OOM, and procs can't die
> because struct_mutex is held.

Too late on the goose chase: the vma allocation was the missing bit.

> [quoted patch trimmed]

-- 
Ben Widawsky, Intel Open Source Technology Center
Re: [Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path
On Tue, Nov 26, 2013 at 04:55:50PM -0800, Ben Widawsky wrote:
> [commit message trimmed]
>
> Cc: Daniel Vetter <daniel.vet...@ffwll.ch>
> Cc: Chris Wilson <ch...@chris-wilson.co.uk>
> Signed-off-by: Ben Widawsky <b...@bwidawsk.net>

I'd still like to know if I am crazy, but I'm now trying to defer the
stuff we do on file close without using any allocs. Just an update...

-- 
Ben Widawsky, Intel Open Source Technology Center
Re: [Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path
On Wed, Nov 27, 2013 at 5:23 AM, Ben Widawsky <b...@bwidawsk.net> wrote:
> [quoted thread trimmed]
>
> I'd still like to know if I am crazy, but I'm now trying to defer the
> stuff we do on file close without using any allocs. Just an update...

Sounds intriguing, but tbh I don't really have a clue about these
things. What about adding the relevant stuck-task backtraces to the
patch and submitting it to a wider audience (lkml, mm-devel) as an
akpm-probe? The more botched the patch, the better the probe usually.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Re: [Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path
On Tue, Nov 26, 2013 at 08:23:46PM -0800, Ben Widawsky wrote:
> [quoted thread trimmed]
>
> I'd still like to know if I am crazy, but I'm now trying to defer the
> stuff we do on file close without using any allocs. Just an update...

The workqueue still has similar problems. It could be because deferring
the context cleanup means we don't actually free much space, so the OOM
kill isn't enough, or [more likely] something else is going on. Maybe
it's my bug. I am really out of ideas at the moment. The system just
becomes unresponsive after all threads end up blocked waiting for
struct_mutex.

I know I'd seen such problems in the past with gem_evict_everything,
but this time around I seem to be the sole cause (and not all my
machines hit it). Sorry for the noise - just totally burning out on
this one.

-- 
Ben Widawsky, Intel Open Source Technology Center