[Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path

2013-11-26 Thread Ben Widawsky
If we end up calling the shrinker, which in turn invokes the OOM
killer, we may end up waiting forever for a process to die, depending
on which one the OOM killer chooses. The case that this prevents
occurs in execbuf. The forked variants of gem_evict_everything are a
good way to hit it. This is exacerbated by Daniel's recent patch to
give OOM precedence to the GEM tests.

It's a twisted form of a deadlock.

What occurs is the following (assume just 2 procs):
1. proc A gets to execbuf while out of memory, takes struct_mutex.
2. OOM killer comes in and chooses proc B.
3. proc B closes its fds, which requires struct_mutex, so it blocks.
4. OOM killer waits for B to die before killing another process (this
part is speculative).

Cc: Daniel Vetter daniel.vet...@ffwll.ch
Cc: Chris Wilson ch...@chris-wilson.co.uk
Signed-off-by: Ben Widawsky b...@bwidawsk.net
---
 drivers/gpu/drm/i915/i915_gem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fb2d548..a60894d 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1842,12 +1842,12 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
 	BUG_ON(obj->base.read_domains & I915_GEM_GPU_DOMAINS);
 	BUG_ON(obj->base.write_domain & I915_GEM_GPU_DOMAINS);
 
-	st = kmalloc(sizeof(*st), GFP_KERNEL);
+	st = kmalloc(sizeof(*st), GFP_NOWAIT);
 	if (st == NULL)
 		return -ENOMEM;
 
 	page_count = obj->base.size / PAGE_SIZE;
-	if (sg_alloc_table(st, page_count, GFP_KERNEL)) {
+	if (sg_alloc_table(st, page_count, GFP_NOWAIT)) {
 		kfree(st);
 		return -ENOMEM;
 	}
-- 
1.8.4.2

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path

2013-11-26 Thread Ben Widawsky
On Tue, Nov 26, 2013 at 04:55:50PM -0800, Ben Widawsky wrote:
 If we end up calling the shrinker, which in turn invokes the OOM
 killer, we may end up waiting forever for a process to die, depending
 on which one the OOM killer chooses. The case that this prevents
 occurs in execbuf. The forked variants of gem_evict_everything are a
 good way to hit it. This is exacerbated by Daniel's recent patch to
 give OOM precedence to the GEM tests.
 
 It's a twisted form of a deadlock.
 
 What occurs is the following (assume just 2 procs):
 1. proc A gets to execbuf while out of memory, takes struct_mutex.
 2. OOM killer comes in and chooses proc B.
 3. proc B closes its fds, which requires struct_mutex, so it blocks.
 4. OOM killer waits for B to die before killing another process (this
 part is speculative).
 

It appears that by itself, this patch is insufficient to prevent the
failure when run in piglit. Before I go on a wild goose chase of finding
all allocations, maybe I'll give people a chance to chime in. The
symptom is always the same: OOM, and procs can't die because
struct_mutex is held.

 Cc: Daniel Vetter daniel.vet...@ffwll.ch
 Cc: Chris Wilson ch...@chris-wilson.co.uk
 Signed-off-by: Ben Widawsky b...@bwidawsk.net
 ---
  drivers/gpu/drm/i915/i915_gem.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
 index fb2d548..a60894d 100644
 --- a/drivers/gpu/drm/i915/i915_gem.c
 +++ b/drivers/gpu/drm/i915/i915_gem.c
 @@ -1842,12 +1842,12 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
  	BUG_ON(obj->base.read_domains & I915_GEM_GPU_DOMAINS);
  	BUG_ON(obj->base.write_domain & I915_GEM_GPU_DOMAINS);
  
 -	st = kmalloc(sizeof(*st), GFP_KERNEL);
 +	st = kmalloc(sizeof(*st), GFP_NOWAIT);
  	if (st == NULL)
  		return -ENOMEM;
  
  	page_count = obj->base.size / PAGE_SIZE;
 -	if (sg_alloc_table(st, page_count, GFP_KERNEL)) {
 +	if (sg_alloc_table(st, page_count, GFP_NOWAIT)) {
  		kfree(st);
  		return -ENOMEM;
  	}
 -- 
 1.8.4.2
 

-- 
Ben Widawsky, Intel Open Source Technology Center


Re: [Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path

2013-11-26 Thread Ben Widawsky
On Tue, Nov 26, 2013 at 05:10:38PM -0800, Ben Widawsky wrote:
 On Tue, Nov 26, 2013 at 04:55:50PM -0800, Ben Widawsky wrote:
  If we end up calling the shrinker, which in turn invokes the OOM
  killer, we may end up waiting forever for a process to die, depending
  on which one the OOM killer chooses. The case that this prevents
  occurs in execbuf. The forked variants of gem_evict_everything are a
  good way to hit it. This is exacerbated by Daniel's recent patch to
  give OOM precedence to the GEM tests.
  
  It's a twisted form of a deadlock.
  
  What occurs is the following (assume just 2 procs):
  1. proc A gets to execbuf while out of memory, takes struct_mutex.
  2. OOM killer comes in and chooses proc B.
  3. proc B closes its fds, which requires struct_mutex, so it blocks.
  4. OOM killer waits for B to die before killing another process (this
  part is speculative).
  
 
 It appears that by itself, this patch is insufficient to prevent the
 failure when run in piglit. Before I go on a wild goose chase of finding
 all allocations, maybe I'll give people a chance to chime in. The
 symptom is always the same: OOM, and procs can't die because
 struct_mutex is held.
 

Too late on the goose chase: the vma allocation was the missing bit.

  Cc: Daniel Vetter daniel.vet...@ffwll.ch
  Cc: Chris Wilson ch...@chris-wilson.co.uk
  Signed-off-by: Ben Widawsky b...@bwidawsk.net
  ---
   drivers/gpu/drm/i915/i915_gem.c | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)
  
  diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
  index fb2d548..a60894d 100644
  --- a/drivers/gpu/drm/i915/i915_gem.c
  +++ b/drivers/gpu/drm/i915/i915_gem.c
  @@ -1842,12 +1842,12 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
   	BUG_ON(obj->base.read_domains & I915_GEM_GPU_DOMAINS);
   	BUG_ON(obj->base.write_domain & I915_GEM_GPU_DOMAINS);
   
  -	st = kmalloc(sizeof(*st), GFP_KERNEL);
  +	st = kmalloc(sizeof(*st), GFP_NOWAIT);
   	if (st == NULL)
   		return -ENOMEM;
   
   	page_count = obj->base.size / PAGE_SIZE;
  -	if (sg_alloc_table(st, page_count, GFP_KERNEL)) {
  +	if (sg_alloc_table(st, page_count, GFP_NOWAIT)) {
   		kfree(st);
   		return -ENOMEM;
   	}
  -- 
  1.8.4.2
  
 
 -- 
 Ben Widawsky, Intel Open Source Technology Center

-- 
Ben Widawsky, Intel Open Source Technology Center


Re: [Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path

2013-11-26 Thread Ben Widawsky
On Tue, Nov 26, 2013 at 04:55:50PM -0800, Ben Widawsky wrote:
 If we end up calling the shrinker, which in turn invokes the OOM
 killer, we may end up waiting forever for a process to die, depending
 on which one the OOM killer chooses. The case that this prevents
 occurs in execbuf. The forked variants of gem_evict_everything are a
 good way to hit it. This is exacerbated by Daniel's recent patch to
 give OOM precedence to the GEM tests.
 
 It's a twisted form of a deadlock.
 
 What occurs is the following (assume just 2 procs):
 1. proc A gets to execbuf while out of memory, takes struct_mutex.
 2. OOM killer comes in and chooses proc B.
 3. proc B closes its fds, which requires struct_mutex, so it blocks.
 4. OOM killer waits for B to die before killing another process (this
 part is speculative).
 
 Cc: Daniel Vetter daniel.vet...@ffwll.ch
 Cc: Chris Wilson ch...@chris-wilson.co.uk
 Signed-off-by: Ben Widawsky b...@bwidawsk.net

I'd still like to know if I am crazy, but I'm now trying to defer the
stuff we do on file close without using any allocs. Just an update...

-- 
Ben Widawsky, Intel Open Source Technology Center


Re: [Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path

2013-11-26 Thread Daniel Vetter
On Wed, Nov 27, 2013 at 5:23 AM, Ben Widawsky b...@bwidawsk.net wrote:
 On Tue, Nov 26, 2013 at 04:55:50PM -0800, Ben Widawsky wrote:
 If we end up calling the shrinker, which in turn invokes the OOM
 killer, we may end up waiting forever for a process to die, depending
 on which one the OOM killer chooses. The case that this prevents
 occurs in execbuf. The forked variants of gem_evict_everything are a
 good way to hit it. This is exacerbated by Daniel's recent patch to
 give OOM precedence to the GEM tests.

 It's a twisted form of a deadlock.

 What occurs is the following (assume just 2 procs):
 1. proc A gets to execbuf while out of memory, takes struct_mutex.
 2. OOM killer comes in and chooses proc B.
 3. proc B closes its fds, which requires struct_mutex, so it blocks.
 4. OOM killer waits for B to die before killing another process (this
 part is speculative).

 Cc: Daniel Vetter daniel.vet...@ffwll.ch
 Cc: Chris Wilson ch...@chris-wilson.co.uk
 Signed-off-by: Ben Widawsky b...@bwidawsk.net

 I'd still like to know if I am crazy, but I'm now trying to defer the
 stuff we do on file close without using any allocs. Just an update...

Sounds intriguing, but tbh I don't really have a clue about things.
What about adding the relevant stuck task backtraces to the patch and
submitting this to a wider audience (lkml, mm-devel) as an akpm-probe?
The more botched the patch, the better the probe usually.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH] drm/i915: Allocate atomically in execbuf path

2013-11-26 Thread Ben Widawsky
On Tue, Nov 26, 2013 at 08:23:46PM -0800, Ben Widawsky wrote:
 On Tue, Nov 26, 2013 at 04:55:50PM -0800, Ben Widawsky wrote:
  If we end up calling the shrinker, which in turn invokes the OOM
  killer, we may end up waiting forever for a process to die, depending
  on which one the OOM killer chooses. The case that this prevents
  occurs in execbuf. The forked variants of gem_evict_everything are a
  good way to hit it. This is exacerbated by Daniel's recent patch to
  give OOM precedence to the GEM tests.
  
  It's a twisted form of a deadlock.
  
  What occurs is the following (assume just 2 procs):
  1. proc A gets to execbuf while out of memory, takes struct_mutex.
  2. OOM killer comes in and chooses proc B.
  3. proc B closes its fds, which requires struct_mutex, so it blocks.
  4. OOM killer waits for B to die before killing another process (this
  part is speculative).
  
  Cc: Daniel Vetter daniel.vet...@ffwll.ch
  Cc: Chris Wilson ch...@chris-wilson.co.uk
  Signed-off-by: Ben Widawsky b...@bwidawsk.net
 
 I'd still like to know if I am crazy, but I'm now trying to defer the
 stuff we do on file close without using any allocs. Just an update...
 

The workqueue approach still has similar problems. It could be because
deferring the context cleanup means we don't actually free much space,
so the OOM kill isn't enough, or [more likely] something else is going
on.

Maybe it's my bug. I am really out of ideas at the moment. The system
just becomes unresponsive after all threads end up blocked waiting for
struct_mutex. I know I'd seen such problems in the past with
gem_evict_everything, but this time around I seem to be the sole cause
(and not all my machines hit it).

Sorry for the noise - just totally burning out on this one.

-- 
Ben Widawsky, Intel Open Source Technology Center