Re: printk deadlock due to double lock attempt on current CPU's runqueue

2021-11-10 Thread Sultan Alsawaf
On Wed, Nov 10, 2021 at 11:13:37AM +0106, John Ogness wrote:
> Even after we introduce kthread printers, there will still be situations
> where direct printing is used: booting (before kthreads exist) and
> shutdown/suspend/crash situations, when the kthreads may not be
> active.

Although I'm unaware of any ongoing kthread printer work, I'm curious to know
how a kthread approach wouldn't employ a try_to_wake_up() from directly inside
printk(), since the try_to_wake_up() hit from inside the fbcon code is what
caused my deadlock.

Sultan


Re: printk deadlock due to double lock attempt on current CPU's runqueue

2021-11-10 Thread Sultan Alsawaf
On Wed, Nov 10, 2021 at 10:00:35AM +0100, Vincent Guittot wrote:
> Is it the same SCHED_WARN_ON(rq->tmp_alone_branch !=
> &rq->leaf_cfs_rq_list); that generates the deadlock on v5.15 too ?
> 
> one remaining tmp_alone_branch warning has been fixed in v5.15 with
> 2630cde26711 ("sched/fair: Add ancestors of unthrottled undecayed cfs_rq")

I should clarify that I didn't actually reproduce the issue on v5.15; I just saw
that the call chain leading to the deadlock still existed in v5.15 after looking
through the code.

Failing the SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); assert
is extremely rare in my experience, and I don't have a reproducer. It has only
happened once after months of heavy usage (with lots of reboots too, so not with
crazy high uptime).

Sultan


printk deadlock due to double lock attempt on current CPU's runqueue

2021-11-10 Thread Sultan Alsawaf
Hi,

I encountered a printk deadlock on 5.13 which appears to still affect the latest
kernel. The deadlock occurs due to printk being used while having the current
CPU's runqueue locked, and the underlying framebuffer console attempting to lock
the same runqueue when printk tries to flush the log buffer.

Here's the backtrace corresponding to the deadlock, with some notes I've added:
---8<---
  PID: 0  TASK: 888100d832c0  CPU: 11  COMMAND: "swapper/11"
  [exception RIP: native_halt+22]
  RIP: 8320eb97  RSP: c95b8540  RFLAGS: 0002
  RAX: 0001  RBX: 88821c534280  RCX: ed10438a6850
  RDX:   RSI: 0003  RDI: 88821c534280
  RBP: 0003   R8: 0001   R9: 81302973
  R10: 88821c534280  R11: ed10438a6850  R12: 88821c535080
  R13: 88825c0b32c0  R14:   R15: 88821c534280
  CS: 0010  SS: 0018
   #0 [c95b8540] kvm_wait at 81125cc1
   #1 [c95b8558] pv_wait_head_or_lock at 8130294a
   #2 [c95b8580] __pv_queued_spin_lock_slowpath at 81302d2e
   #3 [c95b8600] do_raw_spin_lock at 81303a33
   #4 [c95b8668] _raw_spin_lock at 8320f118 <-- ***DEADLOCK***
   #5 [c95b8680] ttwu_queue at 8125cff9 <-- tries to lock this CPU's runqueue again...
   #6 [c95b86f8] try_to_wake_up at 8125d778
   #7 [c95b8780] wake_up_process at 8125d924
   #8 [c95b8788] wake_up_worker at 8121268b
   #9 [c95b8798] insert_work at 812199aa
  #10 [c95b87d0] __queue_work at 8121dcde
  #11 [c95b8810] queue_work_on at 8121e1ca
  #12 [c95b8838] drm_fb_helper_damage at c079b0a0 [drm_kms_helper]
  #13 [c95b8878] drm_fb_helper_sys_imageblit at c079b613 [drm_kms_helper]
  #14 [c95b88a8] drm_fbdev_fb_imageblit at c079b9e1 [drm_kms_helper]
  #15 [c95b88c0] soft_cursor at 8236ffc9
  #16 [c95b8928] bit_cursor at 8236f207
  #17 [c95b8a50] fbcon_cursor at 82362c0f
  #18 [c95b8a88] hide_cursor at 8253be1c
  #19 [c95b8aa0] vt_console_print at 82543094
  #20 [c95b8af0] call_console_drivers at 81319b32
  #21 [c95b8b30] console_unlock at 8131bd50
  #22 [c95b8c68] vprintk_emit at 8131e0f5
  #23 [c95b8ca8] vprintk_default at 8131e4ff
  #24 [c95b8cb0] vprintk at 8131eee6
  #25 [c95b8cd0] printk at 81318ce0
  #26 [c95b8d78] __warn_printk at 811c7b9d
  #27 [c95b8e28] enqueue_task_fair at 8129774a  <-- SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
  #28 [c95b8ec0] activate_task at 8125625d
  #29 [c95b8ef0] ttwu_do_activate at 81257943
  #30 [c95b8f28] sched_ttwu_pending at 8125c71f <-- locks this CPU's runqueue
  #31 [c95b8fa0] flush_smp_call_function_queue at 813c6833
  #32 [c95b8fd8] generic_smp_call_function_single_interrupt at 813c7f58
  #33 [c95b8fe0] __sysvec_call_function_single at 810f1456
  #34 [c95b8ff0] sysvec_call_function_single at 831ec1bc
  ---  ---
  #35 [c919fda8] sysvec_call_function_single at 831ec1bc
  RIP: 831ed06e  RSP: ed10438a6a49  RFLAGS: 0001
  RAX: 888100d832c0  RBX:   RCX: 19233fd7
  RDX:   RSI: 888100d832c0  RDI: ed10438a6a49
  RBP: 831ec166   R8: dc00   R9: 
  R10: 83400e22  R11:   R12: 831ed83e
  R13:   R14: c919fde8  R15: 814d4d9d
  ORIG_RAX: 88821c53524b  CS: 0001  SS: ef073a2
  WARNING: possibly bogus exception frame
--->8---

The catalyst is that CONFIG_SCHED_DEBUG is enabled and the tmp_alone_branch
assertion fails (Peter, is this bad?). The current CPU's runqueue is locked when
the warning is printed, and then drm_fb_helper_damage() triggers the deadlock
when it tries to schedule a worker and thus attempts to re-lock the runqueue.

I'm not sure what the *correct* solution is here (don't use printk while having
a runqueue locked? don't use schedule_work() from the fbcon path? tell printk
to use one of its lock-less backends?), so I've cc'd all the relevant folks.

Thanks,
Sultan


Re: [PATCH v4] drm/i915: Synchronize active and retire callbacks

2020-04-22 Thread Sultan Alsawaf
On Tue, Apr 21, 2020 at 09:51:37AM +0300, Joonas Lahtinen wrote:
> Quoting Sultan Alsawaf (2020-04-20 19:15:14)
> > On Mon, Apr 20, 2020 at 11:21:42AM +0300, Joonas Lahtinen wrote:
> > > So it seems that the patch got pulled into v5.6 and has been backported
> > > to v5.5 but not v5.4.
> > 
> > You're right, that's my mistake.
> 
> Did applying the patch to v5.4 fix the issue at hand?

Of course the patch doesn't apply as-is because the code's been shuffled around,
but yes. The crashes are gone with that patch, and I don't have the motivation
to check if that patch is actually correct, so hurray, problem solved. I'm not
going to send the backport myself because I'll probably be ignored, so you can
go ahead and do that.

Sultan
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()

2020-04-22 Thread Sultan Alsawaf
On Tue, Apr 21, 2020 at 11:04:13AM +0300, Joonas Lahtinen wrote:
> Quoting Sultan Alsawaf (2020-04-20 18:42:16)
> > On Mon, Apr 20, 2020 at 12:02:39PM +0300, Joonas Lahtinen wrote:
> > > I think the the patch should be dropped for now before the issue is
> > > properly addressed. Either by backporting the mainline fixes or if
> > > those are too big and there indeed is a smaller alternative patch
> > > that is properly reviewed. But the above patch is not, at least yet.
> > 
> > Why should a fix for a bona-fide issue be dropped due to political
> > reasons? This doesn't make sense to me. This just hurts miserable i915
> > users even more. If my patch is going to be dropped, it should be
> > replaced by a different fix at the same time.
> 
> There's no politics involved. It's all about doing the due diligence
> that we're fixing upstream bugs, and we're fixing them in a way that
> does not cause regressions to other users.

Due diligence is the furthest thing from what I'd associate with i915. My
cherryview laptop doesn't work anymore in 5.6 (it gets many GPU hangs; a RedHat
engineer is working on fixing it according to the gitlab issue), my colleagues'
laptops with 4k screens still suffer from GPU hangs (with fully documented
gitlab reports that have gone completely unanswered), and I still get occasional
FIFO underrun messages in dmesg accompanying graphical glitches on my own
laptops. On top of that, PSR doesn't work correctly on any of my laptops, and
it's enabled by default in i915. But when I reported the issue on gitlab, I got
claims that my laptops' panels were bad, and that my reports were "not enough
reason to disable PSR by default." [1]

[1] https://gitlab.freedesktop.org/drm/intel/-/issues/425#note_434182

> Without being able to reproduce a bug against vanilla kernel, there's
> too high of a risk that the patch that was developed will only work
> on the downstream kernel it was developed for. That happens for the
> best of the developers, and that is exactly why the process is in
> place, to avoid human error. So no politics, just due diligence.

You could have reviewed the patch in the same amount of time it took you to
write this email. It's very simple and obvious. That's why this sounds more like
politics to me than "due diligence."

> > Also, the mainline fixes just *happen* to fix this deadlock by removing
> > the mutex lock from the path in question and creating multiple other bugs
> > in the process that had to be addressed with "Fixes:" commits. The
> > regression potential was too high to include those patches for a "stable"
> > kernel, so I made this patch which fixes the issue in the simplest way
> > possible.
> 
> The thing is that it may be that the patch fixes the exact issue you
> have at hand in the downstream kernel you are testing against. But
> in doing so it may as well break other usecases for other users of
> vanilla kernel. That is what we're trying to avoid.

I don't know of any usecase that needs a double mutex lock (except for drm code
that uses mutex_trylock_recursive of course), but alright.

> A patch that claims to fix a deadlock in upstream kernel should
> include that splat from upstream kernel, not a speculated chain.
> Again, this is just the regular due diligence, because we have
> made errors in the past. It is for those self-made errors we
> know not to merge fixes too quickly before we are able to
> reproduce the error and make sure it is gone.

I sent the splat in a previous email. And you lot are all making many errors in
the present, far more so than in the past because i915 used to work at some
point. The sheer number of Chris' commits with subsequent "Fixes:" for his
changes is mind numbing. For Ubuntu, we've had to backport a massive amount of
i915 fixes to our 5.4 kernel thus far, and the major bugs still aren't all
fixed (probably since those bugs still exist in Linus' tree). And we can't
backport all of the relevant fixes we find because they're either contained in
massive refactor commits or they only apply to i915 code that's been massively
refactored since the last kernel release.

> > We put this patch into Ubuntu now as well, because praying for a response
> > from i915 maintainers while the 20.04 release was on the horizon was not
> > an option.
> > 
> > > There is an another similar thread where there's jumping into
> > > conclusions and doing ad-hoc patches for already fixed issues:
> > > 
> > > https://lore.kernel.org/dri-devel/20200414144309.GB2082@sultan-box.localdomain/
> > 
> > Maybe this wouldn't have happened if I had received a proper response for
> > that issue on gitlab from the

Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()

2020-04-21 Thread Sultan Alsawaf
On Mon, Apr 20, 2020 at 12:02:39PM +0300, Joonas Lahtinen wrote:
> I think the the patch should be dropped for now before the issue is
> properly addressed. Either by backporting the mainline fixes or if
> those are too big and there indeed is a smaller alternative patch
> that is properly reviewed. But the above patch is not, at least yet.

Why should a fix for a bona-fide issue be dropped due to political reasons? This
doesn't make sense to me. This just hurts miserable i915 users even more. If my
patch is going to be dropped, it should be replaced by a different fix at the
same time.

Also, the mainline fixes just *happen* to fix this deadlock by removing the
mutex lock from the path in question and creating multiple other bugs in the
process that had to be addressed with "Fixes:" commits. The regression potential
was too high to include those patches for a "stable" kernel, so I made this
patch which fixes the issue in the simplest way possible. We put this patch into
Ubuntu now as well, because praying for a response from i915 maintainers while
the 20.04 release was on the horizon was not an option.

> There is an another similar thread where there's jumping into
> conclusions and doing ad-hoc patches for already fixed issues:
> 
> https://lore.kernel.org/dri-devel/20200414144309.GB2082@sultan-box.localdomain/

Maybe this wouldn't have happened if I had received a proper response for that
issue on gitlab from the get-go... Instead I got the run-around from Chris
claiming that it wasn't an i915 bug:

https://gitlab.freedesktop.org/drm/intel/issues/1599

> I appreciate enthusiasm to provide fixes to i915 but we should
> continue do the regular due diligence to make sure we're properly
> fixing bugs in upstream kernels. And when fixing them, to make
> sure we're not simply papering over them for a single use case.
> 
> It would be preferred to file a bug for the seen issues,
> describing how to reproduce them with vanilla upstream kernels:
> 
> https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs

gitlab.freedesktop.org/drm/intel is where bugs go to be neglected, as noted
above. I really see no reason to send anything there anymore, when the vast
majority of community-sourced bug reports go ignored. 

Sultan


Re: [PATCH v4] drm/i915: Synchronize active and retire callbacks

2020-04-21 Thread Sultan Alsawaf
On Mon, Apr 20, 2020 at 11:21:42AM +0300, Joonas Lahtinen wrote:
> So it seems that the patch got pulled into v5.6 and has been backported
> to v5.5 but not v5.4.

You're right, that's my mistake.

> In doing that zeroing of ring->vaddr is removed so the test to do mdelay(1)
> and "ring->vaddr = NULL;" is not correct.

I'm not so sure about this. Look at where `ring->vaddr` is assigned:
-8<-
	ret = i915_vma_pin(vma, 0, 0, flags);
	if (unlikely(ret))
		goto err_unpin;

	if (i915_vma_is_map_and_fenceable(vma))
		addr = (void __force *)i915_vma_pin_iomap(vma);
	else
		addr = i915_gem_object_pin_map(vma->obj,
					       i915_coherent_map_type(vma->vm->i915));
	if (IS_ERR(addr)) {
		ret = PTR_ERR(addr);
		goto err_ring;
	}

	i915_vma_make_unshrinkable(vma);

	/* Discard any unused bytes beyond that submitted to hw. */
	intel_ring_reset(ring, ring->emit);

	ring->vaddr = addr;
->8-

And then the converse of that is done *before* my reproducer patch does
`ring->vaddr = NULL;`:
-8<-
	i915_vma_unset_ggtt_write(vma);
	if (i915_vma_is_map_and_fenceable(vma))
		i915_vma_unpin_iomap(vma);
	else
		i915_gem_object_unpin_map(vma->obj);

	/* mdelay(1);
	ring->vaddr = NULL; */

	i915_vma_make_purgeable(vma);
	i915_vma_unpin(vma);
->8-

Isn't the value assigned to `ring->vaddr` trashed by those function calls above
where I've got the mdelay? If so, why would it be correct to let the stale value
sit in `ring->vaddr`?

My interpretation of the zeroing of ring->vaddr being removed by Chris was that
it was an unnecessary step before the ring was getting discarded anyway.

Sultan


Re: [PATCH v4] drm/i915: Synchronize active and retire callbacks

2020-04-20 Thread Sultan Alsawaf
Chris,

Could you please look at this in earnest? This is a real bug that crashes my
laptop without any kind of provocation. It is undeniably a bug in i915, and I've
clearly described it in my patch. If you don't like the patch, I'm open to any
suggestions you have for an alternative solution. My goal here is to make i915
better, but it's difficult when communication only goes one way.

Thanks,
Sultan


Re: [PATCH v4] drm/i915: Synchronize active and retire callbacks

2020-04-15 Thread Sultan Alsawaf
On Tue, Apr 14, 2020 at 09:23:56AM +0100, Chris Wilson wrote:
> Quoting Sultan Alsawaf (2020-04-14 07:13:12)
> > Chris,
> > 
> > Could you please take a look at this? This really is quite an important fix.
> 
> It's crazy. See a266bf420060 for a patch that should be applied to v5.4
> -Chris

What? a266bf420060 was part of 5.4.0-rc7, so it's already in 5.4. And if you
read the commit message, you would see that the problem in question affects
Linus' tree.

You can break i915 in 5.6 by just adding a small delay:

diff --git a/drivers/gpu/drm/i915/gt/intel_ring.c b/drivers/gpu/drm/i915/gt/intel_ring.c
index 6ff803f397c4..3a7968effdfd 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring.c
@@ -10,6 +10,7 @@
 #include "intel_engine.h"
 #include "intel_ring.h"
 #include "intel_timeline.h"
+#include <linux/delay.h>
 
 unsigned int intel_ring_update_space(struct intel_ring *ring)
 {
@@ -92,6 +93,9 @@ void intel_ring_unpin(struct intel_ring *ring)
 	else
 		i915_gem_object_unpin_map(vma->obj);
 
+	mdelay(1);
+	ring->vaddr = NULL;
+
 	i915_vma_make_purgeable(vma);
 	i915_vma_unpin(vma);
 }

This is how I reproduced the race in question. I can't even reach the greeter on
my laptop with this, because i915 dies before that.

Sultan


Re: [PATCH 1/1] drm/i915: Fix ref->mutex deadlock in i915_active_wait()

2020-04-15 Thread Sultan Alsawaf
On Tue, Apr 14, 2020 at 09:13:28AM +0100, Chris Wilson wrote:
> Quoting Sultan Alsawaf (2020-04-07 07:26:22)
> > From: Sultan Alsawaf 
> > 
> > The following deadlock exists in i915_active_wait() due to a double lock
> > on ref->mutex (call chain listed in order from top to bottom):
> >  i915_active_wait();
> >  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
> >  i915_active_request_retire();
> >  node_retire();
> >  active_retire();
> >  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> > 
> > Fix the deadlock by skipping the second ref->mutex lock when
> > active_retire() is called through i915_active_request_retire().
> > 
> > Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> > Cc:  # 5.4.x
> > Signed-off-by: Sultan Alsawaf 
> 
> Incorrect. 
> 
> You missed that it cannot retire from inside the wait due to the active
> reference held on the i915_active for the wait.
> 
> The only point it can enter retire from inside i915_active_wait() is via
> the terminal __active_retire() which releases the mutex in doing so.
> -Chris

The terminal __active_retire() and rbtree_postorder_for_each_entry_safe() loop
retire different objects, so this isn't true.

Sultan


Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()

2020-04-15 Thread Sultan Alsawaf
On Tue, Apr 14, 2020 at 09:15:07AM +0100, Chris Wilson wrote:
> The patch does not fix a deadlock. Greg, this patch is not a backport of
> a bugfix, why is it in stable?
> -Chris

Here's the deadlock this supposedly doesn't fix:
INFO: task kswapd0:178 blocked for more than 122 seconds.
  Tainted: G U5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0 D0   178  2 0x80004000
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 ? i915_request_wait+0x25b/0x370
 active_retire+0x26/0x30
 i915_active_wait+0xa3/0x1a0
 i915_vma_unbind+0xe2/0x1c0
 i915_gem_object_unbind+0x111/0x140
 i915_gem_shrink+0x21b/0x530
 i915_gem_shrinker_scan+0xfd/0x120
 do_shrink_slab+0x154/0x2c0
 shrink_slab+0xd0/0x2f0
 shrink_node+0xdf/0x420
 balance_pgdat+0x2e3/0x540
 kswapd+0x200/0x3c0
 ? __wake_up_common_lock+0xc0/0xc0
 kthread+0xfb/0x130
 ? balance_pgdat+0x540/0x540
 ? __kthread_parkme+0x60/0x60
 ret_from_fork+0x1f/0x40
INFO: task kworker/u32:5:222 blocked for more than 122 seconds.
  Tainted: G U5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u32:5   D0   222  2 0x80004000
Workqueue: i915 idle_work_handler
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 idle_work_handler+0x34/0x120
 process_one_work+0x1ea/0x3a0
 worker_thread+0x4d/0x3f0
 kthread+0xfb/0x130
 ? process_one_work+0x3a0/0x3a0
 ? __kthread_parkme+0x60/0x60
 ret_from_fork+0x1f/0x40
INFO: task mpv:1535 blocked for more than 122 seconds.
  Tainted: G U5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mpv D0  1535  1 0x
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 __i915_gem_free_objects+0x68/0x190
 i915_gem_create_ioctl+0x18/0x30
 ? i915_gem_dumb_create+0xa0/0xa0
 drm_ioctl_kernel+0xb2/0x100
 drm_ioctl+0x209/0x360
 ? i915_gem_dumb_create+0xa0/0xa0
 do_vfs_ioctl+0x43f/0x6c0
 ksys_ioctl+0x5e/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x4e/0x140
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb49f1b32eb
Code: Bad RIP value.
RSP: 002b:7ffef9eb0948 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 7ffef9eb09c0 RCX: 7fb49f1b32eb
RDX: 7ffef9eb09c0 RSI: c010645b RDI: 0008
RBP: c010645b R08: 55fdb80c1370 R09: 55fdb80c14e0
R10:  R11: 0246 R12: 7fb4781e56b0
R13: 0008 R14: 7fb4781e5560 R15: 7fb4781e56b0
INFO: task kswapd0:178 blocked for more than 245 seconds.
  Tainted: G U5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0 D0   178  2 0x80004000
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 ? i915_request_wait+0x25b/0x370
 active_retire+0x26/0x30
 i915_active_wait+0xa3/0x1a0
 i915_vma_unbind+0xe2/0x1c0
 i915_gem_object_unbind+0x111/0x140
 i915_gem_shrink+0x21b/0x530
 i915_gem_shrinker_scan+0xfd/0x120
 do_shrink_slab+0x154/0x2c0
 shrink_slab+0xd0/0x2f0
 shrink_node+0xdf/0x420
 balance_pgdat+0x2e3/0x540
 kswapd+0x200/0x3c0
 ? __wake_up_common_lock+0xc0/0xc0
 kthread+0xfb/0x130
 ? balance_pgdat+0x540/0x540
 ? __kthread_parkme+0x60/0x60
 ret_from_fork+0x1f/0x40
INFO: task kworker/u32:5:222 blocked for more than 245 seconds.
  Tainted: G U5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u32:5   D0   222  2 0x80004000
Workqueue: i915 idle_work_handler
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 idle_work_handler+0x34/0x120
 process_one_work+0x1ea/0x3a0
 worker_thread+0x4d/0x3f0
 kthread+0xfb/0x130
 ? process_one_work+0x3a0/0x3a0
 ? __kthread_parkme+0x60/0x60
 ret_from_fork+0x1f/0x40
INFO: task mpv:1535 blocked for more than 245 seconds.
  Tainted: G U5.4.28-00014-gd1e04f91d2c5 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mpv D0  1535  1 0x
Call Trace:
 ? __schedule+0x2f3/0x750
 schedule+0x39/0xa0
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.0+0x19b/0x500
 __i915_gem_free_objects+0x68/0x190
 i915_gem_create_ioctl+0x18/0x30
 ? i915_gem_dumb_create+0xa0/0xa0
 drm_ioctl_kernel+0xb2/0x100
 drm_ioctl+0x209/0x360
 ? i915_gem_dumb_create+0xa0/0xa0
 do_vfs_ioctl+0x43f/0x6c0
 ksys_ioctl+0x5e/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x4e/0x140
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb49f1b32eb

Re: [PATCH v4] drm/i915: Synchronize active and retire callbacks

2020-04-14 Thread Sultan Alsawaf
Chris,

Could you please take a look at this? This really is quite an important fix.

Thanks,
Sultan


Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()

2020-04-11 Thread Sultan Alsawaf
On Fri, Apr 10, 2020 at 11:08:38AM +0200, Greg KH wrote:
> On Tue, Apr 07, 2020 at 12:18:09AM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf 
> > 
> > The following deadlock exists in i915_active_wait() due to a double lock
> > on ref->mutex (call chain listed in order from top to bottom):
> >  i915_active_wait();
> >  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
> >  i915_active_request_retire();
> >  node_retire();
> >  active_retire();
> >  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> > 
> > Fix the deadlock by skipping the second ref->mutex lock when
> > active_retire() is called through i915_active_request_retire().
> > 
> > Note that this bug only affects 5.4 and has since been fixed in 5.5.
> > Normally, a backport of the fix from 5.5 would be in order, but the
> > patch set that fixes this deadlock involves massive changes that are
> > neither feasible nor desirable for backporting [1][2][3]. Therefore,
> > this small patch was made to address the deadlock specifically for 5.4.
> > 
> > [1] 274cbf20fd10 ("drm/i915: Push the i915_active.retire into a worker")
> > [2] 093b92287363 ("drm/i915: Split i915_active.mutex into an irq-safe 
> > spinlock for the rbtree")
> > [3] 750bde2fd4ff ("drm/i915: Serialise with remote retirement")
> > 
> > Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> > Cc:  # 5.4.x
> > Signed-off-by: Sultan Alsawaf 
> > ---
> >  drivers/gpu/drm/i915/i915_active.c | 27 +++
> >  drivers/gpu/drm/i915/i915_active.h |  4 ++--
> >  2 files changed, 25 insertions(+), 6 deletions(-)
> 
> Now queued up, thanks.
> 
> greg k-h

Hi Greg,

Thanks for queuing this! However, you queued the v1 version of the patch instead
of v2. Please pick the v2 version instead.

Thanks,
Sultan


Re: [PATCH v2] drm/i915: Fix ref->mutex deadlock in i915_active_wait()

2020-04-11 Thread Sultan Alsawaf
On Fri, Apr 10, 2020 at 11:08:38AM +0200, Greg KH wrote:
> On Tue, Apr 07, 2020 at 12:18:09AM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf 
> > 
> > The following deadlock exists in i915_active_wait() due to a double lock
> > on ref->mutex (call chain listed in order from top to bottom):
> >  i915_active_wait();
> >  mutex_lock_interruptible(&ref->mutex); <-- ref->mutex first acquired
> >  i915_active_request_retire();
> >  node_retire();
> >  active_retire();
> >  mutex_lock_nested(&ref->mutex, SINGLE_DEPTH_NESTING); <-- DEADLOCK
> > 
> > Fix the deadlock by skipping the second ref->mutex lock when
> > active_retire() is called through i915_active_request_retire().
> > 
> > Note that this bug only affects 5.4 and has since been fixed in 5.5.
> > Normally, a backport of the fix from 5.5 would be in order, but the
> > patch set that fixes this deadlock involves massive changes that are
> > neither feasible nor desirable for backporting [1][2][3]. Therefore,
> > this small patch was made to address the deadlock specifically for 5.4.
> > 
> > [1] 274cbf20fd10 ("drm/i915: Push the i915_active.retire into a worker")
> > [2] 093b92287363 ("drm/i915: Split i915_active.mutex into an irq-safe 
> > spinlock for the rbtree")
> > [3] 750bde2fd4ff ("drm/i915: Serialise with remote retirement")
> > 
> > Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
> > Cc:  # 5.4.x
> > Signed-off-by: Sultan Alsawaf 
> > ---
> >  drivers/gpu/drm/i915/i915_active.c | 27 +++
> >  drivers/gpu/drm/i915/i915_active.h |  4 ++--
> >  2 files changed, 25 insertions(+), 6 deletions(-)
> 
> Now queued up, thanks.
> 
> greg k-h

I'm sorry, I meant the v3 [1]. Apologies for the confusion. The v3 was picked
into Ubuntu so that's what we're rolling with.

Thanks,
Sultan

[1] https://lore.kernel.org/stable/20200407203222.2493-1-sul...@kerneltoast.com


Re: [PATCH v3] drm/i915: Synchronize active and retire callbacks

2020-04-04 Thread Sultan Alsawaf
On Fri, Apr 03, 2020 at 03:35:15PM -0700, Sultan Alsawaf wrote:
> +	ref->retire(ref);
> +	mutex_unlock(&ref->callback_lock);

Ugh, this patch is still wrong because the mutex unlock after ref->retire() is a
use-after-free. Fun times...

Sultan


[PATCH 2/2] drm/i915/gt: Schedule request retirement when timeline idles

2020-03-31 Thread Sultan Alsawaf
From: Chris Wilson 

commit 311770173fac27845a3a83e2c16100a54d308f72 upstream.

The major drawback of commit 7e34f4e4aad3 ("drm/i915/gen8+: Add RC6 CTX
corruption WA") is that it disables RC6 while Skylake (and friends) is
active, and we do not consider the GPU idle until all outstanding
requests have been retired and the engine switched over to the kernel
context. If userspace is idle, this task falls onto our background idle
worker, which only runs roughly once a second, meaning that userspace has
to have been idle for a couple of seconds before we enable RC6 again.
Naturally, this causes us to consume considerably more energy than
before as powersaving is effectively disabled while a display server
(here's looking at you Xorg) is running.

As execlists will get a completion event as each context is completed,
we can use this interrupt to queue a retire worker bound to this engine
to cleanup idle timelines. We will then immediately notice the idle
engine (without userspace intervention or the aid of the background
retire worker) and start parking the GPU. Thus during light workloads,
we will do much more work to idle the GPU faster...  Hopefully with
commensurate power saving!

v2: Watch context completions and only look at those local to the engine
when retiring to reduce the amount of excess work we perform.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=112315
References: 7e34f4e4aad3 ("drm/i915/gen8+: Add RC6 CTX corruption WA")
References: 2248a28384fe ("drm/i915/gen8+: Add RC6 CTX corruption WA")
Signed-off-by: Chris Wilson 
Cc: Tvrtko Ursulin 
Reviewed-by: Tvrtko Ursulin 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20191125105858.1718307-3-ch...@chris-wilson.co.uk
(cherry picked from commit 4f88f8747fa43c97c3b3712d8d87295ea757cc51)
Signed-off-by: Joonas Lahtinen 
Cc:  # 5.4.x
Signed-off-by: Sultan Alsawaf 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  8 ++
 drivers/gpu/drm/i915/gt/intel_lrc.c   |  8 ++
 drivers/gpu/drm/i915/gt/intel_timeline.c  |  1 +
 .../gpu/drm/i915/gt/intel_timeline_types.h|  3 +
 drivers/gpu/drm/i915/i915_request.c   | 73 +++
 drivers/gpu/drm/i915/i915_request.h   |  4 +
 7 files changed, 99 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 4ce8626b140e..f732a2177cd0 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -600,6 +600,7 @@ static int intel_engine_setup_common(struct intel_engine_cs *engine)
 	intel_engine_init_hangcheck(engine);
 	intel_engine_init_cmd_parser(engine);
 	intel_engine_init__pm(engine);
+	intel_engine_init_retire(engine);
 
 	intel_engine_pool_init(&engine->pool);
 
@@ -807,6 +808,7 @@ void intel_engine_cleanup_common(struct intel_engine_cs *engine)
 
 	cleanup_status_page(engine);
 
+	intel_engine_fini_retire(engine);
 	intel_engine_pool_fini(&engine->pool);
 	intel_engine_fini_breadcrumbs(engine);
 	intel_engine_cleanup_cmd_parser(engine);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index c77c9518c58b..1eb7189f88e1 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -471,6 +471,14 @@ struct intel_engine_cs {
 
 	struct intel_engine_execlists execlists;
 
+	/*
+	 * Keep track of completed timelines on this engine for early
+	 * retirement with the goal of quickly enabling powersaving as
+	 * soon as the engine is idle.
+	 */
+	struct intel_timeline *retire;
+	struct work_struct retire_work;
+
 	/* status_notifier: list of callbacks for context-switch changes */
 	struct atomic_notifier_head context_status_notifier;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 66f6d1a897f2..a1538c8f7922 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -606,6 +606,14 @@ __execlists_schedule_out(struct i915_request *rq,
 {
struct intel_context * const ce = rq->hw_context;
 
+   /*
+* If we have just completed this context, the engine may now be
+* idle and we want to re-enter powersaving.
+*/
+   if (list_is_last(&rq->link, &ce->timeline->requests) &&
+   i915_request_completed(rq))
+   intel_engine_add_retire(engine, ce->timeline);
+
intel_engine_context_out(engine);
execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT);
intel_gt_pm_put(engine->gt);
diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
index 9cb01d9828f1..63515e3caaf2 100644
--- a/drivers/gpu/drm/i915/gt/intel_timeline.c
+++ b/dri

[PATCH 0/2] A couple of important fixes for i915 on 5.4

2020-03-31 Thread Sultan Alsawaf
From: Sultan Alsawaf 

Hi,

This patchset contains fixes for two pesky i915 bugs that exist in 5.4.

The first bug, fixed by
"drm/i915/gt: Schedule request retirement when timeline idles" upstream, is very
high power consumption by i915 hardware due to an old commit that broke the RC6
power state for the iGPU on Skylake+. On my laptop, a Dell Precision 5540 with
an i9-9880H, the idle power consumption of my laptop with this commit goes down
from 10 W to 2 W, saving a massive 8 W of power. On other Skylake+ laptops I
tested, this commit reduced idle power consumption by at least a few watts. The
difference in power consumption can be observed by running `powertop` while
disconnected from the charger, or by using the intel-undervolt tool [1] and
running `intel-undervolt measure` which doesn't require being disconnected from
the charger. The psys measurement from intel-undervolt is the one of interest.

Backporting this commit was not trivial due to i915's high rate of churn, but
the backport didn't require changing any code from the original commit; it only
required moving code around and adjusting some #includes.

The second bug causes severe graphical corruption and flickering on laptops
which support an esoteric power-saving mechanism called Panel Self Refresh
(PSR). Enabled by default in 5.0, PSR causes graphical corruption to the point
of unusability on many Dell laptops made within the past few years, since they
typically support PSR. A bug was filed with Intel a while ago for this with more
information [2]. I suspect most of the community hasn't been affected by this bug
because ThinkPads and many other laptops I checked didn't support PSR. As of
writing, the issues I observed with PSR are fixed in Intel's drm-tip branch, but
i915 suffers from so much churn that backporting the individual PSR fixes is
infeasible (and an Intel engineer attested to this, saying that the PSR fixes in
drm-tip wouldn't be backported [3]). Disabling PSR by default brings us back to
pre-5.0 behavior, and from my tests with functional PSR in the drm-tip kernel,
I didn't observe any reduction in power consumption with PSR enabled, so there
isn't much lost from turning it off. Also, Ubuntu now ships with PSR disabled by
default in its affected kernels, so there is distro support behind this change.
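For reference, affected kernels that still default PSR to on can have it
disabled by hand through the i915 enable_psr module parameter. A minimal
sketch (the modprobe.d file name is arbitrary):

```conf
# /etc/modprobe.d/i915.conf -- disable Panel Self Refresh at module load
# (equivalently, boot with i915.enable_psr=0 on the kernel command line)
options i915 enable_psr=0
```

The currently active value can be read back at runtime from
/sys/module/i915/parameters/enable_psr.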

Thanks,
Sultan

[1] https://github.com/kitsunyan/intel-undervolt
[2] https://gitlab.freedesktop.org/drm/intel/issues/425
[3] https://gitlab.freedesktop.org/drm/intel/issues/425#note_416130

Chris Wilson (1):
  drm/i915/gt: Schedule request retirement when timeline idles

Sultan Alsawaf (1):
  drm/i915: Disable Panel Self Refresh (PSR) by default

 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  8 ++
 drivers/gpu/drm/i915/gt/intel_lrc.c   |  8 ++
 drivers/gpu/drm/i915/gt/intel_timeline.c  |  1 +
 .../gpu/drm/i915/gt/intel_timeline_types.h|  3 +
 drivers/gpu/drm/i915/i915_params.c|  3 +-
 drivers/gpu/drm/i915/i915_params.h|  2 +-
 drivers/gpu/drm/i915/i915_request.c   | 73 +++
 drivers/gpu/drm/i915/i915_request.h   |  4 +
 9 files changed, 101 insertions(+), 3 deletions(-)

-- 
2.26.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 2/2] drm/i915/gt: Schedule request retirement when timeline idles

2020-03-31 Thread Sultan Alsawaf
CC'ing relevant folks. My apologies.

Sultan


Re: [PATCH 0/2] A couple of important fixes for i915 on 5.4

2020-03-31 Thread Sultan Alsawaf
On Mon, Mar 30, 2020 at 10:51:28AM +0200, Greg KH wrote:
> On Sun, Mar 29, 2020 at 08:30:55PM -0700, Sultan Alsawaf wrote:
> > From: Sultan Alsawaf 
> > 
> > Hi,
> > 
> > This patchset contains fixes for two pesky i915 bugs that exist in 5.4.
> 
> Any reason you didn't also cc: the developers involved in these patches,
> and maintainers?
> 
> Please resend and do that.
> 
> thanks,
> 
> greg k-h

CC'd. Sorry about that.

Sultan