Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

2019-09-12 Thread Chris Wilson
Quoting Linus Torvalds (2019-09-12 12:59:25)
> On Thu, Sep 12, 2019 at 12:51 PM Martin Wilck wrote:
> >
> > Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
> > Acquire the page lock around set_page_dirty()")? And if we do, what
> > would be the consequences? Would other patches need to be reverted,
> > too?
> 
> Looking at that commit, and the backtrace of the lockup, I think that
> reverting it is the correct thing to do.
> 
> You can't take the page lock in invalidate_range(), since it's called
> from try_to_unmap(), which is called with the page lock already held.
> 
> So commit aa56a292ce62 is just fundamentally completely wrong and
> should be reverted.

There's still the dilemma that we get called without the page lock, but
at this moment in time in order to hit 5.3, it needs a revert sent
directly to Linus.
-Chris


Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

2019-09-12 Thread Martin Wilck
Hi Chris,

On Tue, 2019-09-10 at 17:20 +0300, Leho Kraav wrote:
> On Fri, Aug 09, 2019 at 01:53:43PM +0100, Chris Wilson wrote:
> > Quoting Martin Wilck (2019-08-09 13:41:42)
> > > This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
> > > (5.3-rc3 with just a few patches on top), after starting a KVM virtual
> > > machine. The X screen was frozen. Remote login via ssh was still
> > > possible, thus I was able to retrieve basic logs.
> > > 
> > > sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
> > > minute, the same two processes were still blocked. KVM seems to try to
> > > acquire a lock that kcompactd is holding. kcompactd is waiting for IO
> > > to complete on pages owned by the i915 driver.
> > 
> > My bad, it's known. We haven't decided on whether to revert the
> > unfortunate recursive locking (and so hit another warn elsewhere) or to
> > ignore the dirty pages (and so risk losing data across swap).
> > 
> > cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around
> > set_page_dirty()")
> > -Chris
> 
> Hi Chris. Is this exactly what I'm hitting at
> https://bugs.freedesktop.org/show_bug.cgi?id=111500 perhaps?
> 
> It reliably breaks the graphics userland, as the machine consistently
> freezes at any random moment.
> 
> Any workaround options, even if with a performance penalty? Revert
> cb6d7c7dc7ff but side effects?
> 
> 5.3 has useful NVMe power mgmt updates for laptops, I'd like to stick
> with the newest if possible.

There's a considerable risk that many users will start seeing this
regression when 5.3 is released. I am not aware of a workaround.

Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
Acquire the page lock around set_page_dirty()")? And if we do, what
would be the consequences? Would other patches need to be reverted,
too?

Thanks,
Martin



Re: [Intel-gfx] 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

2019-09-12 Thread Linus Torvalds
On Thu, Sep 12, 2019 at 12:51 PM Martin Wilck wrote:
>
> Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
> Acquire the page lock around set_page_dirty()")? And if we do, what
> would be the consequences? Would other patches need to be reverted,
> too?

Looking at that commit, and the backtrace of the lockup, I think that
reverting it is the correct thing to do.

You can't take the page lock in invalidate_range(), since it's called
from try_to_unmap(), which is called with the page lock already held.

So commit aa56a292ce62 is just fundamentally completely wrong and
should be reverted.

   Linus
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
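
The recursion described in the message above can be imitated in user space. The following is only a rough sketch, not kernel code: `threading.Lock` stands in for the non-recursive page lock, and the Python function names are hypothetical stand-ins chosen to mirror the kernel call chain. The timeout is an assumption added so the demo terminates; the kernel's lock_page() would simply sleep.

```python
import threading

# Stand-in for the kernel's per-page lock, which is not recursive.
page_lock = threading.Lock()


def put_pages_dirty_locked(timeout=0.1):
    """Models set_page_dirty_lock(): lock the page, mark it dirty, unlock.

    Returns True if the lock was obtained, False if it could not be taken
    within the timeout. The timeout exists only so this demo finishes.
    """
    if not page_lock.acquire(timeout=timeout):
        return False
    try:
        return True  # the page would be marked dirty here
    finally:
        page_lock.release()


def invalidate_range_start():
    """Models the notifier callback that releases the object's pages."""
    return put_pages_dirty_locked()


def try_to_unmap():
    """Models the caller: the page lock is already held on entry."""
    with page_lock:
        return invalidate_range_start()


if __name__ == "__main__":
    print("called without the page lock:", invalidate_range_start())
    print("called from try_to_unmap():  ", try_to_unmap())
```

In this model the direct call succeeds, while the call made under try_to_unmap() cannot take the lock a second time, which is the situation the kcompactd0 trace shows.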

Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

2019-09-12 Thread l...@kraav.com
On Thu, Sep 12, 2019 at 11:23:09AM +0000, Martin Wilck wrote:
> 
> There's a considerable risk that many users will start seeing this
> regression when 5.3 is released. I am not aware of a workaround.
> 
> Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
> Acquire the page lock around set_page_dirty()")? And if we do, what
> would be the consequences? Would other patches need to be reverted,
> too?

I've been running with the revert patch for a couple of days and have not
encountered any kernel warnings thus far, nor any other ill effects that
could be attributed to this locking mechanism.

But I'm far from familiar with these subsystems.

Graphics does not hang anymore.

I've also received developer feedback in private that this really should
be fixed before 5.3 release.

-- 
Leho Kraav, senior technology & digital marketing architect


Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

2019-09-10 Thread Leho Kraav
On Fri, Aug 09, 2019 at 01:53:43PM +0100, Chris Wilson wrote:
> Quoting Martin Wilck (2019-08-09 13:41:42)
> > This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
> > (5.3-rc3 with just a few patches on top), after starting a KVM virtual
> > machine. The X screen was frozen. Remote login via ssh was still
> > possible, thus I was able to retrieve basic logs.
> > 
> > sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
> > minute, the same two processes were still blocked. KVM seems to try to
> > acquire a lock that kcompactd is holding. kcompactd is waiting for IO
> > to complete on pages owned by the i915 driver.
> 
> My bad, it's known. We haven't decided on whether to revert the
> unfortunate recursive locking (and so hit another warn elsewhere) or to
> ignore the dirty pages (and so risk losing data across swap).
> 
> cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around 
> set_page_dirty()")
> -Chris

Hi Chris. Is this exactly what I'm hitting at
https://bugs.freedesktop.org/show_bug.cgi?id=111500 perhaps?

It reliably breaks the graphics userland, as the machine consistently
freezes at any random moment.

Any workaround options, even if with a performance penalty? Revert
cb6d7c7dc7ff but side effects?

5.3 has useful NVMe power mgmt updates for laptops, I'd like to stick
with the newest if possible.

-- 
Leho Kraav, senior technology & digital marketing architect


[Intel-gfx] ✗ Fi.CI.BAT: failure for 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

2019-08-09 Thread Patchwork
== Series Details ==

Series: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
URL   : https://patchwork.freedesktop.org/series/64983/
State : failure

== Summary ==

Applying: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
error: sha1 information is lacking or useless (drivers/gpu/drm/i915/gem/i915_gem_userptr.c).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch' to see the failed patch
Patch failed at 0001 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


Re: [Intel-gfx] 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

2019-08-09 Thread Hillf Danton

[off topic: plain text mail please]

On Fri, 9 Aug 2019 12:41:42 +0000 Martin Wilck wrote:
> 
> This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
> (5.3-rc3 with just a few patches on top), after starting a KVM virtual
> machine. The X screen was frozen. Remote login via ssh was still
> possible, thus I was able to retrieve basic logs.

Thanks for the report.
> 
> sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
> minute, the same two processes were still blocked. KVM seems to try to
> acquire a lock that kcompactd is holding. kcompactd is waiting for IO
> to complete on pages owned by the i915 driver.
> 
> kcompactd stack:
> 
> Aug 09 12:12:48 apollon.suse.de kernel: sysrq: Show Blocked State
> Aug 09 12:12:48 apollon.suse.de kernel:   task                        PC stack   pid father
> Aug 09 12:12:48 apollon.suse.de kernel: kcompactd0      D    0    43      2 0x80004000
> Aug 09 12:12:48 apollon.suse.de kernel: Call Trace:
> Aug 09 12:12:48 apollon.suse.de kernel:  ? __schedule+0x2af/0x6a0
> Aug 09 12:12:48 apollon.suse.de kernel:  schedule+0x33/0x90
> Aug 09 12:12:48 apollon.suse.de kernel:  io_schedule+0x12/0x40
> Aug 09 12:12:48 apollon.suse.de kernel:  __lock_page+0x123/0x200
> Aug 09 12:12:48 apollon.suse.de kernel:  ? gen8_ppgtt_clear_pdp+0xc0/0x140 [i915]
> Aug 09 12:12:48 apollon.suse.de kernel:  ? file_fdatawait_range+0x20/0x20
> Aug 09 12:12:48 apollon.suse.de kernel:  set_page_dirty_lock+0x49/0x50
> Aug 09 12:12:48 apollon.suse.de kernel:  i915_gem_userptr_put_pages+0x13f/0x1c0 [i915]

The two lines above show that commit aa56a292ce62 ("drm/i915/userptr:
Acquire the page lock around set_page_dirty()") is the culprit.

> Aug 09 12:12:48 apollon.suse.de kernel:  __i915_gem_object_put_pages+0x5e/0xa0 [i915]
> Aug 09 12:12:48 apollon.suse.de kernel:  userptr_mn_invalidate_range_start+0x1ff/0x220 [i915]
> Aug 09 12:12:48 apollon.suse.de kernel:  __mmu_notifier_invalidate_range_start+0x57/0xa0
> Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap_one+0xa0b/0xae0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? __mod_lruvec_state+0x3f/0xf0
> Aug 09 12:12:48 apollon.suse.de kernel:  rmap_walk_file+0xf2/0x250
> Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap+0xa6/0xe0

The page is locked before try_to_unmap(), and the dirty page table entry
is handled in try_to_unmap_one(), so what was added in aa56a292ce62 is
a bit of an overreaction in this call trace. A bigger pain is that it
cannot simply be reverted because of the Fixes tag in it.

> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_remove_rmap+0x290/0x290
> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_not_mapped+0x20/0x20
> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_get_anon_vma+0x80/0x80
> Aug 09 12:12:48 apollon.suse.de kernel:  migrate_pages+0x8cd/0xbc0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? fast_isolate_freepages+0x6b0/0x6b0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? move_freelist_tail+0xb0/0xb0
> Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone+0x669/0xc80
> Aug 09 12:12:48 apollon.suse.de kernel:  ? entry_SYSCALL_64_after_hwframe+0xb8/0xbe
> Aug 09 12:12:48 apollon.suse.de kernel:  kcompactd_do_work+0x120/0x290
> 
> KVM stack:
> 
> Aug 09 12:12:48 apollon.suse.de kernel: CPU 0/KVM       D    0 25189      1 0x0320
> Aug 09 12:12:48 apollon.suse.de kernel: Call Trace:
> Aug 09 12:12:48 apollon.suse.de kernel:  ? __schedule+0x2af/0x6a0
> Aug 09 12:12:48 apollon.suse.de kernel:  schedule+0x33/0x90
> Aug 09 12:12:48 apollon.suse.de kernel:  schedule_preempt_disabled+0xa/0x10
> Aug 09 12:12:48 apollon.suse.de kernel:  __mutex_lock.isra.0+0x172/0x4d0
> Aug 09 12:12:48 apollon.suse.de kernel:  userptr_mn_invalidate_range_start+0x1bf/0x220 [i915]
> Aug 09 12:12:48 apollon.suse.de kernel:  __mmu_notifier_invalidate_range_start+0x57/0xa0
> Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap_one+0xa0b/0xae0
> Aug 09 12:12:48 apollon.suse.de kernel:  rmap_walk_file+0xf2/0x250
> Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap+0xa6/0xe0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_remove_rmap+0x290/0x290
> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_not_mapped+0x20/0x20
> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_get_anon_vma+0x80/0x80
> Aug 09 12:12:48 apollon.suse.de kernel:  migrate_pages+0x8cd/0xbc0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? fast_isolate_freepages+0x6b0/0x6b0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? move_freelist_tail+0xb0/0xb0
> Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone+0x669/0xc80
> Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone_order+0xc6/0xf0
> Aug 09 12:12:48 apollon.suse.de kernel:  try_to_compact_pages+0xcc/0x2a0
> Aug 09 12:12:48 apollon.suse.de kernel:  __alloc_pages_direct_compact+0x7c/0x150
> Aug 09 12:12:48 apollon.suse.de kernel:  __alloc_pages_slowpath+0x1ee/0xd00
> Aug 09 12:12:48 apollon.suse.de kernel:  ? vmx_vcpu_load+0x100/0x120 [kvm_intel]
> 
> Full logs can be found under https://pastebin.com/KJ6tccj4
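
The reading of the kcompactd0 trace given above (the page lock is requested while try_to_unmap() is still on the stack) can also be checked mechanically. The snippet below is a small sketch under stated assumptions: the frames are copied from the report, while the helper names `reliable_frames` and `locks_page_under` are hypothetical and exist only for this illustration.

```python
import re

# Frames copied from the kcompactd0 trace in the report, innermost first.
KCOMPACTD_TRACE = """\
 ? __schedule+0x2af/0x6a0
 schedule+0x33/0x90
 io_schedule+0x12/0x40
 __lock_page+0x123/0x200
 ? gen8_ppgtt_clear_pdp+0xc0/0x140 [i915]
 ? file_fdatawait_range+0x20/0x20
 set_page_dirty_lock+0x49/0x50
 i915_gem_userptr_put_pages+0x13f/0x1c0 [i915]
 __i915_gem_object_put_pages+0x5e/0xa0 [i915]
 userptr_mn_invalidate_range_start+0x1ff/0x220 [i915]
 __mmu_notifier_invalidate_range_start+0x57/0xa0
 try_to_unmap_one+0xa0b/0xae0
 ? __mod_lruvec_state+0x3f/0xf0
 rmap_walk_file+0xf2/0x250
 try_to_unmap+0xa6/0xe0
"""

# Optional "? " marker, then symbol+0xoffset/0xsize.
FRAME = re.compile(r"^\s*(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(trace):
    """Return function names, innermost first, skipping '?' frames."""
    names = []
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match and not match.group(1):
            names.append(match.group(2))
    return names


def locks_page_under(trace, outer="try_to_unmap"):
    """True if __lock_page is reached while `outer` is still on the stack."""
    frames = reliable_frames(trace)
    if "__lock_page" not in frames or outer not in frames:
        return False
    return frames.index("__lock_page") < frames.index(outer)


if __name__ == "__main__":
    print(locks_page_under(KCOMPACTD_TRACE))
```

Frames prefixed with "?" are skipped because the kernel marks them as unreliable stack entries.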

Re: [Intel-gfx] 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

2019-08-09 Thread Chris Wilson
Quoting Martin Wilck (2019-08-09 13:41:42)
> This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
> (5.3-rc3 with just a few patches on top), after starting a KVM virtual
> machine. The X screen was frozen. Remote login via ssh was still
> possible, thus I was able to retrieve basic logs.
> 
> sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
> minute, the same two processes were still blocked. KVM seems to try to
> acquire a lock that kcompactd is holding. kcompactd is waiting for IO
> to complete on pages owned by the i915 driver.

My bad, it's known. We haven't decided on whether to revert the
unfortunate recursive locking (and so hit another warn elsewhere) or to
ignore the dirty pages (and so risk losing data across swap).

cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around set_page_dirty()")
-Chris

5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

2019-08-09 Thread Martin Wilck
This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
(5.3-rc3 with just a few patches on top), after starting a KVM virtual
machine. The X screen was frozen. Remote login via ssh was still
possible, thus I was able to retrieve basic logs.

sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
minute, the same two processes were still blocked. KVM seems to try to
acquire a lock that kcompactd is holding. kcompactd is waiting for IO
to complete on pages owned by the i915 driver.

kcompactd stack:

Aug 09 12:12:48 apollon.suse.de kernel: sysrq: Show Blocked State
Aug 09 12:12:48 apollon.suse.de kernel:   task                        PC stack   pid father
Aug 09 12:12:48 apollon.suse.de kernel: kcompactd0      D    0    43      2 0x80004000
Aug 09 12:12:48 apollon.suse.de kernel: Call Trace:
Aug 09 12:12:48 apollon.suse.de kernel:  ? __schedule+0x2af/0x6a0
Aug 09 12:12:48 apollon.suse.de kernel:  schedule+0x33/0x90
Aug 09 12:12:48 apollon.suse.de kernel:  io_schedule+0x12/0x40
Aug 09 12:12:48 apollon.suse.de kernel:  __lock_page+0x123/0x200
Aug 09 12:12:48 apollon.suse.de kernel:  ? gen8_ppgtt_clear_pdp+0xc0/0x140 [i915]
Aug 09 12:12:48 apollon.suse.de kernel:  ? file_fdatawait_range+0x20/0x20
Aug 09 12:12:48 apollon.suse.de kernel:  set_page_dirty_lock+0x49/0x50
Aug 09 12:12:48 apollon.suse.de kernel:  i915_gem_userptr_put_pages+0x13f/0x1c0 [i915]
Aug 09 12:12:48 apollon.suse.de kernel:  __i915_gem_object_put_pages+0x5e/0xa0 [i915]
Aug 09 12:12:48 apollon.suse.de kernel:  userptr_mn_invalidate_range_start+0x1ff/0x220 [i915]
Aug 09 12:12:48 apollon.suse.de kernel:  __mmu_notifier_invalidate_range_start+0x57/0xa0
Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap_one+0xa0b/0xae0
Aug 09 12:12:48 apollon.suse.de kernel:  ? __mod_lruvec_state+0x3f/0xf0
Aug 09 12:12:48 apollon.suse.de kernel:  rmap_walk_file+0xf2/0x250
Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap+0xa6/0xe0
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_remove_rmap+0x290/0x290
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_not_mapped+0x20/0x20
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_get_anon_vma+0x80/0x80
Aug 09 12:12:48 apollon.suse.de kernel:  migrate_pages+0x8cd/0xbc0
Aug 09 12:12:48 apollon.suse.de kernel:  ? fast_isolate_freepages+0x6b0/0x6b0
Aug 09 12:12:48 apollon.suse.de kernel:  ? move_freelist_tail+0xb0/0xb0
Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone+0x669/0xc80
Aug 09 12:12:48 apollon.suse.de kernel:  ? entry_SYSCALL_64_after_hwframe+0xb8/0xbe
Aug 09 12:12:48 apollon.suse.de kernel:  kcompactd_do_work+0x120/0x290


KVM stack:

Aug 09 12:12:48 apollon.suse.de kernel: CPU 0/KVM       D    0 25189      1 0x0320
Aug 09 12:12:48 apollon.suse.de kernel: Call Trace:
Aug 09 12:12:48 apollon.suse.de kernel:  ? __schedule+0x2af/0x6a0
Aug 09 12:12:48 apollon.suse.de kernel:  schedule+0x33/0x90
Aug 09 12:12:48 apollon.suse.de kernel:  schedule_preempt_disabled+0xa/0x10
Aug 09 12:12:48 apollon.suse.de kernel:  __mutex_lock.isra.0+0x172/0x4d0
Aug 09 12:12:48 apollon.suse.de kernel:  userptr_mn_invalidate_range_start+0x1bf/0x220 [i915]
Aug 09 12:12:48 apollon.suse.de kernel:  __mmu_notifier_invalidate_range_start+0x57/0xa0
Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap_one+0xa0b/0xae0
Aug 09 12:12:48 apollon.suse.de kernel:  rmap_walk_file+0xf2/0x250
Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap+0xa6/0xe0
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_remove_rmap+0x290/0x290
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_not_mapped+0x20/0x20
Aug 09 12:12:48 apollon.suse.de kernel:  ? page_get_anon_vma+0x80/0x80
Aug 09 12:12:48 apollon.suse.de kernel:  migrate_pages+0x8cd/0xbc0
Aug 09 12:12:48 apollon.suse.de kernel:  ? fast_isolate_freepages+0x6b0/0x6b0
Aug 09 12:12:48 apollon.suse.de kernel:  ? move_freelist_tail+0xb0/0xb0
Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone+0x669/0xc80
Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone_order+0xc6/0xf0
Aug 09 12:12:48 apollon.suse.de kernel:  try_to_compact_pages+0xcc/0x2a0
Aug 09 12:12:48 apollon.suse.de kernel:  __alloc_pages_direct_compact+0x7c/0x150
Aug 09 12:12:48 apollon.suse.de kernel:  __alloc_pages_slowpath+0x1ee/0xd00
Aug 09 12:12:48 apollon.suse.de kernel:  ? vmx_vcpu_load+0x100/0x120 [kvm_intel]

Full logs can be found under https://pastebin.com/KJ6tccj4
I haven't yet checked whether this is reproducible.

Regards
Martin
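
The dependency described in this report (KVM waiting on a lock that kcompactd seems to hold, kcompactd itself stuck on a page) can be written down as a tiny wait-for chain. This is only a sketch: the lock names and the holder assignments in `HOLDS` are assumptions drawn from the two traces and the wording of the report, not facts recorded in the log, and `wait_chain` is a hypothetical helper.

```python
# Which task is assumed to hold each lock (an interpretation of the traces).
HOLDS = {
    "page lock": "kcompactd0",           # assumed taken before try_to_unmap()
    "i915 userptr mutex": "kcompactd0",  # assumed: the lock KVM blocks on
}

# Which lock each blocked task is waiting for, per its innermost frames.
WAITS_FOR = {
    "CPU 0/KVM": "i915 userptr mutex",   # blocked in __mutex_lock
    "kcompactd0": "page lock",           # blocked in __lock_page
}


def wait_chain(task):
    """Follow task -> awaited lock -> holder until a task repeats."""
    chain = [task]
    seen = {task}
    while task in WAITS_FOR:
        lock = WAITS_FOR[task]
        task = HOLDS[lock]
        chain += [lock, task]
        if task in seen:
            break  # the holder is already in the chain: nobody can progress
        seen.add(task)
    return chain


if __name__ == "__main__":
    print(" -> ".join(wait_chain("CPU 0/KVM")))
```

Under these assumptions the chain ends with kcompactd0 waiting on a lock held by kcompactd0, so neither blocked task can make progress.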