Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
Quoting Linus Torvalds (2019-09-12 12:59:25) > On Thu, Sep 12, 2019 at 12:51 PM Martin Wilck wrote: > > > > Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr: > > Acquire the page lock around set_page_dirty()")? And if we do, what > > would be the consequences? Would other patches need to be reverted, > > too? > > Looking at that commit, and the backtrace of the lockup, I think that > reverting it is the correct thing to do. > > You can't take the page lock in invalidate_range(), since it's called > from try_to_unmap(), which is called with the page lock already held. > > So commit aa56a292ce62 is just fundamentally completely wrong and > should be reverted. There's still the dilemma that we get called without the page lock, but at this moment in time in order to hit 5.3, it needs a revert sent directly to Linus. -Chris
Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
Hi Chris, On Tue, 2019-09-10 at 17:20 +0300, Leho Kraav wrote: > On Fri, Aug 09, 2019 at 01:53:43PM +0100, Chris Wilson wrote: > > Quoting Martin Wilck (2019-08-09 13:41:42) > > > This happened to me today, running kernel 5.3.0-rc3-1.g571863b- > > > default > > > (5.3-rc3 with just a few patches on top), after starting a KVM > > > virtual > > > machine. The X screen was frozen. Remote login via ssh was still > > > possible, thus I was able to retrieve basic logs. > > > > > > sysrq-w showed two blocked processes (kcompactd0 and KVM). After > > > a > > > minute, the same two processes were still blocked. KVM seems to > > > try to > > > acquire a lock that kcompactd is holding. kcompactd is waiting > > > for IO > > > to complete on pages owned by the i915 driver. > > > > My bad, it's known. We haven't decided on whether to revert the > > unfortunate recursive locking (and so hit another warn elsewhere) > > or to > > ignore the dirty pages (and so risk losing data across swap). > > > > cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around > > set_page_dirty()") > > -Chris > > Hi Chris. Is this exactly what I'm hitting at > https://bugs.freedesktop.org/show_bug.cgi?id=111500 perhaps? > > It reliably breaks the graphics userland, as the machine consistently > freezes at any random moment. > > Any workaround options, even if with a performance penalty? Revert > cb6d7c7dc7ff but side effects? > > 5.3 has useful NVMe power mgmt updates for laptops, I'd like to stick > with the newest if possible. There's a considerable risk that many users will start seeing this regression when 5.3 is released. I am not aware of a workaround. Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr: Acquire the page lock around set_page_dirty()")? And if we do, what would be the consequences? Would other patches need to be reverted, too? Thanks, Martin
Re: [Intel-gfx] 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
On Thu, Sep 12, 2019 at 12:51 PM Martin Wilck wrote: > > Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr: > Acquire the page lock around set_page_dirty()")? And if we do, what > would be the consequences? Would other patches need to be reverted, > too? Looking at that commit, and the backtrace of the lockup, I think that reverting it is the correct thing to do. You can't take the page lock in invalidate_range(), since it's called from try_to_unmap(), which is called with the page lock already held. So commit aa56a292ce62 is just fundamentally completely wrong and should be reverted. Linus ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
On Thu, Sep 12, 2019 at 11:23:09AM +, Martin Wilck wrote: > > There's a considerable risk that many users will start seeing this > regression when 5.3 is released. I am not aware of a workaround. > > Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr: > Acquire the page lock around set_page_dirty()")? And if we do, what > would be the consequences? Would other patches need to be reverted, > too? I've been running with revert patch for a couple of days and have not encountered any kernel warnings thus far, nor any other ill effects that could be attributed to this locking mechanism. But I'm far from familiar with these subsystems. Graphics does not hang anymore. I've also received developer feedback in private that this really should be fixed before 5.3 release. -- Leho Kraav, senior technology & digital marketing architect
Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
On Fri, Aug 09, 2019 at 01:53:43PM +0100, Chris Wilson wrote: > Quoting Martin Wilck (2019-08-09 13:41:42) > > This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default > > (5.3-rc3 with just a few patches on top), after starting a KVM virtual > > machine. The X screen was frozen. Remote login via ssh was still > > possible, thus I was able to retrieve basic logs. > > > > sysrq-w showed two blocked processes (kcompactd0 and KVM). After a > > minute, the same two processes were still blocked. KVM seems to try to > > acquire a lock that kcompactd is holding. kcompactd is waiting for IO > > to complete on pages owned by the i915 driver. > > My bad, it's known. We haven't decided on whether to revert the > unfortunate recursive locking (and so hit another warn elsewhere) or to > ignore the dirty pages (and so risk losing data across swap). > > cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around > set_page_dirty()") > -Chris Hi Chris. Is this exactly what I'm hitting at https://bugs.freedesktop.org/show_bug.cgi?id=111500 perhaps? It reliably breaks the graphics userland, as the machine consistently freezes at any random moment. Any workaround options, even if with a performance penalty? Revert cb6d7c7dc7ff but side effects? 5.3 has useful NVMe power mgmt updates for laptops, I'd like to stick with the newest if possible. -- Leho Kraav, senior technology & digital marketing architect
[Intel-gfx] ✗ Fi.CI.BAT: failure for 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
== Series Details == Series: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages URL : https://patchwork.freedesktop.org/series/64983/ State : failure == Summary == Applying: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages error: sha1 information is lacking or useless (drivers/gpu/drm/i915/gem/i915_gem_userptr.c). error: could not build fake ancestor hint: Use 'git am --show-current-patch' to see the failed patch Patch failed at 0001 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages When you have resolved this problem, run "git am --continue". If you prefer to skip this patch, run "git am --skip" instead. To restore the original branch and stop patching, run "git am --abort". ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
[off topic: plain text mail please] On Fri, 9 Aug 2019 12:41:42 + Martin Wilck wrote: > > This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default > (5.3-rc3 with just a few patches on top), after starting a KVM virtual > machine. The X screen was frozen. Remote login via ssh was still > possible, thus I was able to retrieve basic logs. Thanks for report. > > sysrq-w showed two blocked processes (kcompactd0 and KVM). After a > minute, the same two processes were still blocked. KVM seems to try to > acquire a lock that kcompactd is holding. kcompactd is waiting for IO > to complete on pages owned by the i915 driver. > > kcompactd stack: > > Aug 09 12:12:48 apollon.suse.de kernel: sysrq: Show Blocked State > Aug 09 12:12:48 apollon.suse.de kernel: taskPC stack > pid father > Aug 09 12:12:48 apollon.suse.de kernel: kcompactd0 D043 2 > 0x80004000 > Aug 09 12:12:48 apollon.suse.de kernel: Call Trace: > Aug 09 12:12:48 apollon.suse.de kernel: ? __schedule+0x2af/0x6a0 > Aug 09 12:12:48 apollon.suse.de kernel: schedule+0x33/0x90 > Aug 09 12:12:48 apollon.suse.de kernel: io_schedule+0x12/0x40 > Aug 09 12:12:48 apollon.suse.de kernel: __lock_page+0x123/0x200 > Aug 09 12:12:48 apollon.suse.de kernel: ? gen8_ppgtt_clear_pdp+0xc0/0x140 > [i915] > Aug 09 12:12:48 apollon.suse.de kernel: ? file_fdatawait_range+0x20/0x20 > Aug 09 12:12:48 apollon.suse.de kernel: set_page_dirty_lock+0x49/0x50 > Aug 09 12:12:48 apollon.suse.de kernel: > i915_gem_userptr_put_pages+0x13f/0x1c0 [i915] The two lines above show commit aa56a292ce62 ("drm/i915/userptr: Acquire the page lock around set_page_dirty()") is culprit. > Aug 09 12:12:48 apollon.suse.de kernel: > __i915_gem_object_put_pages+0x5e/0xa0 [i915] > Aug 09 12:12:48 apollon.suse.de kernel: > userptr_mn_invalidate_range_start+0x1ff/0x220 [i915] > Aug 09 12:12:48 apollon.suse.de kernel: > __mmu_notifier_invalidate_range_start+0x57/0xa0 > Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap_one+0xa0b/0xae0 > Aug 09 12:12:48 apollon.suse.de kernel: ? __mod_lruvec_state+0x3f/0xf0 > Aug 09 12:12:48 apollon.suse.de kernel: rmap_walk_file+0xf2/0x250 > Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap+0xa6/0xe0 Page is locked before try_to_unmap(), and dirty page table entry is handled in try_to_unmap_one(), so what was added in aa56a292ce62 is a bit of overaction in this call trace. A bigger pain is it can not be reverted because of the Fixes tag in it. > Aug 09 12:12:48 apollon.suse.de kernel: ? page_remove_rmap+0x290/0x290 > Aug 09 12:12:48 apollon.suse.de kernel: ? page_not_mapped+0x20/0x20 > Aug 09 12:12:48 apollon.suse.de kernel: ? page_get_anon_vma+0x80/0x80 > Aug 09 12:12:48 apollon.suse.de kernel: migrate_pages+0x8cd/0xbc0 > Aug 09 12:12:48 apollon.suse.de kernel: ? fast_isolate_freepages+0x6b0/0x6b0 > Aug 09 12:12:48 apollon.suse.de kernel: ? move_freelist_tail+0xb0/0xb0 > Aug 09 12:12:48 apollon.suse.de kernel: compact_zone+0x669/0xc80 > Aug 09 12:12:48 apollon.suse.de kernel: ? > entry_SYSCALL_64_after_hwframe+0xb8/0xbe > Aug 09 12:12:48 apollon.suse.de kernel: kcompactd_do_work+0x120/0x290 > > KVM stack: > > Aug 09 12:12:48 apollon.suse.de kernel: CPU 0/KVM D0 25189 1 > 0x0320 > Aug 09 12:12:48 apollon.suse.de kernel: Call Trace: > Aug 09 12:12:48 apollon.suse.de kernel: ? __schedule+0x2af/0x6a0 > Aug 09 12:12:48 apollon.suse.de kernel: schedule+0x33/0x90 > Aug 09 12:12:48 apollon.suse.de kernel: schedule_preempt_disabled+0xa/0x10 > Aug 09 12:12:48 apollon.suse.de kernel: __mutex_lock.isra.0+0x172/0x4d0 > Aug 09 12:12:48 apollon.suse.de kernel: > userptr_mn_invalidate_range_start+0x1bf/0x220 [i915] > Aug 09 12:12:48 apollon.suse.de kernel: > __mmu_notifier_invalidate_range_start+0x57/0xa0 > Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap_one+0xa0b/0xae0 > Aug 09 12:12:48 apollon.suse.de kernel: rmap_walk_file+0xf2/0x250 > Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap+0xa6/0xe0 > Aug 09 12:12:48 apollon.suse.de kernel: ? page_remove_rmap+0x290/0x290 > Aug 09 12:12:48 apollon.suse.de kernel: ? page_not_mapped+0x20/0x20 > Aug 09 12:12:48 apollon.suse.de kernel: ? page_get_anon_vma+0x80/0x80 > Aug 09 12:12:48 apollon.suse.de kernel: migrate_pages+0x8cd/0xbc0 > Aug 09 12:12:48 apollon.suse.de kernel: ? fast_isolate_freepages+0x6b0/0x6b0 > Aug 09 12:12:48 apollon.suse.de kernel: ? move_freelist_tail+0xb0/0xb0 > Aug 09 12:12:48 apollon.suse.de kernel: compact_zone+0x669/0xc80 > Aug 09 12:12:48 apollon.suse.de kernel: compact_zone_order+0xc6/0xf0 > Aug 09 12:12:48 apollon.suse.de kernel: try_to_compact_pages+0xcc/0x2a0 > Aug 09 12:12:48 apollon.suse.de kernel: > __alloc_pages_direct_compact+0x7c/0x150 > Aug 09 12:12:48 apollon.suse.de kernel: __alloc_pages_slowpath+0x1ee/0xd00 > Aug 09 12:12:48 apollon.suse.de kernel: ? vmx_vcpu_load+0x100/0x120 > [kvm_intel] > > Full logs can be found under https://pastebin.com/KJ6tccj
Re: [Intel-gfx] 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
Quoting Martin Wilck (2019-08-09 13:41:42) > This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default > (5.3-rc3 with just a few patches on top), after starting a KVM virtual > machine. The X screen was frozen. Remote login via ssh was still > possible, thus I was able to retrieve basic logs. > > sysrq-w showed two blocked processes (kcompactd0 and KVM). After a > minute, the same two processes were still blocked. KVM seems to try to > acquire a lock that kcompactd is holding. kcompactd is waiting for IO > to complete on pages owned by the i915 driver. My bad, it's known. We haven't decided on whether to revert the unfortunate recursive locking (and so hit another warn elsewhere) or to ignore the dirty pages (and so risk losing data across swap). cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around set_page_dirty()") -Chris ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
5.3-rc3: Frozen graphics with kcompactd migrating i915 pages
This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default (5.3-rc3 with just a few patches on top), after starting a KVM virtual machine. The X screen was frozen. Remote login via ssh was still possible, thus I was able to retrieve basic logs. sysrq-w showed two blocked processes (kcompactd0 and KVM). After a minute, the same two processes were still blocked. KVM seems to try to acquire a lock that kcompactd is holding. kcompactd is waiting for IO to complete on pages owned by the i915 driver. kcompactd stack: Aug 09 12:12:48 apollon.suse.de kernel: sysrq: Show Blocked State Aug 09 12:12:48 apollon.suse.de kernel: taskPC stack pid father Aug 09 12:12:48 apollon.suse.de kernel: kcompactd0 D043 2 0x80004000 Aug 09 12:12:48 apollon.suse.de kernel: Call Trace: Aug 09 12:12:48 apollon.suse.de kernel: ? __schedule+0x2af/0x6a0 Aug 09 12:12:48 apollon.suse.de kernel: schedule+0x33/0x90 Aug 09 12:12:48 apollon.suse.de kernel: io_schedule+0x12/0x40 Aug 09 12:12:48 apollon.suse.de kernel: __lock_page+0x123/0x200 Aug 09 12:12:48 apollon.suse.de kernel: ? gen8_ppgtt_clear_pdp+0xc0/0x140 [i915] Aug 09 12:12:48 apollon.suse.de kernel: ? file_fdatawait_range+0x20/0x20 Aug 09 12:12:48 apollon.suse.de kernel: set_page_dirty_lock+0x49/0x50 Aug 09 12:12:48 apollon.suse.de kernel: i915_gem_userptr_put_pages+0x13f/0x1c0 [i915] Aug 09 12:12:48 apollon.suse.de kernel: __i915_gem_object_put_pages+0x5e/0xa0 [i915] Aug 09 12:12:48 apollon.suse.de kernel: userptr_mn_invalidate_range_start+0x1ff/0x220 [i915] Aug 09 12:12:48 apollon.suse.de kernel: __mmu_notifier_invalidate_range_start+0x57/0xa0 Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap_one+0xa0b/0xae0 Aug 09 12:12:48 apollon.suse.de kernel: ? __mod_lruvec_state+0x3f/0xf0 Aug 09 12:12:48 apollon.suse.de kernel: rmap_walk_file+0xf2/0x250 Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap+0xa6/0xe0 Aug 09 12:12:48 apollon.suse.de kernel: ? page_remove_rmap+0x290/0x290 Aug 09 12:12:48 apollon.suse.de kernel: ? page_not_mapped+0x20/0x20 Aug 09 12:12:48 apollon.suse.de kernel: ? page_get_anon_vma+0x80/0x80 Aug 09 12:12:48 apollon.suse.de kernel: migrate_pages+0x8cd/0xbc0 Aug 09 12:12:48 apollon.suse.de kernel: ? fast_isolate_freepages+0x6b0/0x6b0 Aug 09 12:12:48 apollon.suse.de kernel: ? move_freelist_tail+0xb0/0xb0 Aug 09 12:12:48 apollon.suse.de kernel: compact_zone+0x669/0xc80 Aug 09 12:12:48 apollon.suse.de kernel: ? entry_SYSCALL_64_after_hwframe+0xb8/0xbe Aug 09 12:12:48 apollon.suse.de kernel: kcompactd_do_work+0x120/0x290 KVM stack: Aug 09 12:12:48 apollon.suse.de kernel: CPU 0/KVM D0 25189 1 0x0320 Aug 09 12:12:48 apollon.suse.de kernel: Call Trace: Aug 09 12:12:48 apollon.suse.de kernel: ? __schedule+0x2af/0x6a0 Aug 09 12:12:48 apollon.suse.de kernel: schedule+0x33/0x90 Aug 09 12:12:48 apollon.suse.de kernel: schedule_preempt_disabled+0xa/0x10 Aug 09 12:12:48 apollon.suse.de kernel: __mutex_lock.isra.0+0x172/0x4d0 Aug 09 12:12:48 apollon.suse.de kernel: userptr_mn_invalidate_range_start+0x1bf/0x220 [i915] Aug 09 12:12:48 apollon.suse.de kernel: __mmu_notifier_invalidate_range_start+0x57/0xa0 Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap_one+0xa0b/0xae0 Aug 09 12:12:48 apollon.suse.de kernel: rmap_walk_file+0xf2/0x250 Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap+0xa6/0xe0 Aug 09 12:12:48 apollon.suse.de kernel: ? page_remove_rmap+0x290/0x290 Aug 09 12:12:48 apollon.suse.de kernel: ? page_not_mapped+0x20/0x20 Aug 09 12:12:48 apollon.suse.de kernel: ? page_get_anon_vma+0x80/0x80 Aug 09 12:12:48 apollon.suse.de kernel: migrate_pages+0x8cd/0xbc0 Aug 09 12:12:48 apollon.suse.de kernel: ? fast_isolate_freepages+0x6b0/0x6b0 Aug 09 12:12:48 apollon.suse.de kernel: ? move_freelist_tail+0xb0/0xb0 Aug 09 12:12:48 apollon.suse.de kernel: compact_zone+0x669/0xc80 Aug 09 12:12:48 apollon.suse.de kernel: compact_zone_order+0xc6/0xf0 Aug 09 12:12:48 apollon.suse.de kernel: try_to_compact_pages+0xcc/0x2a0 Aug 09 12:12:48 apollon.suse.de kernel: __alloc_pages_direct_compact+0x7c/0x150 Aug 09 12:12:48 apollon.suse.de kernel: __alloc_pages_slowpath+0x1ee/0xd00 Aug 09 12:12:48 apollon.suse.de kernel: ? vmx_vcpu_load+0x100/0x120 [kvm_intel] Full logs can be found under https://pastebin.com/KJ6tccj4 I haven't yet tried if this is reproducible. Regards Martin