Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore
Quoting Daniel Vetter (2018-03-27 08:01:17)
> On Mon, Mar 26, 2018 at 09:08:33PM +0100, Chris Wilson wrote:
> > Quoting Patchwork (2018-03-26 17:53:44)
> > > Test gem_userptr_blits:
> > >         Subgroup coherency-unsync:
> > >                 pass       -> INCOMPLETE (shard-hsw)
> >
> > Forgot that obj->userptr.mn may not exist.
> >
> > >         Subgroup dmabuf-sync:
> > >                 pass       -> DMESG-WARN (shard-hsw)
> >
> > But this is the tricky lockdep one, warning of the recursion from gup
> > into mmu_invalidate_range, i.e.
> >
> > down_read(&i915_mmu_notifier->sem);
> >   down_read(&mm_struct->mmap_sem);
> >     gup();
> >       down_write(&i915_mmu_notifier->sem);
> >
> > That seems a genuine deadlock... So I wonder how we managed to get a
> > lockdep splat and not a dead machine. Maybe gup never triggers the
> > recursion for our set of flags? Hmm.
>
> Coffee starting to kick in. If we gup a range it's likely the mm won't
> kick out the same range, but something else. I guess we'd need a really
> huge userptr bo which can't fit into core completely to actually have a
> reliable chance at triggering this. Would probably deadlock the box :-/
>
> I think Jerome's recommendation is the sequence counter stuff from kvm,
> plus retrying forever on the gup side. That would convert the same
> deadlock into a livelock, but well, can't have it all :-)

Pre-coffee state also thinks it would trigger the second fs_reclaim
lockdep if it was sufficiently annotated.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore
On Mon, Mar 26, 2018 at 09:08:33PM +0100, Chris Wilson wrote:
> Quoting Patchwork (2018-03-26 17:53:44)
> > Test gem_userptr_blits:
> >         Subgroup coherency-unsync:
> >                 pass       -> INCOMPLETE (shard-hsw)
>
> Forgot that obj->userptr.mn may not exist.
>
> >         Subgroup dmabuf-sync:
> >                 pass       -> DMESG-WARN (shard-hsw)
>
> But this is the tricky lockdep one, warning of the recursion from gup
> into mmu_invalidate_range, i.e.
>
> down_read(&i915_mmu_notifier->sem);
>   down_read(&mm_struct->mmap_sem);
>     gup();
>       down_write(&i915_mmu_notifier->sem);
>
> That seems a genuine deadlock... So I wonder how we managed to get a
> lockdep splat and not a dead machine. Maybe gup never triggers the
> recursion for our set of flags? Hmm.

Coffee starting to kick in. If we gup a range it's likely the mm won't
kick out the same range, but something else. I guess we'd need a really
huge userptr bo which can't fit into core completely to actually have a
reliable chance at triggering this. Would probably deadlock the box :-/

I think Jerome's recommendation is the sequence counter stuff from kvm,
plus retrying forever on the gup side. That would convert the same
deadlock into a livelock, but well, can't have it all :-)

And I think once you've killed the task, the gup worker hopefully
realizes it's wasting time and gives up.

For the kvm stuff: look at #intel-gfx scrollback, we discussed all the
necessary bits. Plus Jerome showed some new helpers that would avoid
the hand-rolling.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore
On Mon, Mar 26, 2018 at 11:38:55PM +0100, Chris Wilson wrote:
> Quoting Chris Wilson (2018-03-26 21:08:33)
> > But this is the tricky lockdep one, warning of the recursion from gup
> > into mmu_invalidate_range, i.e.
> >
> > down_read(&i915_mmu_notifier->sem);
> >   down_read(&mm_struct->mmap_sem);
> >     gup();
> >       down_write(&i915_mmu_notifier->sem);
> >
> > That seems a genuine deadlock... So I wonder how we managed to get a
> > lockdep splat and not a dead machine. Maybe gup never triggers the
> > recursion for our set of flags? Hmm.
>
> In another universe, CI found
>
> [lockdep splat snipped]
>
> [  255.666805] Chain exists of:
>                  fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem
>
> [  255.666809]  Possible unsafe locking scenario:
>
> [  255.666811]        CPU0                    CPU1
> [  255.666812]        ----                    ----
> [  255.666814]   lock(&mn->sem);
> [  255.666815]                          lock(&mapping->i_mmap_rwsem);
> [  255.666817]                          lock(&mn->sem);
> [  255.666819]   lock(fs_reclaim);
> [  255.666821]
>
> So a shrinker deadlock. That doesn't look easy to wriggle out of, as we
> have a random chunk of code that's between invalidate_range_start and
> invalidate_range_end.

Christian König said something like "with this design you can't allocate
anything while holding locks you might need from the mmu notifier".
Because reclaim eats into the mmu notifiers. But hey, it's before coffee,
so probably best you just ignore me :-)
-Daniel
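Christian's rule, no allocations while holding anything the mmu notifier path can take, usually translates into "allocate first, publish under the lock". A minimal userspace sketch of that shape (all names are hypothetical, and the rwsem calls are reduced to comments so the sketch stays runnable):

```c
#include <stdlib.h>

/*
 * Sketch of "allocate outside, publish inside": kmalloc'ing while
 * holding mn->sem recreates the fs_reclaim -> ... -> mn->sem chain
 * from the splat, because reclaim can recurse into the notifier.
 * Hypothetical names; the lock calls are stand-in comments.
 */
struct range_node {
	unsigned long start, end;
	struct range_node *next;
};

struct notifier_state {
	struct range_node *ranges;	/* would be protected by mn->sem */
};

int track_range(struct notifier_state *st, unsigned long start,
		unsigned long end)
{
	/* Allocation happens before the lock, where reclaim is safe. */
	struct range_node *node = malloc(sizeof(*node));

	if (!node)
		return -1;
	/* down_write(&mn->sem); -- no allocations beyond this point */
	node->start = start;
	node->end = end;
	node->next = st->ranges;
	st->ranges = node;
	/* up_write(&mn->sem); */
	return 0;
}
```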
Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore
Quoting Chris Wilson (2018-03-26 21:08:33)
> Quoting Patchwork (2018-03-26 17:53:44)
> > Test gem_userptr_blits:
> >         Subgroup coherency-unsync:
> >                 pass       -> INCOMPLETE (shard-hsw)
>
> Forgot that obj->userptr.mn may not exist.
>
> >         Subgroup dmabuf-sync:
> >                 pass       -> DMESG-WARN (shard-hsw)
>
> But this is the tricky lockdep one, warning of the recursion from gup
> into mmu_invalidate_range, i.e.
>
> down_read(&i915_mmu_notifier->sem);
>   down_read(&mm_struct->mmap_sem);
>     gup();
>       down_write(&i915_mmu_notifier->sem);
>
> That seems a genuine deadlock... So I wonder how we managed to get a
> lockdep splat and not a dead machine. Maybe gup never triggers the
> recursion for our set of flags? Hmm.

In another universe, CI found

[  255.666496] ======================================================
[  255.666498] WARNING: possible circular locking dependency detected
[  255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G     U  W
[  255.666502] ------------------------------------------------------
[  255.666503] gem_userptr_bli/4794 is trying to acquire lock:
[  255.666505]  (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.12+0x0/0x30
[  255.666510]
               but task is already holding lock:
[  255.666512]  (&mn->sem){+.+.}, at: [<7c59ba79>] i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
[  255.666553]
               which lock already depends on the new lock.

[  255.666555]
               the existing dependency chain (in reverse order) is:
[  255.666557]
               -> #2 (&mn->sem){+.+.}:
[  255.666578]        i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
[  255.666581]        __mmu_notifier_invalidate_range_start+0x73/0xb0
[  255.666584]        zap_page_range_single+0xcc/0xe0
[  255.666586]        unmap_mapping_pages+0xd4/0x110
[  255.06]        i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
[  255.25]        i915_vma_unbind+0x60a/0xa10 [i915]
[  255.44]        i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
[  255.62]        i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
[  255.65]        drm_ioctl_kernel+0x60/0xa0
[  255.67]        drm_ioctl+0x27e/0x320
[  255.69]        do_vfs_ioctl+0x8a/0x670
[  255.70]        SyS_ioctl+0x36/0x70
[  255.72]        do_syscall_64+0x65/0x1a0
[  255.75]        entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  255.76]
               -> #1 (&mapping->i_mmap_rwsem){}:
[  255.80]        unmap_mapping_pages+0x3d/0x110
[  255.98]        i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
[  255.666716]        i915_vma_unbind+0x60a/0xa10 [i915]
[  255.666734]        i915_gem_object_unbind+0xa0/0x130 [i915]
[  255.666751]        i915_gem_shrink+0x2d1/0x5d0 [i915]
[  255.666767]        i915_drop_caches_set+0x92/0x190 [i915]
[  255.666770]        simple_attr_write+0xab/0xc0
[  255.666772]        full_proxy_write+0x4b/0x70
[  255.666774]        __vfs_write+0x1e/0x130
[  255.666776]        vfs_write+0xbd/0x1b0
[  255.666778]        SyS_write+0x40/0xa0
[  255.666779]        do_syscall_64+0x65/0x1a0
[  255.666781]        entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  255.666783]
               -> #0 (fs_reclaim){+.+.}:
[  255.666786]        fs_reclaim_acquire.part.12+0x24/0x30
[  255.666788]        __alloc_pages_nodemask+0x1f1/0x11d0
[  255.666790]        __get_free_pages+0x9/0x40
[  255.666792]        __pud_alloc+0x25/0xb0
[  255.666794]        copy_page_range+0xa75/0xaf0
[  255.666796]        copy_process.part.7+0x1267/0x1d90
[  255.666798]        _do_fork+0xc0/0x6b0
[  255.666800]        do_syscall_64+0x65/0x1a0
[  255.666801]        entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  255.666803]
               other info that might help us debug this:

[  255.666805] Chain exists of:
                 fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem

[  255.666809]  Possible unsafe locking scenario:

[  255.666811]        CPU0                    CPU1
[  255.666812]        ----                    ----
[  255.666814]   lock(&mn->sem);
[  255.666815]                          lock(&mapping->i_mmap_rwsem);
[  255.666817]                          lock(&mn->sem);
[  255.666819]   lock(fs_reclaim);
[  255.666821]

So a shrinker deadlock. That doesn't look easy to wriggle out of, as we
have a random chunk of code that's between invalidate_range_start and
invalidate_range_end.
-Chris
Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore
Quoting Patchwork (2018-03-26 17:53:44)
> Test gem_userptr_blits:
>         Subgroup coherency-unsync:
>                 pass       -> INCOMPLETE (shard-hsw)

Forgot that obj->userptr.mn may not exist.

>         Subgroup dmabuf-sync:
>                 pass       -> DMESG-WARN (shard-hsw)

But this is the tricky lockdep one, warning of the recursion from gup
into mmu_invalidate_range, i.e.

down_read(&i915_mmu_notifier->sem);
  down_read(&mm_struct->mmap_sem);
    gup();
      down_write(&i915_mmu_notifier->sem);

That seems a genuine deadlock... So I wonder how we managed to get a
lockdep splat and not a dead machine. Maybe gup never triggers the
recursion for our set of flags? Hmm.
-Chris