Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-27 Thread Chris Wilson
Quoting Daniel Vetter (2018-03-27 11:01:09)
> On Tue, Mar 27, 2018 at 08:19:33AM +0100, Chris Wilson wrote:
> > Quoting Daniel Vetter (2018-03-27 07:48:00)
> > > On Mon, Mar 26, 2018 at 11:38:55PM +0100, Chris Wilson wrote:
> > > > Quoting Chris Wilson (2018-03-26 21:08:33)
> > > > > Quoting Patchwork (2018-03-26 17:53:44)
> > > > > > Test gem_userptr_blits:
> > > > > > Subgroup coherency-unsync:
> > > > > > pass   -> INCOMPLETE (shard-hsw)
> > > > > 
> > > > > Forgot that obj->userptr.mn may not exist.
> > > > > 
> > > > > > Subgroup dmabuf-sync:
> > > > > > pass   -> DMESG-WARN (shard-hsw)
> > > > > 
> > > > > But this is the tricky lockdep one, warning of the recursion from gup
> > > > > into mmu_invalidate_range, i.e.
> > > > > 
> > > > > down_read(&i915_mmu_notifier->sem);
> > > > > down_read(&mm_struct->mmap_sem);
> > > > >   gup();
> > > > >   down_write(&i915_mmu_notifier->sem);
> > > > > 
> > > > > That seems a genuine deadlock... So I wonder how we managed to get a
> > > > > lockdep splat and not a dead machine. Maybe gup never triggers the
> > > > > recursion for our set of flags? Hmm.
> > > > 
> > > > In another universe, CI found
> > > > 
> > > > [  255.666496] ==
> > > > [  255.666498] WARNING: possible circular locking dependency detected
> > > > [  255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G U  W   
> > > > [  255.666502] --
> > > > [  255.666503] gem_userptr_bli/4794 is trying to acquire lock:
> > > > [  255.666505]  (fs_reclaim){+.+.}, at: [] 
> > > > fs_reclaim_acquire.part.12+0x0/0x30
> > > > [  255.666510] 
> > > >but task is already holding lock:
> > > > [  255.666512]  (&mn->sem){+.+.}, at: [<7c59ba79>] 
> > > > i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > > > [  255.666553] 
> > > >which lock already depends on the new lock.
> > > > 
> > > > [  255.666555] 
> > > >the existing dependency chain (in reverse order) is:
> > > > [  255.666557] 
> > > >-> #2 (&mn->sem){+.+.}:
> > > > [  255.666578]
> > > > i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > > > [  255.666581]__mmu_notifier_invalidate_range_start+0x73/0xb0
> > > > [  255.666584]zap_page_range_single+0xcc/0xe0
> > > > [  255.666586]unmap_mapping_pages+0xd4/0x110
> > > > [  255.06]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > > > [  255.25]i915_vma_unbind+0x60a/0xa10 [i915]
> > > > [  255.44]i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
> > > > [  255.62]i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
> > > > [  255.65]drm_ioctl_kernel+0x60/0xa0
> > > > [  255.67]drm_ioctl+0x27e/0x320
> > > > [  255.69]do_vfs_ioctl+0x8a/0x670
> > > > [  255.70]SyS_ioctl+0x36/0x70
> > > > [  255.72]do_syscall_64+0x65/0x1a0
> > > > [  255.75]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > > [  255.76] 
> > > >-> #1 (&mapping->i_mmap_rwsem){}:
> > > > [  255.80]unmap_mapping_pages+0x3d/0x110
> > > > [  255.98]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > > > [  255.666716]i915_vma_unbind+0x60a/0xa10 [i915]
> > > > [  255.666734]i915_gem_object_unbind+0xa0/0x130 [i915]
> > > > [  255.666751]i915_gem_shrink+0x2d1/0x5d0 [i915]
> > > > [  255.666767]i915_drop_caches_set+0x92/0x190 [i915]
> > > > [  255.666770]simple_attr_write+0xab/0xc0
> > > > [  255.666772]full_proxy_write+0x4b/0x70
> > > > [  255.666774]__vfs_write+0x1e/0x130
> > > > [  255.666776]vfs_write+0xbd/0x1b0
> > > > [  255.666778]SyS_write+0x40/0xa0
> > > > [  255.666779]do_syscall_64+0x65/0x1a0
> > > > [  255.666781]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > > [  255.666783] 
> > > >-> #0 (fs_reclaim){+.+.}:
> > > > [  255.666786]fs_reclaim_acquire.part.12+0x24/0x30
> > > > [  255.666788]__alloc_pages_nodemask+0x1f1/0x11d0
> > > > [  255.666790]__get_free_pages+0x9/0x40
> > > > [  255.666792]__pud_alloc+0x25/0xb0
> > > > [  255.666794]copy_page_range+0xa75/0xaf0
> > > > [  255.666796]copy_process.part.7+0x1267/0x1d90
> > > > [  255.666798]_do_fork+0xc0/0x6b0
> > > > [  255.666800]do_syscall_64+0x65/0x1a0
> > > > [  255.666801]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > > [  255.666803] 
> > > >other info that might help us debug this:
> > > > 
> > > > [  255.666805] Chain exists of:
> > > >  fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem
> > > > 
> > > > [  255.666809]  Possible unsafe locking scenario:
> > > > 
> > > > [  255.666811]       CPU0                    CPU1

Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-27 Thread Daniel Vetter
On Tue, Mar 27, 2018 at 08:19:33AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2018-03-27 07:48:00)
> > On Mon, Mar 26, 2018 at 11:38:55PM +0100, Chris Wilson wrote:
> > > Quoting Chris Wilson (2018-03-26 21:08:33)
> > > > Quoting Patchwork (2018-03-26 17:53:44)
> > > > > Test gem_userptr_blits:
> > > > > Subgroup coherency-unsync:
> > > > > pass   -> INCOMPLETE (shard-hsw)
> > > > 
> > > > Forgot that obj->userptr.mn may not exist.
> > > > 
> > > > > Subgroup dmabuf-sync:
> > > > > pass   -> DMESG-WARN (shard-hsw)
> > > > 
> > > > But this is the tricky lockdep one, warning of the recursion from gup
> > > > into mmu_invalidate_range, i.e.
> > > > 
> > > > down_read(&i915_mmu_notifier->sem);
> > > > down_read(&mm_struct->mmap_sem);
> > > >   gup();
> > > >   down_write(&i915_mmu_notifier->sem);
> > > > 
> > > > That seems a genuine deadlock... So I wonder how we managed to get a
> > > > lockdep splat and not a dead machine. Maybe gup never triggers the
> > > > recursion for our set of flags? Hmm.
> > > 
> > > In another universe, CI found
> > > 
> > > [  255.666496] ==
> > > [  255.666498] WARNING: possible circular locking dependency detected
> > > [  255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G U  W   
> > > [  255.666502] --
> > > [  255.666503] gem_userptr_bli/4794 is trying to acquire lock:
> > > [  255.666505]  (fs_reclaim){+.+.}, at: [] 
> > > fs_reclaim_acquire.part.12+0x0/0x30
> > > [  255.666510] 
> > >but task is already holding lock:
> > > [  255.666512]  (&mn->sem){+.+.}, at: [<7c59ba79>] 
> > > i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > > [  255.666553] 
> > >which lock already depends on the new lock.
> > > 
> > > [  255.666555] 
> > >the existing dependency chain (in reverse order) is:
> > > [  255.666557] 
> > >-> #2 (&mn->sem){+.+.}:
> > > [  255.666578]
> > > i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > > [  255.666581]__mmu_notifier_invalidate_range_start+0x73/0xb0
> > > [  255.666584]zap_page_range_single+0xcc/0xe0
> > > [  255.666586]unmap_mapping_pages+0xd4/0x110
> > > [  255.06]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > > [  255.25]i915_vma_unbind+0x60a/0xa10 [i915]
> > > [  255.44]i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
> > > [  255.62]i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
> > > [  255.65]drm_ioctl_kernel+0x60/0xa0
> > > [  255.67]drm_ioctl+0x27e/0x320
> > > [  255.69]do_vfs_ioctl+0x8a/0x670
> > > [  255.70]SyS_ioctl+0x36/0x70
> > > [  255.72]do_syscall_64+0x65/0x1a0
> > > [  255.75]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > [  255.76] 
> > >-> #1 (&mapping->i_mmap_rwsem){}:
> > > [  255.80]unmap_mapping_pages+0x3d/0x110
> > > [  255.98]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > > [  255.666716]i915_vma_unbind+0x60a/0xa10 [i915]
> > > [  255.666734]i915_gem_object_unbind+0xa0/0x130 [i915]
> > > [  255.666751]i915_gem_shrink+0x2d1/0x5d0 [i915]
> > > [  255.666767]i915_drop_caches_set+0x92/0x190 [i915]
> > > [  255.666770]simple_attr_write+0xab/0xc0
> > > [  255.666772]full_proxy_write+0x4b/0x70
> > > [  255.666774]__vfs_write+0x1e/0x130
> > > [  255.666776]vfs_write+0xbd/0x1b0
> > > [  255.666778]SyS_write+0x40/0xa0
> > > [  255.666779]do_syscall_64+0x65/0x1a0
> > > [  255.666781]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > [  255.666783] 
> > >-> #0 (fs_reclaim){+.+.}:
> > > [  255.666786]fs_reclaim_acquire.part.12+0x24/0x30
> > > [  255.666788]__alloc_pages_nodemask+0x1f1/0x11d0
> > > [  255.666790]__get_free_pages+0x9/0x40
> > > [  255.666792]__pud_alloc+0x25/0xb0
> > > [  255.666794]copy_page_range+0xa75/0xaf0
> > > [  255.666796]copy_process.part.7+0x1267/0x1d90
> > > [  255.666798]_do_fork+0xc0/0x6b0
> > > [  255.666800]do_syscall_64+0x65/0x1a0
> > > [  255.666801]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > [  255.666803] 
> > >other info that might help us debug this:
> > > 
> > > [  255.666805] Chain exists of:
> > >  fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem
> > > 
> > > [  255.666809]  Possible unsafe locking scenario:
> > > 
> > > [  255.666811]       CPU0                    CPU1
> > > [  255.666812]
> > > [  255.666814]   lock(&mn->sem);
> > > [  255.666815]                               lock(&mapping->i_mmap_rwsem);
> > > [  255.666817]  

Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-27 Thread Chris Wilson
Quoting Daniel Vetter (2018-03-27 08:01:17)
> On Mon, Mar 26, 2018 at 09:08:33PM +0100, Chris Wilson wrote:
> > Quoting Patchwork (2018-03-26 17:53:44)
> > > Test gem_userptr_blits:
> > > Subgroup coherency-unsync:
> > > pass   -> INCOMPLETE (shard-hsw)
> > 
> > Forgot that obj->userptr.mn may not exist.
> > 
> > > Subgroup dmabuf-sync:
> > > pass   -> DMESG-WARN (shard-hsw)
> > 
> > But this is the tricky lockdep one, warning of the recursion from gup
> > into mmu_invalidate_range, i.e.
> > 
> > down_read(&i915_mmu_notifier->sem);
> > down_read(&mm_struct->mmap_sem);
> >   gup();
> >   down_write(&i915_mmu_notifier->sem);
> > 
> > That seems a genuine deadlock... So I wonder how we managed to get a
> > lockdep splat and not a dead machine. Maybe gup never triggers the
> > recursion for our set of flags? Hmm.
> 
> Coffee starting to kick in. If we gup a range it's likely the mm won't
> kick out the same range, but something else. I guess we'd need a really
> huge userptr bo which can't fit into core completely to actually have a
> reliable chance at triggering this. Would probably deadlock the box :-/
> 
> I think Jerome's recommendation is the sequence counter stuff from kvm,
> plus retrying forever on the gup side. That would convert the same
> deadlock into a livelock, but well can't have it all :-)

Pre-coffee state also thinks it would trigger the second fs_reclaim
lockdep if it was sufficiently annotated.
-Chris


Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-27 Thread Chris Wilson
Quoting Daniel Vetter (2018-03-27 07:48:00)
> On Mon, Mar 26, 2018 at 11:38:55PM +0100, Chris Wilson wrote:
> > Quoting Chris Wilson (2018-03-26 21:08:33)
> > > Quoting Patchwork (2018-03-26 17:53:44)
> > > > Test gem_userptr_blits:
> > > > Subgroup coherency-unsync:
> > > > pass   -> INCOMPLETE (shard-hsw)
> > > 
> > > Forgot that obj->userptr.mn may not exist.
> > > 
> > > > Subgroup dmabuf-sync:
> > > > pass   -> DMESG-WARN (shard-hsw)
> > > 
> > > But this is the tricky lockdep one, warning of the recursion from gup
> > > into mmu_invalidate_range, i.e.
> > > 
> > > down_read(&i915_mmu_notifier->sem);
> > > down_read(&mm_struct->mmap_sem);
> > >   gup();
> > >   down_write(&i915_mmu_notifier->sem);
> > > 
> > > That seems a genuine deadlock... So I wonder how we managed to get a
> > > lockdep splat and not a dead machine. Maybe gup never triggers the
> > > recursion for our set of flags? Hmm.
> > 
> > In another universe, CI found
> > 
> > [  255.666496] ==
> > [  255.666498] WARNING: possible circular locking dependency detected
> > [  255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G U  W   
> > [  255.666502] --
> > [  255.666503] gem_userptr_bli/4794 is trying to acquire lock:
> > [  255.666505]  (fs_reclaim){+.+.}, at: [] 
> > fs_reclaim_acquire.part.12+0x0/0x30
> > [  255.666510] 
> >but task is already holding lock:
> > [  255.666512]  (&mn->sem){+.+.}, at: [<7c59ba79>] 
> > i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > [  255.666553] 
> >which lock already depends on the new lock.
> > 
> > [  255.666555] 
> >the existing dependency chain (in reverse order) is:
> > [  255.666557] 
> >-> #2 (&mn->sem){+.+.}:
> > [  255.666578]i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 
> > [i915]
> > [  255.666581]__mmu_notifier_invalidate_range_start+0x73/0xb0
> > [  255.666584]zap_page_range_single+0xcc/0xe0
> > [  255.666586]unmap_mapping_pages+0xd4/0x110
> > [  255.06]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > [  255.25]i915_vma_unbind+0x60a/0xa10 [i915]
> > [  255.44]i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
> > [  255.62]i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
> > [  255.65]drm_ioctl_kernel+0x60/0xa0
> > [  255.67]drm_ioctl+0x27e/0x320
> > [  255.69]do_vfs_ioctl+0x8a/0x670
> > [  255.70]SyS_ioctl+0x36/0x70
> > [  255.72]do_syscall_64+0x65/0x1a0
> > [  255.75]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > [  255.76] 
> >-> #1 (&mapping->i_mmap_rwsem){}:
> > [  255.80]unmap_mapping_pages+0x3d/0x110
> > [  255.98]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > [  255.666716]i915_vma_unbind+0x60a/0xa10 [i915]
> > [  255.666734]i915_gem_object_unbind+0xa0/0x130 [i915]
> > [  255.666751]i915_gem_shrink+0x2d1/0x5d0 [i915]
> > [  255.666767]i915_drop_caches_set+0x92/0x190 [i915]
> > [  255.666770]simple_attr_write+0xab/0xc0
> > [  255.666772]full_proxy_write+0x4b/0x70
> > [  255.666774]__vfs_write+0x1e/0x130
> > [  255.666776]vfs_write+0xbd/0x1b0
> > [  255.666778]SyS_write+0x40/0xa0
> > [  255.666779]do_syscall_64+0x65/0x1a0
> > [  255.666781]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > [  255.666783] 
> >-> #0 (fs_reclaim){+.+.}:
> > [  255.666786]fs_reclaim_acquire.part.12+0x24/0x30
> > [  255.666788]__alloc_pages_nodemask+0x1f1/0x11d0
> > [  255.666790]__get_free_pages+0x9/0x40
> > [  255.666792]__pud_alloc+0x25/0xb0
> > [  255.666794]copy_page_range+0xa75/0xaf0
> > [  255.666796]copy_process.part.7+0x1267/0x1d90
> > [  255.666798]_do_fork+0xc0/0x6b0
> > [  255.666800]do_syscall_64+0x65/0x1a0
> > [  255.666801]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > [  255.666803] 
> >other info that might help us debug this:
> > 
> > [  255.666805] Chain exists of:
> >  fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem
> > 
> > [  255.666809]  Possible unsafe locking scenario:
> > 
> > [  255.666811]       CPU0                    CPU1
> > [  255.666812]
> > [  255.666814]   lock(&mn->sem);
> > [  255.666815]                               lock(&mapping->i_mmap_rwsem);
> > [  255.666817]                               lock(&mn->sem);
> > [  255.666819]   lock(fs_reclaim);
> > [  255.666821] 
> > 
> > So a shrinker deadlock. That doesn't look easy to wriggle out of, as we
> > have a random chunk of code that's between invalidate_range_start and
> > invalidate_range_end.
> 
> Chri

Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-27 Thread Daniel Vetter
On Mon, Mar 26, 2018 at 09:08:33PM +0100, Chris Wilson wrote:
> Quoting Patchwork (2018-03-26 17:53:44)
> > Test gem_userptr_blits:
> > Subgroup coherency-unsync:
> > pass   -> INCOMPLETE (shard-hsw)
> 
> Forgot that obj->userptr.mn may not exist.
> 
> > Subgroup dmabuf-sync:
> > pass   -> DMESG-WARN (shard-hsw)
> 
> But this is the tricky lockdep one, warning of the recursion from gup
> into mmu_invalidate_range, i.e.
> 
> down_read(&i915_mmu_notifier->sem);
> down_read(&mm_struct->mmap_sem);
>   gup();
>   down_write(&i915_mmu_notifier->sem);
> 
> That seems a genuine deadlock... So I wonder how we managed to get a
> lockdep splat and not a dead machine. Maybe gup never triggers the
> recursion for our set of flags? Hmm.

Coffee starting to kick in. If we gup a range it's likely the mm won't
kick out the same range, but something else. I guess we'd need a really
huge userptr bo which can't fit into core completely to actually have a
reliable chance at triggering this. Would probably deadlock the box :-/

I think Jerome's recommendation is the sequence counter stuff from kvm,
plus retrying forever on the gup side. That would convert the same
deadlock into a livelock, but well can't have it all :-) And I think once
you've killed the task the gup worker hopefully realizes it's wasting time
and gives up.
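
(A rough sketch of that kvm-style sequence-counter-and-retry scheme, for
illustration only; every name below is invented rather than an existing kvm
or i915 API.)

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/* Illustrative per-object state, not an existing structure. */
struct userptr_seq_sketch {
        spinlock_t lock;
        unsigned long seq;      /* bumped by every invalidation */
        unsigned int active;    /* invalidations currently in flight */
};

/* invalidate_range_start() side: mark the range as unstable. */
static void sketch_invalidate_start(struct userptr_seq_sketch *s)
{
        spin_lock(&s->lock);
        s->active++;
        s->seq++;
        spin_unlock(&s->lock);
}

/* invalidate_range_end() side. */
static void sketch_invalidate_end(struct userptr_seq_sketch *s)
{
        spin_lock(&s->lock);
        s->active--;
        spin_unlock(&s->lock);
}

/* gup side: sample, pin, then retry (potentially forever) if we raced. */
static long sketch_pin_user_range(struct userptr_seq_sketch *s,
                                  unsigned long start, unsigned long npages,
                                  struct page **pages)
{
        unsigned long seq;
        long pinned, i;

retry:
        seq = READ_ONCE(s->seq);
        smp_rmb();

        down_read(&current->mm->mmap_sem);
        pinned = get_user_pages(start, npages, FOLL_WRITE, pages, NULL);
        up_read(&current->mm->mmap_sem);
        if (pinned < 0)
                return pinned;

        spin_lock(&s->lock);
        if (s->active || s->seq != seq) {
                /* An invalidation raced with us: drop the pages, try again. */
                spin_unlock(&s->lock);
                for (i = 0; i < pinned; i++)
                        put_page(pages[i]);
                goto retry;
        }
        /* Commit: publish the pinned pages while still holding the lock. */
        spin_unlock(&s->lock);
        return pinned;
}

The point of the pattern is that the gup path no longer holds anything across
the notifier callbacks; it only checks afterwards whether an invalidation
overlapped, and redoes the work if so.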

For the kvm stuff: Look at #intel-gfx scrollback, we discussed all the
necessary bits. Plus Jerome showed some new helpers that would avoid the
hand-rolling.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-26 Thread Daniel Vetter
On Mon, Mar 26, 2018 at 11:38:55PM +0100, Chris Wilson wrote:
> Quoting Chris Wilson (2018-03-26 21:08:33)
> > Quoting Patchwork (2018-03-26 17:53:44)
> > > Test gem_userptr_blits:
> > > Subgroup coherency-unsync:
> > > pass   -> INCOMPLETE (shard-hsw)
> > 
> > Forgot that obj->userptr.mn may not exist.
> > 
> > > Subgroup dmabuf-sync:
> > > pass   -> DMESG-WARN (shard-hsw)
> > 
> > But this is the tricky lockdep one, warning of the recursion from gup
> > into mmu_invalidate_range, i.e.
> > 
> > down_read(&i915_mmu_notifier->sem);
> > down_read(&mm_struct->mmap_sem);
> >   gup();
> >   down_write(&i915_mmu_notifier->sem);
> > 
> > That seems a genuine deadlock... So I wonder how we managed to get a
> > lockdep splat and not a dead machine. Maybe gup never triggers the
> > recursion for our set of flags? Hmm.
> 
> In another universe, CI found
> 
> [  255.666496] ==
> [  255.666498] WARNING: possible circular locking dependency detected
> [  255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G U  W   
> [  255.666502] --
> [  255.666503] gem_userptr_bli/4794 is trying to acquire lock:
> [  255.666505]  (fs_reclaim){+.+.}, at: [] 
> fs_reclaim_acquire.part.12+0x0/0x30
> [  255.666510] 
>but task is already holding lock:
> [  255.666512]  (&mn->sem){+.+.}, at: [<7c59ba79>] 
> i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> [  255.666553] 
>which lock already depends on the new lock.
> 
> [  255.666555] 
>the existing dependency chain (in reverse order) is:
> [  255.666557] 
>-> #2 (&mn->sem){+.+.}:
> [  255.666578]i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 
> [i915]
> [  255.666581]__mmu_notifier_invalidate_range_start+0x73/0xb0
> [  255.666584]zap_page_range_single+0xcc/0xe0
> [  255.666586]unmap_mapping_pages+0xd4/0x110
> [  255.06]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> [  255.25]i915_vma_unbind+0x60a/0xa10 [i915]
> [  255.44]i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
> [  255.62]i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
> [  255.65]drm_ioctl_kernel+0x60/0xa0
> [  255.67]drm_ioctl+0x27e/0x320
> [  255.69]do_vfs_ioctl+0x8a/0x670
> [  255.70]SyS_ioctl+0x36/0x70
> [  255.72]do_syscall_64+0x65/0x1a0
> [  255.75]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [  255.76] 
>-> #1 (&mapping->i_mmap_rwsem){}:
> [  255.80]unmap_mapping_pages+0x3d/0x110
> [  255.98]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> [  255.666716]i915_vma_unbind+0x60a/0xa10 [i915]
> [  255.666734]i915_gem_object_unbind+0xa0/0x130 [i915]
> [  255.666751]i915_gem_shrink+0x2d1/0x5d0 [i915]
> [  255.666767]i915_drop_caches_set+0x92/0x190 [i915]
> [  255.666770]simple_attr_write+0xab/0xc0
> [  255.666772]full_proxy_write+0x4b/0x70
> [  255.666774]__vfs_write+0x1e/0x130
> [  255.666776]vfs_write+0xbd/0x1b0
> [  255.666778]SyS_write+0x40/0xa0
> [  255.666779]do_syscall_64+0x65/0x1a0
> [  255.666781]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [  255.666783] 
>-> #0 (fs_reclaim){+.+.}:
> [  255.666786]fs_reclaim_acquire.part.12+0x24/0x30
> [  255.666788]__alloc_pages_nodemask+0x1f1/0x11d0
> [  255.666790]__get_free_pages+0x9/0x40
> [  255.666792]__pud_alloc+0x25/0xb0
> [  255.666794]copy_page_range+0xa75/0xaf0
> [  255.666796]copy_process.part.7+0x1267/0x1d90
> [  255.666798]_do_fork+0xc0/0x6b0
> [  255.666800]do_syscall_64+0x65/0x1a0
> [  255.666801]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [  255.666803] 
>other info that might help us debug this:
> 
> [  255.666805] Chain exists of:
>  fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem
> 
> [  255.666809]  Possible unsafe locking scenario:
> 
> [  255.666811]       CPU0                    CPU1
> [  255.666812]
> [  255.666814]   lock(&mn->sem);
> [  255.666815]                               lock(&mapping->i_mmap_rwsem);
> [  255.666817]                               lock(&mn->sem);
> [  255.666819]   lock(fs_reclaim);
> [  255.666821] 
> 
> So a shrinker deadlock. That doesn't look easy to wriggle out of, as we
> have a random chunk of code that's between invalidate_range_start and
> invalidate_range_end.

Christian König said something like "with this design you can't allocate
anything while holding locks you might need from the mmu notifier".
Because reclaim eats into the mmu notifiers.
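
(In lockdep terms, roughly: any GFP_KERNEL allocation made while holding a
lock the notifier also needs records that lock -> fs_reclaim, and the shrinker
chain above already records fs_reclaim -> ... -> &mn->sem. A contrived sketch
of the forbidden pattern, not i915 code:)

#include <linux/rwsem.h>
#include <linux/slab.h>

/* "mn_sem" stands in for the patch's per-notifier rwsem. */
static void *alloc_under_notifier_lock_sketch(struct rw_semaphore *mn_sem,
                                              size_t size)
{
        void *ptr;

        down_read(mn_sem);
        /*
         * GFP_KERNEL may enter direct reclaim (fs_reclaim); reclaim may run
         * the i915 shrinker; the shrinker unbinds objects and revokes their
         * mmaps; and that path ends up back in the mmu notifier wanting
         * mn_sem for write.  GFP_NOWAIT/GFP_ATOMIC avoid reclaim, at the
         * cost of allocations that fail under memory pressure.
         */
        ptr = kmalloc(size, GFP_KERNEL);
        up_read(mn_sem);

        return ptr;
}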

But hey it's before coffee, so probably best you just ignore me :-)
-Daniel

Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-26 Thread Chris Wilson
Quoting Chris Wilson (2018-03-26 21:08:33)
> Quoting Patchwork (2018-03-26 17:53:44)
> > Test gem_userptr_blits:
> > Subgroup coherency-unsync:
> > pass   -> INCOMPLETE (shard-hsw)
> 
> Forgot that obj->userptr.mn may not exist.
> 
> > Subgroup dmabuf-sync:
> > pass   -> DMESG-WARN (shard-hsw)
> 
> But this is the tricky lockdep one, warning of the recursion from gup
> into mmu_invalidate_range, i.e.
> 
> down_read(&i915_mmu_notifier->sem);
> down_read(&mm_struct->mmap_sem);
>   gup();
>   down_write(&i915_mmu_notifier->sem);
> 
> That seems a genuine deadlock... So I wonder how we managed to get a
> lockdep splat and not a dead machine. Maybe gup never triggers the
> recursion for our set of flags? Hmm.

In another universe, CI found

[  255.666496] ==
[  255.666498] WARNING: possible circular locking dependency detected
[  255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G U  W   
[  255.666502] --
[  255.666503] gem_userptr_bli/4794 is trying to acquire lock:
[  255.666505]  (fs_reclaim){+.+.}, at: [] 
fs_reclaim_acquire.part.12+0x0/0x30
[  255.666510] 
   but task is already holding lock:
[  255.666512]  (&mn->sem){+.+.}, at: [<7c59ba79>] 
i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
[  255.666553] 
   which lock already depends on the new lock.

[  255.666555] 
   the existing dependency chain (in reverse order) is:
[  255.666557] 
   -> #2 (&mn->sem){+.+.}:
[  255.666578]i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 
[i915]
[  255.666581]__mmu_notifier_invalidate_range_start+0x73/0xb0
[  255.666584]zap_page_range_single+0xcc/0xe0
[  255.666586]unmap_mapping_pages+0xd4/0x110
[  255.06]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
[  255.25]i915_vma_unbind+0x60a/0xa10 [i915]
[  255.44]i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
[  255.62]i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
[  255.65]drm_ioctl_kernel+0x60/0xa0
[  255.67]drm_ioctl+0x27e/0x320
[  255.69]do_vfs_ioctl+0x8a/0x670
[  255.70]SyS_ioctl+0x36/0x70
[  255.72]do_syscall_64+0x65/0x1a0
[  255.75]entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  255.76] 
   -> #1 (&mapping->i_mmap_rwsem){}:
[  255.80]unmap_mapping_pages+0x3d/0x110
[  255.98]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
[  255.666716]i915_vma_unbind+0x60a/0xa10 [i915]
[  255.666734]i915_gem_object_unbind+0xa0/0x130 [i915]
[  255.666751]i915_gem_shrink+0x2d1/0x5d0 [i915]
[  255.666767]i915_drop_caches_set+0x92/0x190 [i915]
[  255.666770]simple_attr_write+0xab/0xc0
[  255.666772]full_proxy_write+0x4b/0x70
[  255.666774]__vfs_write+0x1e/0x130
[  255.666776]vfs_write+0xbd/0x1b0
[  255.666778]SyS_write+0x40/0xa0
[  255.666779]do_syscall_64+0x65/0x1a0
[  255.666781]entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  255.666783] 
   -> #0 (fs_reclaim){+.+.}:
[  255.666786]fs_reclaim_acquire.part.12+0x24/0x30
[  255.666788]__alloc_pages_nodemask+0x1f1/0x11d0
[  255.666790]__get_free_pages+0x9/0x40
[  255.666792]__pud_alloc+0x25/0xb0
[  255.666794]copy_page_range+0xa75/0xaf0
[  255.666796]copy_process.part.7+0x1267/0x1d90
[  255.666798]_do_fork+0xc0/0x6b0
[  255.666800]do_syscall_64+0x65/0x1a0
[  255.666801]entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  255.666803] 
   other info that might help us debug this:

[  255.666805] Chain exists of:
 fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem

[  255.666809]  Possible unsafe locking scenario:

[  255.666811]       CPU0                    CPU1
[  255.666812]
[  255.666814]   lock(&mn->sem);
[  255.666815]                               lock(&mapping->i_mmap_rwsem);
[  255.666817]                               lock(&mn->sem);
[  255.666819]   lock(fs_reclaim);
[  255.666821] 

So a shrinker deadlock. That doesn't look easy to wriggle out of, as we
have a random chunk of code that's between invalidate_range_start and
invalidate_range_end.
-Chris


Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-26 Thread Chris Wilson
Quoting Patchwork (2018-03-26 17:53:44)
> Test gem_userptr_blits:
> Subgroup coherency-unsync:
> pass   -> INCOMPLETE (shard-hsw)

Forgot that obj->userptr.mn may not exist.

> Subgroup dmabuf-sync:
> pass   -> DMESG-WARN (shard-hsw)

But this is the tricky lockdep one, warning of the recursion from gup
into mmu_invalidate_range, i.e.

down_read(&i915_mmu_notifier->sem);
down_read(&mm_struct->mmap_sem);
  gup();
  down_write(&i915_mmu_notifier->sem);

That seems a genuine deadlock... So I wonder how we managed to get a
lockdep splat and not a dead machine. Maybe gup never triggers the
recursion for our set of flags? Hmm.
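
(For illustration, a minimal sketch of the ordering being described, assuming
the patch's per-notifier rwsem; the struct and helper below are invented for
the example and are not the actual i915 userptr code.)

#include <linux/mm.h>
#include <linux/rwsem.h>
#include <linux/sched.h>

/* Stand-in for the patch's per-notifier lock; not the real i915 struct. */
struct i915_mmu_notifier_sketch {
        struct rw_semaphore sem;
};

static long userptr_get_pages_sketch(struct i915_mmu_notifier_sketch *mn,
                                     unsigned long start, unsigned long npages,
                                     struct page **pages)
{
        long pinned;

        /* Hold off invalidate_range_start(), which takes sem for write. */
        down_read(&mn->sem);
        down_read(&current->mm->mmap_sem);

        /*
         * If faulting pages in here triggers unmap or reclaim of another
         * range covered by the same notifier, the notifier callback wants
         * down_write(&mn->sem) while we still hold it for read: deadlock.
         */
        pinned = get_user_pages(start, npages, FOLL_WRITE, pages, NULL);

        up_read(&current->mm->mmap_sem);
        up_read(&mn->sem);
        return pinned;
}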
-Chris
