Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-27 Thread Chris Wilson
Quoting Daniel Vetter (2018-03-27 11:01:09)
> On Tue, Mar 27, 2018 at 08:19:33AM +0100, Chris Wilson wrote:
> > Quoting Daniel Vetter (2018-03-27 07:48:00)
> > > On Mon, Mar 26, 2018 at 11:38:55PM +0100, Chris Wilson wrote:
> > > > Quoting Chris Wilson (2018-03-26 21:08:33)
> > > > > Quoting Patchwork (2018-03-26 17:53:44)
> > > > > > Test gem_userptr_blits:
> > > > > > Subgroup coherency-unsync:
> > > > > > pass   -> INCOMPLETE (shard-hsw)
> > > > > 
> > > > > Forgot that obj->userptr.mn may not exist.
> > > > > 
> > > > > > Subgroup dmabuf-sync:
> > > > > > pass   -> DMESG-WARN (shard-hsw)
> > > > > 
> > > > > But this is the tricky lockdep one, warning of the recursion from gup
> > > > > into mmu_invalidate_range, i.e.
> > > > > 
> > > > > down_read(&i915_mmu_notifier->sem);
> > > > > down_read(&mm_struct->mmap_sem);
> > > > > gup();
> > > > > down_write(&i915_mmu_notifier->sem);
> > > > > 
> > > > > That seems a genuine deadlock... So I wonder how we managed to get a
> > > > > lockdep splat and not a dead machine. Maybe gup never triggers the
> > > > > recursion for our set of flags? Hmm.
> > > > 
> > > > In another universe, CI found
> > > > 
> > > > [  255.666496] ==
> > > > [  255.666498] WARNING: possible circular locking dependency detected
> > > > [  255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G U  W   
> > > > [  255.666502] --
> > > > [  255.666503] gem_userptr_bli/4794 is trying to acquire lock:
> > > > [  255.666505]  (fs_reclaim){+.+.}, at: [] 
> > > > fs_reclaim_acquire.part.12+0x0/0x30
> > > > [  255.666510] 
> > > >but task is already holding lock:
> > > > [  255.666512]  (&mn->sem){+.+.}, at: [<7c59ba79>] 
> > > > i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > > > [  255.666553] 
> > > >which lock already depends on the new lock.
> > > > 
> > > > [  255.666555] 
> > > >the existing dependency chain (in reverse order) is:
> > > > [  255.666557] 
> > > >-> #2 (&mn->sem){+.+.}:
> > > > [  255.666578]
> > > > i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > > > [  255.666581]__mmu_notifier_invalidate_range_start+0x73/0xb0
> > > > [  255.666584]zap_page_range_single+0xcc/0xe0
> > > > [  255.666586]unmap_mapping_pages+0xd4/0x110
> > > > [  255.06]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > > > [  255.25]i915_vma_unbind+0x60a/0xa10 [i915]
> > > > [  255.44]i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
> > > > [  255.62]i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
> > > > [  255.65]drm_ioctl_kernel+0x60/0xa0
> > > > [  255.67]drm_ioctl+0x27e/0x320
> > > > [  255.69]do_vfs_ioctl+0x8a/0x670
> > > > [  255.70]SyS_ioctl+0x36/0x70
> > > > [  255.72]do_syscall_64+0x65/0x1a0
> > > > [  255.75]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > > [  255.76] 
> > > >-> #1 (&mapping->i_mmap_rwsem){}:
> > > > [  255.80]unmap_mapping_pages+0x3d/0x110
> > > > [  255.98]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > > > [  255.666716]i915_vma_unbind+0x60a/0xa10 [i915]
> > > > [  255.666734]i915_gem_object_unbind+0xa0/0x130 [i915]
> > > > [  255.666751]i915_gem_shrink+0x2d1/0x5d0 [i915]
> > > > [  255.666767]i915_drop_caches_set+0x92/0x190 [i915]
> > > > [  255.666770]simple_attr_write+0xab/0xc0
> > > > [  255.666772]full_proxy_write+0x4b/0x70
> > > > [  255.666774]__vfs_write+0x1e/0x130
> > > > [  255.666776]vfs_write+0xbd/0x1b0
> > > > [  255.666778]SyS_write+0x40/0xa0
> > > > [  255.666779]do_syscall_64+0x65/0x1a0
> > > > [  255.666781]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > > [  255.666783] 
> > > >-> #0 (fs_reclaim){+.+.}:
> > > > [  255.666786]fs_reclaim_acquire.part.12+0x24/0x30
> > > > [  255.666788]__alloc_pages_nodemask+0x1f1/0x11d0
> > > > [  255.666790]__get_free_pages+0x9/0x40
> > > > [  255.666792]__pud_alloc+0x25/0xb0
> > > > [  255.666794]copy_page_range+0xa75/0xaf0
> > > > [  255.666796]copy_process.part.7+0x1267/0x1d90
> > > > [  255.666798]_do_fork+0xc0/0x6b0
> > > > [  255.666800]do_syscall_64+0x65/0x1a0
> > > > [  255.666801]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > > [  255.666803] 
> > > >other info that might help us debug this:
> > > > 
> > > > [  255.666805] Chain exists of:
> > > >  fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem
> > > > 
> > > > [  255.666809]  Possible unsafe locking scenario:
> > > > 
> > > > [  255.666811]        CPU0                    CPU1
> > > > [  255.666812]

Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-27 Thread Daniel Vetter
On Tue, Mar 27, 2018 at 08:19:33AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2018-03-27 07:48:00)
> > On Mon, Mar 26, 2018 at 11:38:55PM +0100, Chris Wilson wrote:
> > > Quoting Chris Wilson (2018-03-26 21:08:33)
> > > > Quoting Patchwork (2018-03-26 17:53:44)
> > > > > Test gem_userptr_blits:
> > > > > Subgroup coherency-unsync:
> > > > > pass   -> INCOMPLETE (shard-hsw)
> > > > 
> > > > Forgot that obj->userptr.mn may not exist.
> > > > 
> > > > > Subgroup dmabuf-sync:
> > > > > pass   -> DMESG-WARN (shard-hsw)
> > > > 
> > > > But this is the tricky lockdep one, warning of the recursion from gup
> > > > into mmu_invalidate_range, i.e.
> > > > 
> > > > down_read(&i915_mmu_notifier->sem);
> > > > down_read(&mm_struct->mmap_sem);
> > > > gup();
> > > > down_write(&i915_mmu_notifier->sem);
> > > > 
> > > > That seems a genuine deadlock... So I wonder how we managed to get a
> > > > lockdep splat and not a dead machine. Maybe gup never triggers the
> > > > recursion for our set of flags? Hmm.
> > > 
> > > In another universe, CI found
> > > 
> > > [  255.666496] ==
> > > [  255.666498] WARNING: possible circular locking dependency detected
> > > [  255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G U  W   
> > > [  255.666502] --
> > > [  255.666503] gem_userptr_bli/4794 is trying to acquire lock:
> > > [  255.666505]  (fs_reclaim){+.+.}, at: [] 
> > > fs_reclaim_acquire.part.12+0x0/0x30
> > > [  255.666510] 
> > >but task is already holding lock:
> > > [  255.666512]  (&mn->sem){+.+.}, at: [<7c59ba79>] 
> > > i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > > [  255.666553] 
> > >which lock already depends on the new lock.
> > > 
> > > [  255.666555] 
> > >the existing dependency chain (in reverse order) is:
> > > [  255.666557] 
> > >-> #2 (&mn->sem){+.+.}:
> > > [  255.666578]
> > > i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > > [  255.666581]__mmu_notifier_invalidate_range_start+0x73/0xb0
> > > [  255.666584]zap_page_range_single+0xcc/0xe0
> > > [  255.666586]unmap_mapping_pages+0xd4/0x110
> > > [  255.06]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > > [  255.25]i915_vma_unbind+0x60a/0xa10 [i915]
> > > [  255.44]i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
> > > [  255.62]i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
> > > [  255.65]drm_ioctl_kernel+0x60/0xa0
> > > [  255.67]drm_ioctl+0x27e/0x320
> > > [  255.69]do_vfs_ioctl+0x8a/0x670
> > > [  255.70]SyS_ioctl+0x36/0x70
> > > [  255.72]do_syscall_64+0x65/0x1a0
> > > [  255.75]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > [  255.76] 
> > >-> #1 (&mapping->i_mmap_rwsem){}:
> > > [  255.80]unmap_mapping_pages+0x3d/0x110
> > > [  255.98]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > > [  255.666716]i915_vma_unbind+0x60a/0xa10 [i915]
> > > [  255.666734]i915_gem_object_unbind+0xa0/0x130 [i915]
> > > [  255.666751]i915_gem_shrink+0x2d1/0x5d0 [i915]
> > > [  255.666767]i915_drop_caches_set+0x92/0x190 [i915]
> > > [  255.666770]simple_attr_write+0xab/0xc0
> > > [  255.666772]full_proxy_write+0x4b/0x70
> > > [  255.666774]__vfs_write+0x1e/0x130
> > > [  255.666776]vfs_write+0xbd/0x1b0
> > > [  255.666778]SyS_write+0x40/0xa0
> > > [  255.666779]do_syscall_64+0x65/0x1a0
> > > [  255.666781]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > [  255.666783] 
> > >-> #0 (fs_reclaim){+.+.}:
> > > [  255.666786]fs_reclaim_acquire.part.12+0x24/0x30
> > > [  255.666788]__alloc_pages_nodemask+0x1f1/0x11d0
> > > [  255.666790]__get_free_pages+0x9/0x40
> > > [  255.666792]__pud_alloc+0x25/0xb0
> > > [  255.666794]copy_page_range+0xa75/0xaf0
> > > [  255.666796]copy_process.part.7+0x1267/0x1d90
> > > [  255.666798]_do_fork+0xc0/0x6b0
> > > [  255.666800]do_syscall_64+0x65/0x1a0
> > > [  255.666801]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > > [  255.666803] 
> > >other info that might help us debug this:
> > > 
> > > [  255.666805] Chain exists of:
> > >  fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem
> > > 
> > > [  255.666809]  Possible unsafe locking scenario:
> > > 
> > > [  255.666811]        CPU0                    CPU1
> > > [  255.666812]
> > > [  255.666814]   lock(&mn->sem);
> > > [  255.666815]                          lock(&mapping->i_mmap_rwsem);
> > > [  255.666817]                          lock(&mn->sem);
> > > [  255.666819]   lock(fs_reclaim);

Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-27 Thread Chris Wilson
Quoting Daniel Vetter (2018-03-27 08:01:17)
> On Mon, Mar 26, 2018 at 09:08:33PM +0100, Chris Wilson wrote:
> > Quoting Patchwork (2018-03-26 17:53:44)
> > > Test gem_userptr_blits:
> > > Subgroup coherency-unsync:
> > > pass   -> INCOMPLETE (shard-hsw)
> > 
> > Forgot that obj->userptr.mn may not exist.
> > 
> > > Subgroup dmabuf-sync:
> > > pass   -> DMESG-WARN (shard-hsw)
> > 
> > But this is the tricky lockdep one, warning of the recursion from gup
> > into mmu_invalidate_range, i.e.
> > 
> > down_read(&i915_mmu_notifier->sem);
> > down_read(&mm_struct->mmap_sem);
> >   gup();
> >   down_write(&i915_mmu_notifier->sem);
> > 
> > That seems a genuine deadlock... So I wonder how we managed to get a
> > lockdep splat and not a dead machine. Maybe gup never triggers the
> > recursion for our set of flags? Hmm.
> 
> Coffee starting to kick in. If we gup a range it's likely the mm won't
> kick out the same range, but something else. I guess we'd need a really
> huge userptr bo which can't fit into core completely to actually have a
> reliable chance at triggering this. Would probably deadlock the box :-/
> 
> I think Jerome's recommendation is the sequence counter stuff from kvm,
> plus retrying forever on the gup side. That would convert the same
> deadlock into a livelock, but well can't have it all :-)

Pre-coffee state also thinks it would trigger the second fs_reclaim
lockdep if it was sufficiently annotated.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
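One way to read the "sufficiently annotated" remark above: fs_reclaim is a lockdep-only pseudo-lock that is "acquired" around allocations which may enter reclaim, so a &mn->sem -> fs_reclaim ordering would be recorded on the first allocation made with the notifier lock held, without needing real memory pressure. Below is only a rough userspace sketch of that idea, not the kernel interface; the names notifier_sem and alloc_annotated() and the held-flag are purely illustrative.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t notifier_sem = PTHREAD_MUTEX_INITIALIZER;
static bool notifier_sem_held;  /* crude stand-in for lockdep's held-lock list */

/* Stand-in for the fs_reclaim annotation: record the ordering, do no reclaim. */
static void fs_reclaim_annotate(void)
{
    if (notifier_sem_held)
        fprintf(stderr,
                "notifier_sem -> fs_reclaim recorded: completes the reported cycle\n");
}

/* Every "GFP_KERNEL-like" allocation passes through the annotation. */
static void *alloc_annotated(size_t size)
{
    fs_reclaim_annotate();
    return malloc(size);
}

int main(void)
{
    pthread_mutex_lock(&notifier_sem);
    notifier_sem_held = true;

    free(alloc_annotated(64));  /* flagged even though no reclaim ran */

    notifier_sem_held = false;
    pthread_mutex_unlock(&notifier_sem);
    return 0;
}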


Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-27 Thread Chris Wilson
Quoting Daniel Vetter (2018-03-27 07:48:00)
> On Mon, Mar 26, 2018 at 11:38:55PM +0100, Chris Wilson wrote:
> > Quoting Chris Wilson (2018-03-26 21:08:33)
> > > Quoting Patchwork (2018-03-26 17:53:44)
> > > > Test gem_userptr_blits:
> > > > Subgroup coherency-unsync:
> > > > pass   -> INCOMPLETE (shard-hsw)
> > > 
> > > Forgot that obj->userptr.mn may not exist.
> > > 
> > > > Subgroup dmabuf-sync:
> > > > pass   -> DMESG-WARN (shard-hsw)
> > > 
> > > But this is the tricky lockdep one, warning of the recursion from gup
> > > into mmu_invalidate_range, i.e.
> > > 
> > > down_read(&i915_mmu_notifier->sem);
> > > down_read(&mm_struct->mmap_sem);
> > > gup();
> > > down_write(&i915_mmu_notifier->sem);
> > > 
> > > That seems a genuine deadlock... So I wonder how we managed to get a
> > > lockdep splat and not a dead machine. Maybe gup never triggers the
> > > recursion for our set of flags? Hmm.
> > 
> > In another universe, CI found
> > 
> > [  255.666496] ==
> > [  255.666498] WARNING: possible circular locking dependency detected
> > [  255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G U  W   
> > [  255.666502] --
> > [  255.666503] gem_userptr_bli/4794 is trying to acquire lock:
> > [  255.666505]  (fs_reclaim){+.+.}, at: [] 
> > fs_reclaim_acquire.part.12+0x0/0x30
> > [  255.666510] 
> >but task is already holding lock:
> > [  255.666512]  (&mn->sem){+.+.}, at: [<7c59ba79>] 
> > i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> > [  255.666553] 
> >which lock already depends on the new lock.
> > 
> > [  255.666555] 
> >the existing dependency chain (in reverse order) is:
> > [  255.666557] 
> >-> #2 (&mn->sem){+.+.}:
> > [  255.666578]i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 
> > [i915]
> > [  255.666581]__mmu_notifier_invalidate_range_start+0x73/0xb0
> > [  255.666584]zap_page_range_single+0xcc/0xe0
> > [  255.666586]unmap_mapping_pages+0xd4/0x110
> > [  255.06]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > [  255.25]i915_vma_unbind+0x60a/0xa10 [i915]
> > [  255.44]i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
> > [  255.62]i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
> > [  255.65]drm_ioctl_kernel+0x60/0xa0
> > [  255.67]drm_ioctl+0x27e/0x320
> > [  255.69]do_vfs_ioctl+0x8a/0x670
> > [  255.70]SyS_ioctl+0x36/0x70
> > [  255.72]do_syscall_64+0x65/0x1a0
> > [  255.75]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > [  255.76] 
> >-> #1 (&mapping->i_mmap_rwsem){}:
> > [  255.80]unmap_mapping_pages+0x3d/0x110
> > [  255.98]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> > [  255.666716]i915_vma_unbind+0x60a/0xa10 [i915]
> > [  255.666734]i915_gem_object_unbind+0xa0/0x130 [i915]
> > [  255.666751]i915_gem_shrink+0x2d1/0x5d0 [i915]
> > [  255.666767]i915_drop_caches_set+0x92/0x190 [i915]
> > [  255.666770]simple_attr_write+0xab/0xc0
> > [  255.666772]full_proxy_write+0x4b/0x70
> > [  255.666774]__vfs_write+0x1e/0x130
> > [  255.666776]vfs_write+0xbd/0x1b0
> > [  255.666778]SyS_write+0x40/0xa0
> > [  255.666779]do_syscall_64+0x65/0x1a0
> > [  255.666781]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > [  255.666783] 
> >-> #0 (fs_reclaim){+.+.}:
> > [  255.666786]fs_reclaim_acquire.part.12+0x24/0x30
> > [  255.666788]__alloc_pages_nodemask+0x1f1/0x11d0
> > [  255.666790]__get_free_pages+0x9/0x40
> > [  255.666792]__pud_alloc+0x25/0xb0
> > [  255.666794]copy_page_range+0xa75/0xaf0
> > [  255.666796]copy_process.part.7+0x1267/0x1d90
> > [  255.666798]_do_fork+0xc0/0x6b0
> > [  255.666800]do_syscall_64+0x65/0x1a0
> > [  255.666801]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > [  255.666803] 
> >other info that might help us debug this:
> > 
> > [  255.666805] Chain exists of:
> >  fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem
> > 
> > [  255.666809]  Possible unsafe locking scenario:
> > 
> > [  255.666811]        CPU0                    CPU1
> > [  255.666812]
> > [  255.666814]   lock(&mn->sem);
> > [  255.666815]                          lock(&mapping->i_mmap_rwsem);
> > [  255.666817]                          lock(&mn->sem);
> > [  255.666819]   lock(fs_reclaim);
> > [  255.666821] 
> > 
> > So a shrinker deadlock. That doesn't look easy to wriggle out of, as we
> > have a random chunk of code that's between invalidate_range_start and
> > invalidate_range_end.
> 
> Christian König said something like "with this design you can't allocate
> anything while holding locks you might need from the mmu notifier".

Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-27 Thread Daniel Vetter
On Mon, Mar 26, 2018 at 09:08:33PM +0100, Chris Wilson wrote:
> Quoting Patchwork (2018-03-26 17:53:44)
> > Test gem_userptr_blits:
> > Subgroup coherency-unsync:
> > pass   -> INCOMPLETE (shard-hsw)
> 
> Forgot that obj->userptr.mn may not exist.
> 
> > Subgroup dmabuf-sync:
> > pass   -> DMESG-WARN (shard-hsw)
> 
> But this is the tricky lockdep one, warning of the recursion from gup
> into mmu_invalidate_range, i.e.
> 
> down_read(&i915_mmu_notifier->sem);
> down_read(&mm_struct->mmap_sem);
>   gup();
>   down_write(&i915_mmu_notifier->sem);
> 
> That seems a genuine deadlock... So I wonder how we managed to get a
> lockdep splat and not a dead machine. Maybe gup never triggers the
> recursion for our set of flags? Hmm.

Coffee starting to kick in. If we gup a range it's likely the mm won't
kick out the same range, but something else. I guess we'd need a really
huge userptr bo which can't fit into core completely to actually have a
reliable chance at triggering this. Would probably deadlock the box :-/

I think Jerome's recommendation is the sequence counter stuff from kvm,
plus retrying forever on the gup side. That would convert the same
deadlock into a livelock, but well can't have it all :-) And I think once
you've killed the task the gup worker hopefully realizes it's wasting time
and gives up.

For the kvm stuff: Look at #intel-gfx scrollback, we discussed all the
necessary bits. Plus Jerome showed some new helpers that would avoid the
hand-rolling.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
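A rough, self-contained sketch of the sequence-counter-and-retry idea mentioned above, loosely modeled on KVM's mmu_notifier_seq/mmu_notifier_count scheme: the notifier callbacks bump counters instead of holding a lock across gup(), and the pinning side samples the counter first and retries if an invalidation raced with it. All names and the structure are assumptions for illustration only, not the eventual i915 code.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_ulong mmu_seq;              /* bumped when an invalidation completes */
static atomic_int invalidations_running;  /* non-zero while one is in flight */

static void invalidate_range_start(void)
{
    atomic_fetch_add(&invalidations_running, 1);
    /* ...GPU mappings for the range would be torn down here... */
}

static void invalidate_range_end(void)
{
    atomic_fetch_add(&mmu_seq, 1);
    atomic_fetch_sub(&invalidations_running, 1);
}

/* Did anything invalidate since we sampled @seq? */
static bool need_retry(unsigned long seq)
{
    return atomic_load(&invalidations_running) ||
           atomic_load(&mmu_seq) != seq;
}

static void pin_user_pages(void)
{
    unsigned long seq;

    do {
        seq = atomic_load(&mmu_seq);
        /* gup() would run here with no notifier lock held, so it can
         * recurse into invalidate_range_start() without deadlocking;
         * on a race we drop the pages and retry.  A real worker would
         * also bail out here if the task has been killed. */
    } while (need_retry(seq));

    printf("pinned against seq %lu\n", seq);
}

int main(void)
{
    invalidate_range_start();
    invalidate_range_end();
    pin_user_pages();  /* no retry needed: nothing is in flight any more */
    return 0;
}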


Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-27 Thread Daniel Vetter
On Mon, Mar 26, 2018 at 11:38:55PM +0100, Chris Wilson wrote:
> Quoting Chris Wilson (2018-03-26 21:08:33)
> > Quoting Patchwork (2018-03-26 17:53:44)
> > > Test gem_userptr_blits:
> > > Subgroup coherency-unsync:
> > > pass   -> INCOMPLETE (shard-hsw)
> > 
> > Forgot that obj->userptr.mn may not exist.
> > 
> > > Subgroup dmabuf-sync:
> > > pass   -> DMESG-WARN (shard-hsw)
> > 
> > But this is the tricky lockdep one, warning of the recursion from gup
> > into mmu_invalidate_range, i.e.
> > 
> > down_read(&i915_mmu_notifier->sem);
> > down_read(&mm_struct->mmap_sem);
> > gup();
> > down_write(&i915_mmu_notifier->sem);
> > 
> > That seems a genuine deadlock... So I wonder how we managed to get a
> > lockdep splat and not a dead machine. Maybe gup never triggers the
> > recursion for our set of flags? Hmm.
> 
> In another universe, CI found
> 
> [  255.666496] ==
> [  255.666498] WARNING: possible circular locking dependency detected
> [  255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G U  W   
> [  255.666502] --
> [  255.666503] gem_userptr_bli/4794 is trying to acquire lock:
> [  255.666505]  (fs_reclaim){+.+.}, at: [] 
> fs_reclaim_acquire.part.12+0x0/0x30
> [  255.666510] 
>but task is already holding lock:
> [  255.666512]  (&mn->sem){+.+.}, at: [<7c59ba79>] 
> i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
> [  255.666553] 
>which lock already depends on the new lock.
> 
> [  255.666555] 
>the existing dependency chain (in reverse order) is:
> [  255.666557] 
>-> #2 (&mn->sem){+.+.}:
> [  255.666578]i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 
> [i915]
> [  255.666581]__mmu_notifier_invalidate_range_start+0x73/0xb0
> [  255.666584]zap_page_range_single+0xcc/0xe0
> [  255.666586]unmap_mapping_pages+0xd4/0x110
> [  255.06]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> [  255.25]i915_vma_unbind+0x60a/0xa10 [i915]
> [  255.44]i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
> [  255.62]i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
> [  255.65]drm_ioctl_kernel+0x60/0xa0
> [  255.67]drm_ioctl+0x27e/0x320
> [  255.69]do_vfs_ioctl+0x8a/0x670
> [  255.70]SyS_ioctl+0x36/0x70
> [  255.72]do_syscall_64+0x65/0x1a0
> [  255.75]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [  255.76] 
>-> #1 (&mapping->i_mmap_rwsem){}:
> [  255.80]unmap_mapping_pages+0x3d/0x110
> [  255.98]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
> [  255.666716]i915_vma_unbind+0x60a/0xa10 [i915]
> [  255.666734]i915_gem_object_unbind+0xa0/0x130 [i915]
> [  255.666751]i915_gem_shrink+0x2d1/0x5d0 [i915]
> [  255.666767]i915_drop_caches_set+0x92/0x190 [i915]
> [  255.666770]simple_attr_write+0xab/0xc0
> [  255.666772]full_proxy_write+0x4b/0x70
> [  255.666774]__vfs_write+0x1e/0x130
> [  255.666776]vfs_write+0xbd/0x1b0
> [  255.666778]SyS_write+0x40/0xa0
> [  255.666779]do_syscall_64+0x65/0x1a0
> [  255.666781]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [  255.666783] 
>-> #0 (fs_reclaim){+.+.}:
> [  255.666786]fs_reclaim_acquire.part.12+0x24/0x30
> [  255.666788]__alloc_pages_nodemask+0x1f1/0x11d0
> [  255.666790]__get_free_pages+0x9/0x40
> [  255.666792]__pud_alloc+0x25/0xb0
> [  255.666794]copy_page_range+0xa75/0xaf0
> [  255.666796]copy_process.part.7+0x1267/0x1d90
> [  255.666798]_do_fork+0xc0/0x6b0
> [  255.666800]do_syscall_64+0x65/0x1a0
> [  255.666801]entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [  255.666803] 
>other info that might help us debug this:
> 
> [  255.666805] Chain exists of:
>  fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem
> 
> [  255.666809]  Possible unsafe locking scenario:
> 
> [  255.666811]        CPU0                    CPU1
> [  255.666812]
> [  255.666814]   lock(&mn->sem);
> [  255.666815]                          lock(&mapping->i_mmap_rwsem);
> [  255.666817]                          lock(&mn->sem);
> [  255.666819]   lock(fs_reclaim);
> [  255.666821] 
> 
> So a shrinker deadlock. That doesn't look easy to wriggle out of, as we
> have a random chunk of code that's between invalidate_range_start and
> invalidate_range_end.

Christian König said something like "with this design you can't allocate
anything while holding locks you might need from the mmu notifier".
Because reclaim eats into the mmu notifiers.

But hey it's before coffee, so probably best you just ignore me :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation

Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-26 Thread Chris Wilson
Quoting Chris Wilson (2018-03-26 21:08:33)
> Quoting Patchwork (2018-03-26 17:53:44)
> > Test gem_userptr_blits:
> > Subgroup coherency-unsync:
> > pass   -> INCOMPLETE (shard-hsw)
> 
> Forgot that obj->userptr.mn may not exist.
> 
> > Subgroup dmabuf-sync:
> > pass   -> DMESG-WARN (shard-hsw)
> 
> But this is the tricky lockdep one, warning of the recursion from gup
> into mmu_invalidate_range, i.e.
> 
> down_read(&i915_mmu_notifier->sem);
> down_read(&mm_struct->mmap_sem);
> gup();
> down_write(&i915_mmu_notifier->sem);
> 
> That seems a genuine deadlock... So I wonder how we managed to get a
> lockdep splat and not a dead machine. Maybe gup never triggers the
> recursion for our set of flags? Hmm.

In another universe, CI found

[  255.666496] ==
[  255.666498] WARNING: possible circular locking dependency detected
[  255.666500] 4.16.0-rc6-CI-Trybot_1944+ #1 Tainted: G U  W   
[  255.666502] --
[  255.666503] gem_userptr_bli/4794 is trying to acquire lock:
[  255.666505]  (fs_reclaim){+.+.}, at: [] 
fs_reclaim_acquire.part.12+0x0/0x30
[  255.666510] 
   but task is already holding lock:
[  255.666512]  (&mn->sem){+.+.}, at: [<7c59ba79>] 
i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 [i915]
[  255.666553] 
   which lock already depends on the new lock.

[  255.666555] 
   the existing dependency chain (in reverse order) is:
[  255.666557] 
   -> #2 (&mn->sem){+.+.}:
[  255.666578]i915_gem_userptr_mn_invalidate_range_start+0x3e/0x1a0 
[i915]
[  255.666581]__mmu_notifier_invalidate_range_start+0x73/0xb0
[  255.666584]zap_page_range_single+0xcc/0xe0
[  255.666586]unmap_mapping_pages+0xd4/0x110
[  255.06]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
[  255.25]i915_vma_unbind+0x60a/0xa10 [i915]
[  255.44]i915_gem_object_set_tiling+0xf6/0x5b0 [i915]
[  255.62]i915_gem_set_tiling_ioctl+0x262/0x2f0 [i915]
[  255.65]drm_ioctl_kernel+0x60/0xa0
[  255.67]drm_ioctl+0x27e/0x320
[  255.69]do_vfs_ioctl+0x8a/0x670
[  255.70]SyS_ioctl+0x36/0x70
[  255.72]do_syscall_64+0x65/0x1a0
[  255.75]entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  255.76] 
   -> #1 (&mapping->i_mmap_rwsem){}:
[  255.80]unmap_mapping_pages+0x3d/0x110
[  255.98]i915_vma_revoke_mmap+0x7e/0x1c0 [i915]
[  255.666716]i915_vma_unbind+0x60a/0xa10 [i915]
[  255.666734]i915_gem_object_unbind+0xa0/0x130 [i915]
[  255.666751]i915_gem_shrink+0x2d1/0x5d0 [i915]
[  255.666767]i915_drop_caches_set+0x92/0x190 [i915]
[  255.666770]simple_attr_write+0xab/0xc0
[  255.666772]full_proxy_write+0x4b/0x70
[  255.666774]__vfs_write+0x1e/0x130
[  255.666776]vfs_write+0xbd/0x1b0
[  255.666778]SyS_write+0x40/0xa0
[  255.666779]do_syscall_64+0x65/0x1a0
[  255.666781]entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  255.666783] 
   -> #0 (fs_reclaim){+.+.}:
[  255.666786]fs_reclaim_acquire.part.12+0x24/0x30
[  255.666788]__alloc_pages_nodemask+0x1f1/0x11d0
[  255.666790]__get_free_pages+0x9/0x40
[  255.666792]__pud_alloc+0x25/0xb0
[  255.666794]copy_page_range+0xa75/0xaf0
[  255.666796]copy_process.part.7+0x1267/0x1d90
[  255.666798]_do_fork+0xc0/0x6b0
[  255.666800]do_syscall_64+0x65/0x1a0
[  255.666801]entry_SYSCALL_64_after_hwframe+0x42/0xb7
[  255.666803] 
   other info that might help us debug this:

[  255.666805] Chain exists of:
 fs_reclaim --> &mapping->i_mmap_rwsem --> &mn->sem

[  255.666809]  Possible unsafe locking scenario:

[  255.666811]        CPU0                    CPU1
[  255.666812]
[  255.666814]   lock(&mn->sem);
[  255.666815]                          lock(&mapping->i_mmap_rwsem);
[  255.666817]                          lock(&mn->sem);
[  255.666819]   lock(fs_reclaim);
[  255.666821] 

So a shrinker deadlock. That doesn't look easy to wriggle out of, as we
have a random chunk of code that's between invalidate_range_start and
invalidate_range_end.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
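The reported cycle can be modeled in userspace with three locks and two threads: one thread follows the shrinker-path order fs_reclaim -> i_mmap_rwsem -> mn->sem, the other holds mn->sem and then "allocates", i.e. takes fs_reclaim. In the kernel fs_reclaim is a lockdep annotation rather than a lock you block on, so treat this only as a sketch of the ordering; the names are illustrative, a barrier forces the bad interleaving, and trylock is used so the demo reports the inversion instead of hanging.

#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t fs_reclaim   = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t i_mmap_rwsem = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mn_sem       = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t step;

/* drop_caches/shrinker path: reclaim -> unmap_mapping_pages -> notifier */
static void *shrinker_path(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&fs_reclaim);
    pthread_mutex_lock(&i_mmap_rwsem);
    pthread_barrier_wait(&step);        /* other thread now holds mn_sem */
    if (pthread_mutex_trylock(&mn_sem))
        printf("shrinker path: mn->sem held elsewhere, would block here\n");
    else
        pthread_mutex_unlock(&mn_sem);
    pthread_barrier_wait(&step);        /* keep locks held for the other trylock */
    pthread_mutex_unlock(&i_mmap_rwsem);
    pthread_mutex_unlock(&fs_reclaim);
    return NULL;
}

/* invalidate_range_start() that then allocates, i.e. may enter reclaim */
static void *notifier_path(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mn_sem);
    pthread_barrier_wait(&step);        /* other thread now holds fs_reclaim */
    if (pthread_mutex_trylock(&fs_reclaim))
        printf("notifier path: fs_reclaim held elsewhere, would block here\n");
    else
        pthread_mutex_unlock(&fs_reclaim);
    pthread_barrier_wait(&step);
    pthread_mutex_unlock(&mn_sem);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_barrier_init(&step, NULL, 2);
    pthread_create(&a, NULL, shrinker_path, NULL);
    pthread_create(&b, NULL, notifier_path, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}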


Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-26 Thread Chris Wilson
Quoting Patchwork (2018-03-26 17:53:44)
> Test gem_userptr_blits:
> Subgroup coherency-unsync:
> pass   -> INCOMPLETE (shard-hsw)

Forgot that obj->userptr.mn may not exist.

> Subgroup dmabuf-sync:
> pass   -> DMESG-WARN (shard-hsw)

But this is the tricky lockdep one, warning of the recursion from gup
into mmu_invalidate_range, i.e.

down_read(&i915_mmu_notifier->sem);
down_read(&mm_struct->mmap_sem);
gup();
down_write(&i915_mmu_notifier->sem);

That seems a genuine deadlock... So I wonder how we managed to get a
lockdep splat and not a dead machine. Maybe gup never triggers the
recursion for our set of flags? Hmm.
-Chris

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
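For illustration, a minimal userspace model of the recursion described in this mail: a task already holding the notifier lock for read reaches a path that wants the same lock for write. This is only a sketch; notifier_sem, invalidate_range_start() and gup() merely mimic the pseudocode above rather than any i915 code, and trywrlock is used so the demo reports the problem instead of hanging (build with gcc -pthread).

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_rwlock_t notifier_sem = PTHREAD_RWLOCK_INITIALIZER;

/* Stand-in for the mmu notifier callback taking the lock for write. */
static void invalidate_range_start(void)
{
    int ret = pthread_rwlock_trywrlock(&notifier_sem);

    if (ret)
        printf("writer blocked (%s): with a blocking down_write() this is the deadlock\n",
               strerror(ret));
    else
        pthread_rwlock_unlock(&notifier_sem);
}

/* Stand-in for get_user_pages() ending up in the notifier for an
 * overlapping range. */
static void gup(void)
{
    invalidate_range_start();
}

int main(void)
{
    pthread_rwlock_rdlock(&notifier_sem);  /* down_read() on the notifier rwsem */
    gup();                                 /* recursion back into the notifier  */
    pthread_rwlock_unlock(&notifier_sem);
    return 0;
}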


[Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore

2018-03-26 Thread Patchwork
== Series Details ==

Series: drm/i915/userptr: Wrap mmu_notifier inside its own rw_semaphore
URL   : https://patchwork.freedesktop.org/series/40676/
State : failure

== Summary ==

 Possible new issues:

Test drv_module_reload:
Subgroup basic-no-display:
pass   -> DMESG-WARN (shard-snb)
Test gem_tiled_swapping:
Subgroup non-threaded:
pass   -> DMESG-WARN (shard-hsw)
Test gem_userptr_blits:
Subgroup coherency-unsync:
pass   -> INCOMPLETE (shard-hsw)
pass   -> INCOMPLETE (shard-snb)
Subgroup dmabuf-sync:
pass   -> DMESG-WARN (shard-hsw)
Subgroup dmabuf-unsync:
pass   -> DMESG-WARN (shard-snb)
Subgroup forbidden-operations:
pass   -> DMESG-FAIL (shard-apl)
pass   -> DMESG-FAIL (shard-hsw)
pass   -> DMESG-FAIL (shard-snb)
Subgroup invalid-gtt-mapping:
pass   -> INCOMPLETE (shard-apl)
pass   -> INCOMPLETE (shard-hsw)
pass   -> INCOMPLETE (shard-snb)
Subgroup invalid-null-pointer:
pass   -> INCOMPLETE (shard-apl)
pass   -> INCOMPLETE (shard-hsw)
pass   -> INCOMPLETE (shard-snb)
Subgroup map-fixed-invalidate-busy:
pass   -> DMESG-WARN (shard-hsw)
pass   -> DMESG-WARN (shard-snb)
Subgroup map-fixed-invalidate-busy-gup:
pass   -> DMESG-WARN (shard-apl)
pass   -> DMESG-WARN (shard-hsw)
pass   -> DMESG-WARN (shard-snb)
Subgroup map-fixed-invalidate-gup:
pass   -> DMESG-WARN (shard-snb)
Subgroup map-fixed-invalidate-overlap-busy:
pass   -> DMESG-WARN (shard-apl)
Subgroup map-fixed-invalidate-overlap-gup:
pass   -> DMESG-WARN (shard-apl)
pass   -> DMESG-WARN (shard-hsw)
pass   -> DMESG-WARN (shard-snb)
Subgroup process-exit-busy:
pass   -> DMESG-WARN (shard-hsw)
pass   -> DMESG-WARN (shard-snb)
Subgroup process-exit-gtt-busy:
pass   -> DMESG-WARN (shard-hsw)
pass   -> DMESG-WARN (shard-snb)
Subgroup sync-unmap-cycles:
pass   -> DMESG-WARN (shard-apl)
Subgroup unsync-unmap:
pass   -> INCOMPLETE (shard-apl)
pass   -> INCOMPLETE (shard-hsw)
pass   -> INCOMPLETE (shard-snb)
Subgroup unsync-unmap-after-close:
pass   -> INCOMPLETE (shard-apl)
pass   -> INCOMPLETE (shard-hsw)
pass   -> INCOMPLETE (shard-snb)
Subgroup unsync-unmap-cycles:
pass   -> INCOMPLETE (shard-apl)
pass   -> INCOMPLETE (shard-hsw)
pass   -> INCOMPLETE (shard-snb)
Test kms_addfb_basic:
Subgroup unused-handle:
pass   -> INCOMPLETE (shard-apl)

 Known issues:

Test kms_flip:
Subgroup 2x-dpms-vs-vblank-race:
fail   -> PASS   (shard-hsw) fdo#103060 +2
Subgroup 2x-flip-vs-expired-vblank-interruptible:
fail   -> PASS   (shard-hsw) fdo#102887
Subgroup 2x-flip-vs-wf_vblank-interruptible:
pass   -> FAIL   (shard-hsw) fdo#100368 +1
Test kms_rotation_crc:
Subgroup sprite-rotation-180:
fail   -> PASS   (shard-snb) fdo#103925
Subgroup sprite-rotation-270:
pass   -> FAIL   (shard-apl) fdo#103356

fdo#103060 https://bugs.freedesktop.org/show_bug.cgi?id=103060
fdo#102887 https://bugs.freedesktop.org/show_bug.cgi?id=102887
fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
fdo#103925 https://bugs.freedesktop.org/show_bug.cgi?id=103925
fdo#103356 https://bugs.freedesktop.org/show_bug.cgi?id=103356

shard-apl  total:3262 pass:1693 dwarn:5   dfail:1   fail:7   skip:1549  time:10290s
shard-hsw  total:3235 pass:1638 dwarn:8   dfail:1   fail:2   skip:1579  time:9745s
shard-snb  total:3235 pass:1254 dwarn:9   dfail:1   fail:3   skip:1962  time:5613s
Blacklisted hosts:
shard-kbl  total:3274 pass:1806 dwarn:13  dfail:2   fail:9   skip:1438  time:7933s

== Logs ==

For more details see: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8492/shards.html
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx