Re: [Intel-gfx] ✗ Fi.CI.BAT: failure for drm/i915: implement internal workqueues (rev3)

Coelho, Luciano Tue, 06 Jun 2023 07:31:08 -0700

On Tue, 2023-06-06 at 14:33 +0100, Tvrtko Ursulin wrote:
> On 06/06/2023 12:06, Coelho, Luciano wrote:
> > On Tue, 2023-06-06 at 11:06 +0100, Tvrtko Ursulin wrote:
> > > On 05/06/2023 16:06, Jani Nikula wrote:
> > > > On Wed, 31 May 2023, Patchwork <patchw...@emeril.freedesktop.org> wrote:
> > > > > #### Possible regressions ####
> > > > > 
> > > > >     * igt@gem_close_race@basic-process:
> > > > >       - fi-blb-e6850:       [PASS][1] -> [ABORT][2]
> > > > >      [1]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-blb-e6850/igt@gem_close_r...@basic-process.html
> > > > >      [2]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-blb-e6850/igt@gem_close_r...@basic-process.html
> > > > >       - fi-hsw-4770:        [PASS][3] -> [ABORT][4]
> > > > >      [3]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-hsw-4770/igt@gem_close_r...@basic-process.html
> > > > >      [4]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-hsw-4770/igt@gem_close_r...@basic-process.html
> > > > >       - fi-elk-e7500:       [PASS][5] -> [ABORT][6]
> > > > >      [5]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-elk-e7500/igt@gem_close_r...@basic-process.html
> > > > >      [6]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-elk-e7500/igt@gem_close_r...@basic-process.html
> > > > > 
> > > > >     * igt@i915_selftest@live@evict:
> > > > >       - bat-adlp-9:         [PASS][7] -> [ABORT][8]
> > > > >      [7]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-adlp-9/igt@i915_selftest@l...@evict.html
> > > > >      [8]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-adlp-9/igt@i915_selftest@l...@evict.html
> > > > >       - bat-rpls-2:         [PASS][9] -> [ABORT][10]
> > > > >      [9]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-rpls-2/igt@i915_selftest@l...@evict.html
> > > > >      [10]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-rpls-2/igt@i915_selftest@l...@evict.html
> > > > >       - bat-adlm-1:         [PASS][11] -> [ABORT][12]
> > > > >      [11]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-adlm-1/igt@i915_selftest@l...@evict.html
> > > > >      [12]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-adlm-1/igt@i915_selftest@l...@evict.html
> > > > >       - bat-rpls-1:         [PASS][13] -> [ABORT][14]
> > > > >      [13]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-rpls-1/igt@i915_selftest@l...@evict.html
> > > > >      [14]: 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-rpls-1/igt@i915_selftest@l...@evict.html
> > > > 
> > > > This still fails consistently, I have no clue why, and the above aren't
> > > > even remotely related to display.
> > > > 
> > > > What now? Tvrtko?
> > > 
> > > Hmm..
> > > 
> > > <4> [46.782321] Chain exists of:
> > >     (wq_completion)i915 --> (work_completion)(&i915->mm.free_work) --> 
> > > &vm->mutex
> > > <4> [46.782329]  Possible unsafe locking scenario:
> > > <4> [46.782332]        CPU0                    CPU1
> > > <4> [46.782334]        ----                    ----
> > > <4> [46.782337]   lock(&vm->mutex);
> > > <4> [46.782340]                                
> > > lock((work_completion)(&i915->mm.free_work));
> > > <4> [46.782344]                                lock(&vm->mutex);
> > > <4> [46.782348]   lock((wq_completion)i915);
> > > 
> > > 
> > > "(wq_completion)i915"
> > > 
> > > So it's not about the new wq even. Perhaps it is this hunk:
> > > 
> > > --- a/drivers/gpu/drm/i915/intel_wakeref.c
> > > +++ b/drivers/gpu/drm/i915/intel_wakeref.c
> > > @@ -75,7 +75,7 @@ void __intel_wakeref_put_last(struct intel_wakeref *wf, 
> > > unsigned long flags)
> > >    
> > >           /* Assume we are not in process context and so cannot sleep. */
> > >           if (flags & INTEL_WAKEREF_PUT_ASYNC || 
> > > !mutex_trylock(&wf->mutex)) {
> > > -         mod_delayed_work(system_wq, &wf->work,
> > > +         mod_delayed_work(wf->i915->wq, &wf->work,
> > > 
> > > Transformation from this patch otherwise is system_wq with the new 
> > > unordered wq, so I'd try that first.
> > 
> > Indeed this seems to be exactly the block that is causing the issue.  I
> > was sort of bisecting through all these changes and reverting this one
> > prevents the lockdep splat from happening...
> > 
> > So there's something that needs to be synced with the system_wq here,
> > but what? I need to dig into it.
> 
> AFAICT it is saying that i915->mm.free_work and engine->wakeref.work 
> must not be on the same ordered wq. Otherwise execbuf call trace 
> flushing under vm->mutex can deadlock against the free worker trying to 
> grab vm->mutex. If engine->wakeref.work is on a separate unordered wq it 
> would be safe since then execution will not be serialized with the 
> free_work. So just using the new i915->unordered_wq for this hunk should 
> work.


Ah, great, thanks for the insight! I'll try it now and see how it goes.

--
Cheers,
Luca.

Re: [Intel-gfx] ✗ Fi.CI.BAT: failure for drm/i915: implement internal workqueues (rev3)

Reply via email to