On 18/09/2019 17:10, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-09-18 16:54:36)

On 17/09/2019 16:17, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-09-17 15:59:25)

On 16/09/2019 12:38, Chris Wilson wrote:
When using virtual engines, the rq->engine is not stable until we hold
the engine->active.lock (as the virtual engine may be exchanged with the
sibling). Since commit 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
we may retire a request concurrently with resubmitting it to HW, we need
to be extra careful to verify we are holding the correct lock for the
request's active list. This is similar to the issue we saw with
rescheduling the virtual requests, see sched_lock_engine().

Or else:

<4> [876.736126] list_add corruption. prev->next should be next (ffff8883f931a1f8), but was dead000000000100. (prev=ffff888361ffa610).
...
<4> [876.736415] list_del corruption. prev->next should be ffff888361ffca10, but was ffff88840ac2c730

Yes. Preempt-to-busy introduces a window where the request is still on
HW but we have already returned it to the submission queue. We catch up
with the HW on the next process_csb, but it may have completed the
request in the meantime (the HW is just not allowed to advance beyond
the subsequent breadcrumb, and so is prevented from overtaking our
knowledge of RING_TAIL; that is how we avoid telling the HW to go
"backwards").

Would it be sufficient to do:

    engine = READ_ONCE(rq->engine);
    spin_lock(&engine->active.lock);
    list_del(...);
    spin_unlock(&engine->active.lock);

To ensure the same engine is used? Although the oops is not about the
spinlock but about list corruption. How does the list get corrupted,
though? list_del does not care which list the request is on... If it is
really key to hold the correct lock, then why is it enough to re-check
the engine after taking the lock? Why couldn't rq->engine change again
under the lock? rq->engine does get updated under this very lock, no?

Don't forget that list_del() changes the list around the entry:

list_del(struct list_head *list) {
        list->prev->next = list->next;
        list->next->prev = list->prev;
}

so deleting an entry writes to its neighbours, which belong to whatever
list the entry is currently on. (The kernel's list_del() then also
poisons the entry's own pointers; that dead000000000100 in the oops
above is LIST_POISON1.)

rq->engine can't change while we hold the real engine's active.lock, as
the assignment of rq->engine (to either the virtual or the real engine)
is only ever made under that real->active.lock.
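
So the ploy boils down to: lock whichever engine we sampled, then
re-check and chase rq->engine until the two agree. A minimal sketch of
that loop (the helper name here is made up, only the loop matters):

static struct intel_engine_cs *
lock_request_engine(struct i915_request *rq)
{
        struct intel_engine_cs *locked, *engine;

        /* rq->engine may be swapped with a sibling until we hold the lock */
        locked = READ_ONCE(rq->engine);
        spin_lock(&locked->active.lock);

        while ((engine = READ_ONCE(rq->engine)) != locked) {
                /* the request migrated before we took the lock; chase it */
                spin_unlock(&locked->active.lock);
                locked = engine;
                spin_lock(&locked->active.lock);
        }

        /* rq->engine is only rewritten under this lock, so it is now stable */
        return locked;
}

Once the loop exits we hold the very lock under which rq->engine is
assigned, so it can no longer change, and the caller's list_del() is
guaranteed to run against the right list.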

Sheesh... I am re-inventing paradigms here... :(

execlists_dequeue:
        real->active.lock
        ve->active.lock

__unwind_incomplete_requests:
        real->active.lock

Hmm. I trust the trick employed in the patch is well proven by this
point, but if we took the nested ve lock inside __unwind, would we need
to worry at all? Hmm.
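
Something along these lines, perhaps (hand-wavy sketch, the helper and
list names are made up; the point is only that the ve lock would always
nest inside the real lock):

static void unwind_virtual_request(struct intel_engine_cs *engine,
                                   struct i915_request *rq)
{
        lockdep_assert_held(&engine->active.lock);

        if (rq->engine != engine) {
                /* ve->active.lock only ever nested inside real->active.lock */
                spin_lock_nested(&rq->engine->active.lock,
                                 SINGLE_DEPTH_NESTING);
                list_move(&rq->sched.link, &rq->engine->active.requests);
                spin_unlock(&rq->engine->active.lock);
        }
}

With a fixed real -> ve order everywhere, lockdep could then vouch for
us.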

Might be nicer indeed. We would then only ever take the ve lock nested inside the real lock, never the opposite, right?

Regards,

Tvrtko