On Mon, Apr 14, 2025 at 9:19 PM Erico Nunes <nunes.er...@gmail.com> wrote: > > On Thu, Apr 10, 2025 at 4:04 PM Christian König > <christian.koe...@amd.com> wrote: > > > > Am 10.04.25 um 15:56 schrieb Qiang Yu: > > >>>> This prevents applications with multiple contexts from running into a > > >>>> race condition between running tasks and context destroy when > > >>>> terminating. > > >>>> > > > If implementing flush callback fix the problem, it must not be when > > > SIGKILL. > > > > SIGKILL also calls flush (again eventually), but we can detect that in the > > scheduler code. > > > > See the code and comment in drm_sched_entity_flush(). We could potentially > > improve the comment cause it's not 100% correct, but it should work under > > all cases. > > > > > Could you describe the problem and how this fix solves it? As I don't > > > understand > > > how the above difference could fix a race bug. > > > > That was the point I wasn't sure about either. It should *not* fix any > > race, it's just gives a nicer SIGKILL experience. > > Thanks for this feedback. So as mentioned in the initial letter, I'm > also trying to understand if this is the correct fix. > > This problem is unfortunately not trivial to reproduce, there is only > one application which can reproduce this so far and it is a > complicated one with multiple contexts and relatively lenghty jobs. > What I observed so far is that as it is right now, a context might be > destroyed while a job is running (e.g. by killing the application at > the right time), and a context destroy appears to release buffers that > are still being used by the running job which is what causes the read > faults.
Free buffer in context destroy when job running is the bug, we should dig into it how it happens. And if multiple context play a key role. > This added flush made it so that the jobs always finish before the > context destroy gets called, which prevents the issue for this > application in my attempts. But I suppose it might also just be that > it had some more time to finish now. If this is not the correct fix > then there might be some more missing synchronization between running > job and context destroy in lima? > I'm afraid this patch does not fix the root cause, we should find out the root cause first. This patch could be added as SIGKILL improvement later. > Thanks > > Erico