TBA/TMA were relocated to the upper half of the canonical address space. I don't think that qualifies as 32-bit by definition. But maybe you're using a different definition.
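For reference, with a 48-bit canonical VA space the distinction looks like this (a sketch with illustrative constants, not the driver's exact layout):

#include <stdbool.h>
#include <stdint.h>

#define VA_BITS 48

/* Sign-extended upper canonical half begins at 0xffff800000000000
 * for 48 VA bits; illustrative, not the exact base amdgpu uses. */
#define UPPER_HALF_BASE (~0ULL << (VA_BITS - 1))

static bool is_32bit_va(uint64_t va)   { return va < (1ULL << 32); }
static bool is_upper_half(uint64_t va) { return va >= UPPER_HALF_BASE; }

An upper-half address can never satisfy the 32-bit test, which is the point I was trying to make.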

That said, if Mesa manages its own virtual address space in user mode, and KFD maps the TMA/TBA at an address that Mesa believes to be free, I can see how that would lead to problems.
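A minimal sketch of that failure mode (hypothetical bookkeeping, not Mesa's actual allocator):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical: the UMD tracks free VA itself and hands it out... */
static uint64_t umd_next_free = 0x0800000000ULL;

static uint64_t umd_alloc_va(uint64_t size)
{
	uint64_t va = umd_next_free; /* UMD believes this is free */
	umd_next_free += size;
	return va;
}

int main(void)
{
	/* ...but the kernel independently mapped TBA/TMA at a fixed VA
	 * that the UMD never learns about, so the UMD's first map ioctl
	 * into that range is rejected with a VA conflict. */
	uint64_t kfd_reserved = 0x0800000000ULL;
	uint64_t va = umd_alloc_va(0x200);

	printf("UMD allocated 0x%llx, kernel already owns 0x%llx\n",
	       (unsigned long long)va, (unsigned long long)kfd_reserved);
	return 0;
}

Two independent bookkeepers for the same address space will collide sooner or later; only one of them can be authoritative.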

Separately, the fence refcount bug is another problem that may have been exposed by the way a crashing Mesa application shuts down. Reverting Jay's patch certainly doesn't fix that; it only hides the problem.

Regards,
  Felix


On 2024-01-04 13:29, Marek Olšák wrote:
Hi,

I have received information that the original commit makes all 32-bit
userspace VA allocations fail, so UMDs like Mesa can't even initialize
and either crash or fail to load. If TBA/TMA were relocated to the
32-bit address range, that would explain why UMDs can't allocate
anything in that range.

Marek

On Wed, Jan 3, 2024 at 2:50 PM Jay Cornwall <[email protected]> wrote:
On 1/3/2024 12:58, Felix Kuehling wrote:

A segfault in Mesa seems to be a different issue from what's mentioned
in the commit message. I'd let Christian or Marek comment on
compatibility with graphics UMDs. I'm not sure why this patch would
affect them at all.
I was referencing this issue in OpenCL/OpenGL interop, which certainly looked 
related:

[   91.769002] amdgpu 0000:0a:00.0: amdgpu: bo 000000009bba4692 va 0x0800000000-0x08000001ff conflict with 0x0800000000-0x0800000002
[   91.769141] ocltst[2781]: segfault at b2 ip 00007f3fb90a7c39 sp 00007ffd3c011ba0 error 4 in radeonsi_dri.so[7f3fb888e000+1196000] likely on CPU 15 (core 7, socket 0)
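
The two ranges in that message collide right at the base address. Conceptually the check that fired is just an interval-overlap test (a sketch; the driver actually tracks mappings in an interval tree):

#include <stdbool.h>
#include <stdint.h>

/* Inclusive [start, last] ranges, as printed in the dmesg line. */
static bool va_ranges_overlap(uint64_t a_start, uint64_t a_last,
			      uint64_t b_start, uint64_t b_last)
{
	return a_start <= b_last && b_start <= a_last;
}

/* va_ranges_overlap(0x0800000000, 0x08000001ff,
 *                   0x0800000000, 0x0800000002) -> true */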

Looking at the logs in the tickets, it looks like a fence reference
counting error. I don't see how Jay's patch could have caused that. I
made another change in that code recently that could make a difference
for this issue:

     commit 8f08c5b24ced1be7eb49692e4816c1916233c79b
     Author: Felix Kuehling <[email protected]>
     Date:   Fri Oct 27 18:21:55 2023 -0400

          drm/amdkfd: Run restore_workers on freezable WQs

          Make restore workers freezable so we don't have to explicitly flush them
          in suspend and GPU reset code paths, and we don't accidentally try to
          restore BOs while the GPU is suspended. Not having to flush restore_work
          also helps avoid lock/fence dependencies in the GPU reset case where
          we're not allowed to wait for fences.

          A side effect of this is that we can now have multiple concurrent threads
          trying to signal the same eviction fence. Rework eviction fence signaling
          and replacement to account for that.

          The GPU reset path can no longer rely on restore_process_worker to resume
          queues because evict/restore workers can run independently of it. Instead
          call a new restore_process_helper directly.

          This is an RFC and request for testing.

          v2:
          - Reworked eviction fence signaling
          - Introduced restore_process_helper

          v3:
          - Handle unsignaled eviction fences in restore_process_bos

          Signed-off-by: Felix Kuehling <[email protected]>
          Acked-by: Christian König <[email protected]>
          Tested-by: Emily Deng <[email protected]>
          Signed-off-by: Alex Deucher <[email protected]>
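
The freezable-workqueue part of that change boils down to allocating the queue with WQ_FREEZABLE, so the freezer quiesces it across suspend instead of the driver flushing it by hand. Roughly (a sketch, not the actual amdkfd code):

#include <linux/workqueue.h>

/* Hypothetical module-local queue; amdkfd's real setup differs. */
static struct workqueue_struct *restore_wq;

static int restore_wq_init(void)
{
	/* WQ_FREEZABLE: the freezer drains this queue during suspend,
	 * so no explicit flush is needed in the suspend/reset paths. */
	restore_wq = alloc_workqueue("kfd_restore_wq", WQ_FREEZABLE, 0);
	if (!restore_wq)
		return -ENOMEM;
	return 0;
}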


FWIW, I built a plain 6.6 kernel, and was not able to reproduce the
crash with some simple tests.

Regards,
    Felix


So I agree, let's revert it.

Reviewed-by: Jay Cornwall <[email protected]>
