On Wed, Sep 10, 2025 at 2:38 PM Timur Kristóf <timur.kris...@gmail.com> wrote:
>
> On Wed, 2025-09-10 at 11:10 -0400, Marek Olšák wrote:
> > I added the comment into Mesa that 0x3fff00 is the limit. I did
> > research on that bug separately from PAL, but I don't remember the
> > details.
> >
> > There is no performance to gain here. It's only about consistency and
> > clear communication to the public what the recommended SDMA
> > programming is.
> >
> > Marek
>
> Hi Christian & Marek,
>
> I'd be happy to submit a new patch series if we can agree on what the
> limits are for each SDMA version and how to get the best performance
> from it.
>
> I trust Marek's word that he did his research when he came up with
> 0x3fff00 so I don't understand the pushback here.
>
> (1) According to a conversation between Marek and myself, the most
> optimal way to program the SDMA would be to use 256-byte-aligned copy
> packets. Christian, do you agree with this? If yes, I can write a patch
> for that. If not, let me know what you would prefer.
>
> (2) I'd like us to come to an agreement on the maximum amount of bytes
> copied per SDMA packet, and align the kernel code with Mesa. At the
> moment, the maximum limit is different for every SDMA version between
> the kernel and Mesa, which makes the whole topic super confusing.
>
SI didn't have SDMA.  It used the older paging and DMA engine.
SDMA 2.x used byte_count directly in the packet so the limit is 0x1fffff
SDMA 3.x used byte_count directly in the packet so the limit is
0x3fffe0 or 0x3fff00 depending on alignment.
SDMA 4.0.x used byte_count -1 in the packet so the limit is 0x400000.
SDMA 4.4.x used byte_count -1 in the packet and increased the limit to
0x4000000.
SDMA 5.0.x used byte_count -1 in the packet so the limit is 0x400000.
SDMA 5.2.x and newer used byte_count -1 in the packet and increased
the limit to 0x4000000.

Alex

> Please keep in mind that if there is any chance that the current limit
> is indeed incorrect (as it sounds like it is), we risk "random" SDMA
> hangs on users' machines.
>
> Thanks & best regards,
> Timur
>
>
>
>
> >
> > On Wed, Sep 10, 2025 at 10:57 AM Christian König
> > <christian.koe...@amd.com> wrote:
> > > On 10.09.25 16:52, Marek Olšák wrote:
> > > > It's probably better to use 0x3fff00 to match Mesa and PAL. There
> > > > is no benefit in using a different limit, not even a perf
> > > > benefit, and it's better to be consistent with all UMDs.
> > >
> > > Unification with Mesa is certainly a valid argument, but just using
> > > an approach because PAL does it has proven to be troublesome
> > > before.
> > >
> > > If we are interested in best performance we should actually round
> > > down the value to the next multiple of PAGE_SIZE.
> > >
> > > Christian.
> > >
> > > >
> > > > Marek
> > > >
> > > > On Wed, Sep 10, 2025 at 7:54 AM Christian König
> > > > <christian.koe...@amd.com <mailto:christian.koe...@amd.com>>
> > > > wrote:
> > > >
> > > >      On 10.09.25 11:34, Timur Kristóf wrote:
> > > >      > On Wed, 2025-09-10 at 10:34 +0200, Christian König wrote:
> > > >      >> On 09.09.25 18:56, Timur Kristóf wrote:
> > > >      >>>> Even when we apply it I think we should drop that, the
> > > > value the
> > > >      >>>> kernel uses is correct.
> > > >      >>>
> > > >      >>> Hi Christian,
> > > >      >>>
> > > >      >>> The kernel and Mesa disagree on the limits for almost
> > > > all SDMA
> > > >      >>> versions, so it would be nice to actually understand
> > > > what the
> > > >      >>> limits of
> > > >      >>> the SDMA HW are and use the same limit in the kernel and
> > > > Mesa, or
> > > >      >>> if
> > > >      >>> that isn't viable, let's document why the different
> > > > limits make
> > > >      >>> sense.
> > > >      >>>
> > > >      >>> I'm adding Marek to CC, he wrote the comment that I
> > > > referenced
> > > >      >>> here.
> > > >      >>> As far as I understand from my conversation with Marek,
> > > > the kernel
> > > >      >>> is
> > > >      >>> actually wrong.
> > > >      >>>
> > > >      >>> If the limits depend on alignment, then we should either
> > > > set a
> > > >      >>> limit
> > > >      >>> that is always safe, or make sure SDMA copies in the
> > > > kernel are
> > > >      >>> always
> > > >      >>> aligned and add assertions about it.
> > > >      >>
> > > >      >> That's already done. See the size restrictions applied to
> > > > BOs and the
> > > >      >> callers of amdgpu_copy_buffer().
> > > >      >
> > > >      > Based on the code comments I cited, as far as I
> > > > understand, the issue
> > > >      > is with copying the last 256 bytes of 2^22-1. Do I
> > > > understood your
> > > >      > response correctly that you are saying that the kernel
> > > > isn't affected
> > > >      > by this issue because it always copies things that are 256
> > > > byte
> > > >      > aligned?
> > > >
> > > >      Yes, see the kernel always works with at least PAGE_SIZE
> > > > buffers (between 4k and 64k depending on CPU architecture). You
> > > > never get anything smaller than that.
> > > >
> > > >      The only exception is the debugger interface, but that
> > > > always copies < PAGE_SIZE, so the SDMA limits are irrelevant.
> > > >
> > > >      > I checked the callers of amdgpu_copy_buffer and can't find
> > > > what you are
> > > >      > referring to. However, assuming that all callers use
> > > > amdgpu_copy_buffer
> > > >      > on 256 byte aligned addresses, there is still an issue
> > > > with large BOs:
> > > >      >
> > > >      > When the kernel copies a BO that is larger than 0x3fffe0
> > > > bytes then it
> > > >      > needs to emit multiple SDMA copy packets, and the copy
> > > > done by the
> > > >      > second packet (and other subsequent packets) won't be 256
> > > > byte aligned.
> > > >
> > > >      I've just double checked the documentation and for SI it
> > > > clearly states that the limit is 0x3fffe0. Documentation for
> > > > later HW says 2^22 - 1 - start_addr[4:2] (which is 0x1f in the
> > > > worst case so you end up with 0x3fffe0 again).
> > > >
> > > >      For backward, tiled and windowed copies a 256byte bounce
> > > > buffer is used instead of the 32byte buffer for the linear copy.
> > > > So the limit is then 0x3fff00.
> > > >
> > > >      Looks like that buffer is also used for byte aligned copies
> > > > and that limit applies there as well.
> > > >
> > > >      >>
> > > >      >> We could add another warning to amdgpu_copy_buffer(), but
> > > > that is
> > > >      >> just the backend function.
> > > >      >>
> > > >      >>> Looking at the implementation of
> > > >      >>> amdgpu_copy_buffer in the kernel, I see that it relies
> > > > on
> > > >      >>> copy_max_bytes and doesn't take alignment into account,
> > > > so with the
> > > >      >>> current limit it could issue subsequent copies that
> > > > aren't 256 byte
> > > >      >>> aligned.
> > > >      >>
> > > >      >> "Due to HW limitation, the maximum count may not be 2^n-
> > > > 1, can only
> > > >      >> be 2^n - 1 - start_addr[4:2]"
> > > >      >>
> > > >      >> Well according to this comments the size restriction is
> > > > 32 bytes (256
> > > >      >> bits!) and not 256 bytes.
> > > >      >>
> > > >      >> Were do you actually get the 256 bytes restriction from?
> > > >      >
> > > >      > The comments I cited above are from the following sources:
> > > >      >
> > > >      > PAL uses 1<<22-256 = 0x3fff00 here:
> > > >      >
> > > > https://github.com/GPUOpen-Drivers/pal/blob/bcec463efe5260776d486a5e3da0c549bc0a75d2/src/core/hw/ossip/oss2/oss2DmaCmdBuffer.cpp#L308
> > > > <https://github.com/GPUOpen-Drivers/pal/blob/bcec463efe5260776d48
> > > > 6a5e3da0c549bc0a75d2/src/core/hw/ossip/oss2/oss2DmaCmdBuffer.cpp#
> > > > L308>
> > > >      >
> > > >      > Mesa also uses 0x3fff00 here:
> > > >      >
> > > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/47f5d25f93fd36224c112ee2d48e511ae078f8c2/src/amd/common/sid.h#L390
> > > > <https://gitlab.freedesktop.org/mesa/mesa/-/blob/47f5d25f93fd3622
> > > > 4c112ee2d48e511ae078f8c2/src/amd/common/sid.h#L390>
> > > >      >
> > > >      > The limit in Mesa was added by this commit:
> > > >      >
> > > > https://gitlab.freedesktop.org/mesa/mesa/-/commit/2c1f249f2b61be50222411bc0d41c095004232ed
> > > > <https://gitlab.freedesktop.org/mesa/mesa/-/commit/2c1f249f2b61be
> > > > 50222411bc0d41c095004232ed>
> > > >      > According to the commit message, Dave added this limit
> > > > when hitting an
> > > >      > issue trying to use SDMA with buffers that are larger than
> > > > this.
> > > >      >
> > > >      > For SDMA v5.2 and newer, a larger limit was added by Marek
> > > > later:
> > > >      >
> > > > https://gitlab.freedesktop.org/mesa/mesa/-/commit/a54bcb9429666fcbe38c04660cc4b3f8abbde259
> > > > <https://gitlab.freedesktop.org/mesa/mesa/-/commit/a54bcb9429666f
> > > > cbe38c04660cc4b3f8abbde259>
> > > >      > Which confirms the same issue copying the last 256 bytes
> > > > on these
> > > >      > versions, although in this case the kernel isn't
> > > > technically wrong
> > > >      > because it uses a smaller overall maximum.
> > > >
> > > >      Yeah, I mean that totally makes sense. When you want allow
> > > > byte aligned copies as well then you have to use 0x3fff00 as
> > > > maximum.
> > > >
> > > >      The PAL code even has extra checks to speed up copies where
> > > > src and dst are byte aligned, but size isn't:
> > > >
> > > >          if (IsPow2Aligned(srcGpuAddr, sizeof(uint32)) &&
> > > >              IsPow2Aligned(dstGpuAddr, sizeof(uint32)) &&
> > > >              (*pBytesCopied >= sizeof(uint32)))
> > > >      ....
> > > >
> > > >      So it totally makes sense to use the lower limit in Mesa but
> > > > not for the kernel.
> > > >
> > > >      Regards,
> > > >      Christian.
> > > >
> > >

Reply via email to