The optimal alignment for performance would be 256B because that's the
granularity of interleaving memory channels if you care about not
overloading one memory channel on the packet break. In other cases, like
gfx10 and newer, the next optimal alignment is 128B because that's the
cache line size, which is the maximum request size. If uncached, the next
optimal alignment is the minimum memory request size, which is 32B or 64B.

Since the kernel likely copies with VMID 0, the page table alignment
doesn't even enter the picture, though it would be cheap because it would
be a UTCL2 cache hit.

Marek

On Thu, Sep 11, 2025, 13:52 Alex Deucher <alexdeuc...@gmail.com> wrote:

> On Thu, Sep 11, 2025 at 10:46 AM Alex Deucher <alexdeuc...@gmail.com>
> wrote:
> >
> > On Wed, Sep 10, 2025 at 2:38 PM Timur Kristóf <timur.kris...@gmail.com>
> wrote:
> > >
> > > On Wed, 2025-09-10 at 11:10 -0400, Marek Olšák wrote:
> > > > I added the comment into Mesa that 0x3fff00 is the limit. I did
> > > > research on that bug separately from PAL, but I don't remember the
> > > > details.
> > > >
> > > > There is no performance to gain here. It's only about consistency and
> > > > clear communication to the public what the recommended SDMA
> > > > programming is.
> > > >
> > > > Marek
> > >
> > > Hi Christian & Marek,
> > >
> > > I'd be happy to submit a new patch series if we can agree on what the
> > > limits are for each SDMA version and how to get the best performance
> > > from it.
> > >
> > > I trust Marek's word that he did his research when he came up with
> > > 0x3fff00 so I don't understand the pushback here.
> > >
> > > (1) According to a conversation between Marek and myself, the most
> > > optimal way to program the SDMA would be to use 256-byte-aligned copy
> > > packets. Christian, do you agree with this? If yes, I can write a patch
> > > for that. If not, let me know what you would prefer.
> > >
> > > (2) I'd like us to come to an agreement on the maximum amount of bytes
> > > copied per SDMA packet, and align the kernel code with Mesa. At the
> > > moment, the maximum limit is different for every SDMA version between
> > > the kernel and Mesa, which makes the whole topic super confusing.
> > >
> > SI didn't have SDMA.  It used the older paging and DMA engine.
> > SDMA 2.x used byte_count directly in the packet so the limit is 0x1fffff
> > SDMA 3.x used byte_count directly in the packet so the limit is
> > 0x3fffe0 or 0x3fff00 depending on alignment.
> > SDMA 4.0.x used byte_count -1 in the packet so the limit is 0x400000.
> > SDMA 4.4.x used byte_count -1 in the packet and increased the limit to
> > 0x4000000.
> > SDMA 5.0.x used byte_count -1 in the packet so the limit is 0x400000.
> > SDMA 5.2.x and newer used byte_count -1 in the packet and increased
> > the limit to 0x4000000.
>
> Sorry I typed these.  There are 22 bits of byte count (0x400000) sdma
> 4.0.x and 5.0.x, and 30 bits of byte count for sdma 4.4.x and sdma 5.2
> and newer (0x40000000).
>
> Alex
>
> >
> > Alex
> >
> > > Please keep in mind that if there is any chance that the current limit
> > > is indeed incorrect (as it sounds like it is), we risk "random" SDMA
> > > hangs on users' machines.
> > >
> > > Thanks & best regards,
> > > Timur
> > >
> > >
> > >
> > >
> > > >
> > > > On Wed, Sep 10, 2025 at 10:57 AM Christian König
> > > > <christian.koe...@amd.com> wrote:
> > > > > On 10.09.25 16:52, Marek Olšák wrote:
> > > > > > It's probably better to use 0x3fff00 to match Mesa and PAL. There
> > > > > > is no benefit in using a different limit, not even a perf
> > > > > > benefit, and it's better to be consistent with all UMDs.
> > > > >
> > > > > Unification with Mesa is certainly a valid argument, but just using
> > > > > an approach because PAL does it has proven to be troublesome
> > > > > before.
> > > > >
> > > > > If we are interested in best performance we should actually round
> > > > > down the value to the next multiple of PAGE_SIZE.
> > > > >
> > > > > Christian.
> > > > >
> > > > > >
> > > > > > Marek
> > > > > >
> > > > > > On Wed, Sep 10, 2025 at 7:54 AM Christian König
> > > > > > <christian.koe...@amd.com <mailto:christian.koe...@amd.com>>
> > > > > > wrote:
> > > > > >
> > > > > >      On 10.09.25 11:34, Timur Kristóf wrote:
> > > > > >      > On Wed, 2025-09-10 at 10:34 +0200, Christian König wrote:
> > > > > >      >> On 09.09.25 18:56, Timur Kristóf wrote:
> > > > > >      >>>> Even when we apply it I think we should drop that, the
> > > > > > value the
> > > > > >      >>>> kernel uses is correct.
> > > > > >      >>>
> > > > > >      >>> Hi Christian,
> > > > > >      >>>
> > > > > >      >>> The kernel and Mesa disagree on the limits for almost
> > > > > > all SDMA
> > > > > >      >>> versions, so it would be nice to actually understand
> > > > > > what the
> > > > > >      >>> limits of
> > > > > >      >>> the SDMA HW are and use the same limit in the kernel and
> > > > > > Mesa, or
> > > > > >      >>> if
> > > > > >      >>> that isn't viable, let's document why the different
> > > > > > limits make
> > > > > >      >>> sense.
> > > > > >      >>>
> > > > > >      >>> I'm adding Marek to CC, he wrote the comment that I
> > > > > > referenced
> > > > > >      >>> here.
> > > > > >      >>> As far as I understand from my conversation with Marek,
> > > > > > the kernel
> > > > > >      >>> is
> > > > > >      >>> actually wrong.
> > > > > >      >>>
> > > > > >      >>> If the limits depend on alignment, then we should either
> > > > > > set a
> > > > > >      >>> limit
> > > > > >      >>> that is always safe, or make sure SDMA copies in the
> > > > > > kernel are
> > > > > >      >>> always
> > > > > >      >>> aligned and add assertions about it.
> > > > > >      >>
> > > > > >      >> That's already done. See the size restrictions applied to
> > > > > > BOs and the
> > > > > >      >> callers of amdgpu_copy_buffer().
> > > > > >      >
> > > > > >      > Based on the code comments I cited, as far as I
> > > > > > understand, the issue
> > > > > >      > is with copying the last 256 bytes of 2^22-1. Do I
> > > > > > understood your
> > > > > >      > response correctly that you are saying that the kernel
> > > > > > isn't affected
> > > > > >      > by this issue because it always copies things that are 256
> > > > > > byte
> > > > > >      > aligned?
> > > > > >
> > > > > >      Yes, see the kernel always works with at least PAGE_SIZE
> > > > > > buffers (between 4k and 64k depending on CPU architecture). You
> > > > > > never get anything smaller than that.
> > > > > >
> > > > > >      The only exception is the debugger interface, but that
> > > > > > always copies < PAGE_SIZE, so the SDMA limits are irrelevant.
> > > > > >
> > > > > >      > I checked the callers of amdgpu_copy_buffer and can't find
> > > > > > what you are
> > > > > >      > referring to. However, assuming that all callers use
> > > > > > amdgpu_copy_buffer
> > > > > >      > on 256 byte aligned addresses, there is still an issue
> > > > > > with large BOs:
> > > > > >      >
> > > > > >      > When the kernel copies a BO that is larger than 0x3fffe0
> > > > > > bytes then it
> > > > > >      > needs to emit multiple SDMA copy packets, and the copy
> > > > > > done by the
> > > > > >      > second packet (and other subsequent packets) won't be 256
> > > > > > byte aligned.
> > > > > >
> > > > > >      I've just double checked the documentation and for SI it
> > > > > > clearly states that the limit is 0x3fffe0. Documentation for
> > > > > > later HW says 2^22 - 1 - start_addr[4:2] (which is 0x1f in the
> > > > > > worst case so you end up with 0x3fffe0 again).
> > > > > >
> > > > > >      For backward, tiled and windowed copies a 256byte bounce
> > > > > > buffer is used instead of the 32byte buffer for the linear copy.
> > > > > > So the limit is then 0x3fff00.
> > > > > >
> > > > > >      Looks like that buffer is also used for byte aligned copies
> > > > > > and that limit applies there as well.
> > > > > >
> > > > > >      >>
> > > > > >      >> We could add another warning to amdgpu_copy_buffer(), but
> > > > > > that is
> > > > > >      >> just the backend function.
> > > > > >      >>
> > > > > >      >>> Looking at the implementation of
> > > > > >      >>> amdgpu_copy_buffer in the kernel, I see that it relies
> > > > > > on
> > > > > >      >>> copy_max_bytes and doesn't take alignment into account,
> > > > > > so with the
> > > > > >      >>> current limit it could issue subsequent copies that
> > > > > > aren't 256 byte
> > > > > >      >>> aligned.
> > > > > >      >>
> > > > > >      >> "Due to HW limitation, the maximum count may not be 2^n-
> > > > > > 1, can only
> > > > > >      >> be 2^n - 1 - start_addr[4:2]"
> > > > > >      >>
> > > > > >      >> Well according to this comments the size restriction is
> > > > > > 32 bytes (256
> > > > > >      >> bits!) and not 256 bytes.
> > > > > >      >>
> > > > > >      >> Were do you actually get the 256 bytes restriction from?
> > > > > >      >
> > > > > >      > The comments I cited above are from the following sources:
> > > > > >      >
> > > > > >      > PAL uses 1<<22-256 = 0x3fff00 here:
> > > > > >      >
> > > > > >
> https://github.com/GPUOpen-Drivers/pal/blob/bcec463efe5260776d486a5e3da0c549bc0a75d2/src/core/hw/ossip/oss2/oss2DmaCmdBuffer.cpp#L308
> > > > > > <
> https://github.com/GPUOpen-Drivers/pal/blob/bcec463efe5260776d48
> > > > > > 6a5e3da0c549bc0a75d2/src/core/hw/ossip/oss2/oss2DmaCmdBuffer.cpp#
> > > > > > L308>
> > > > > >      >
> > > > > >      > Mesa also uses 0x3fff00 here:
> > > > > >      >
> > > > > >
> https://gitlab.freedesktop.org/mesa/mesa/-/blob/47f5d25f93fd36224c112ee2d48e511ae078f8c2/src/amd/common/sid.h#L390
> > > > > > <
> https://gitlab.freedesktop.org/mesa/mesa/-/blob/47f5d25f93fd3622
> > > > > > 4c112ee2d48e511ae078f8c2/src/amd/common/sid.h#L390>
> > > > > >      >
> > > > > >      > The limit in Mesa was added by this commit:
> > > > > >      >
> > > > > >
> https://gitlab.freedesktop.org/mesa/mesa/-/commit/2c1f249f2b61be50222411bc0d41c095004232ed
> > > > > > <
> https://gitlab.freedesktop.org/mesa/mesa/-/commit/2c1f249f2b61be
> > > > > > 50222411bc0d41c095004232ed>
> > > > > >      > According to the commit message, Dave added this limit
> > > > > > when hitting an
> > > > > >      > issue trying to use SDMA with buffers that are larger than
> > > > > > this.
> > > > > >      >
> > > > > >      > For SDMA v5.2 and newer, a larger limit was added by Marek
> > > > > > later:
> > > > > >      >
> > > > > >
> https://gitlab.freedesktop.org/mesa/mesa/-/commit/a54bcb9429666fcbe38c04660cc4b3f8abbde259
> > > > > > <
> https://gitlab.freedesktop.org/mesa/mesa/-/commit/a54bcb9429666f
> > > > > > cbe38c04660cc4b3f8abbde259>
> > > > > >      > Which confirms the same issue copying the last 256 bytes
> > > > > > on these
> > > > > >      > versions, although in this case the kernel isn't
> > > > > > technically wrong
> > > > > >      > because it uses a smaller overall maximum.
> > > > > >
> > > > > >      Yeah, I mean that totally makes sense. When you want allow
> > > > > > byte aligned copies as well then you have to use 0x3fff00 as
> > > > > > maximum.
> > > > > >
> > > > > >      The PAL code even has extra checks to speed up copies where
> > > > > > src and dst are byte aligned, but size isn't:
> > > > > >
> > > > > >          if (IsPow2Aligned(srcGpuAddr, sizeof(uint32)) &&
> > > > > >              IsPow2Aligned(dstGpuAddr, sizeof(uint32)) &&
> > > > > >              (*pBytesCopied >= sizeof(uint32)))
> > > > > >      ....
> > > > > >
> > > > > >      So it totally makes sense to use the lower limit in Mesa but
> > > > > > not for the kernel.
> > > > > >
> > > > > >      Regards,
> > > > > >      Christian.
> > > > > >
> > > > >
>

Reply via email to