On Wed, Sep 10, 2025 at 2:38 PM Timur Kristóf <timur.kris...@gmail.com> wrote: > > On Wed, 2025-09-10 at 11:10 -0400, Marek Olšák wrote: > > I added the comment into Mesa that 0x3fff00 is the limit. I did > > research on that bug separately from PAL, but I don't remember the > > details. > > > > There is no performance to gain here. It's only about consistency and > > clear communication to the public what the recommended SDMA > > programming is. > > > > Marek > > Hi Christian & Marek, > > I'd be happy to submit a new patch series if we can agree on what the > limits are for each SDMA version and how to get the best performance > from it. > > I trust Marek's word that he did his research when he came up with > 0x3fff00 so I don't understand the pushback here. > > (1) According to a conversation between Marek and myself, the most > optimal way to program the SDMA would be to use 256-byte-aligned copy > packets. Christian, do you agree with this? If yes, I can write a patch > for that. If not, let me know what you would prefer. > > (2) I'd like us to come to an agreement on the maximum amount of bytes > copied per SDMA packet, and align the kernel code with Mesa. At the > moment, the maximum limit is different for every SDMA version between > the kernel and Mesa, which makes the whole topic super confusing. > SI didn't have SDMA. It used the older paging and DMA engine. SDMA 2.x used byte_count directly in the packet so the limit is 0x1fffff SDMA 3.x used byte_count directly in the packet so the limit is 0x3fffe0 or 0x3fff00 depending on alignment. SDMA 4.0.x used byte_count -1 in the packet so the limit is 0x400000. SDMA 4.4.x used byte_count -1 in the packet and increased the limit to 0x4000000. SDMA 5.0.x used byte_count -1 in the packet so the limit is 0x400000. SDMA 5.2.x and newer used byte_count -1 in the packet and increased the limit to 0x4000000.
Alex > Please keep in mind that if there is any chance that the current limit > is indeed incorrect (as it sounds like it is), we risk "random" SDMA > hangs on users' machines. > > Thanks & best regards, > Timur > > > > > > > > On Wed, Sep 10, 2025 at 10:57 AM Christian König > > <christian.koe...@amd.com> wrote: > > > On 10.09.25 16:52, Marek Olšák wrote: > > > > It's probably better to use 0x3fff00 to match Mesa and PAL. There > > > > is no benefit in using a different limit, not even a perf > > > > benefit, and it's better to be consistent with all UMDs. > > > > > > Unification with Mesa is certainly a valid argument, but just using > > > an approach because PAL does it has proven to be troublesome > > > before. > > > > > > If we are interested in best performance we should actually round > > > down the value to the next multiple of PAGE_SIZE. > > > > > > Christian. > > > > > > > > > > > Marek > > > > > > > > On Wed, Sep 10, 2025 at 7:54 AM Christian König > > > > <christian.koe...@amd.com <mailto:christian.koe...@amd.com>> > > > > wrote: > > > > > > > > On 10.09.25 11:34, Timur Kristóf wrote: > > > > > On Wed, 2025-09-10 at 10:34 +0200, Christian König wrote: > > > > >> On 09.09.25 18:56, Timur Kristóf wrote: > > > > >>>> Even when we apply it I think we should drop that, the > > > > value the > > > > >>>> kernel uses is correct. > > > > >>> > > > > >>> Hi Christian, > > > > >>> > > > > >>> The kernel and Mesa disagree on the limits for almost > > > > all SDMA > > > > >>> versions, so it would be nice to actually understand > > > > what the > > > > >>> limits of > > > > >>> the SDMA HW are and use the same limit in the kernel and > > > > Mesa, or > > > > >>> if > > > > >>> that isn't viable, let's document why the different > > > > limits make > > > > >>> sense. > > > > >>> > > > > >>> I'm adding Marek to CC, he wrote the comment that I > > > > referenced > > > > >>> here. > > > > >>> As far as I understand from my conversation with Marek, > > > > the kernel > > > > >>> is > > > > >>> actually wrong. > > > > >>> > > > > >>> If the limits depend on alignment, then we should either > > > > set a > > > > >>> limit > > > > >>> that is always safe, or make sure SDMA copies in the > > > > kernel are > > > > >>> always > > > > >>> aligned and add assertions about it. > > > > >> > > > > >> That's already done. See the size restrictions applied to > > > > BOs and the > > > > >> callers of amdgpu_copy_buffer(). > > > > > > > > > > Based on the code comments I cited, as far as I > > > > understand, the issue > > > > > is with copying the last 256 bytes of 2^22-1. Do I > > > > understood your > > > > > response correctly that you are saying that the kernel > > > > isn't affected > > > > > by this issue because it always copies things that are 256 > > > > byte > > > > > aligned? > > > > > > > > Yes, see the kernel always works with at least PAGE_SIZE > > > > buffers (between 4k and 64k depending on CPU architecture). You > > > > never get anything smaller than that. > > > > > > > > The only exception is the debugger interface, but that > > > > always copies < PAGE_SIZE, so the SDMA limits are irrelevant. > > > > > > > > > I checked the callers of amdgpu_copy_buffer and can't find > > > > what you are > > > > > referring to. However, assuming that all callers use > > > > amdgpu_copy_buffer > > > > > on 256 byte aligned addresses, there is still an issue > > > > with large BOs: > > > > > > > > > > When the kernel copies a BO that is larger than 0x3fffe0 > > > > bytes then it > > > > > needs to emit multiple SDMA copy packets, and the copy > > > > done by the > > > > > second packet (and other subsequent packets) won't be 256 > > > > byte aligned. > > > > > > > > I've just double checked the documentation and for SI it > > > > clearly states that the limit is 0x3fffe0. Documentation for > > > > later HW says 2^22 - 1 - start_addr[4:2] (which is 0x1f in the > > > > worst case so you end up with 0x3fffe0 again). > > > > > > > > For backward, tiled and windowed copies a 256byte bounce > > > > buffer is used instead of the 32byte buffer for the linear copy. > > > > So the limit is then 0x3fff00. > > > > > > > > Looks like that buffer is also used for byte aligned copies > > > > and that limit applies there as well. > > > > > > > > >> > > > > >> We could add another warning to amdgpu_copy_buffer(), but > > > > that is > > > > >> just the backend function. > > > > >> > > > > >>> Looking at the implementation of > > > > >>> amdgpu_copy_buffer in the kernel, I see that it relies > > > > on > > > > >>> copy_max_bytes and doesn't take alignment into account, > > > > so with the > > > > >>> current limit it could issue subsequent copies that > > > > aren't 256 byte > > > > >>> aligned. > > > > >> > > > > >> "Due to HW limitation, the maximum count may not be 2^n- > > > > 1, can only > > > > >> be 2^n - 1 - start_addr[4:2]" > > > > >> > > > > >> Well according to this comments the size restriction is > > > > 32 bytes (256 > > > > >> bits!) and not 256 bytes. > > > > >> > > > > >> Were do you actually get the 256 bytes restriction from? > > > > > > > > > > The comments I cited above are from the following sources: > > > > > > > > > > PAL uses 1<<22-256 = 0x3fff00 here: > > > > > > > > > https://github.com/GPUOpen-Drivers/pal/blob/bcec463efe5260776d486a5e3da0c549bc0a75d2/src/core/hw/ossip/oss2/oss2DmaCmdBuffer.cpp#L308 > > > > <https://github.com/GPUOpen-Drivers/pal/blob/bcec463efe5260776d48 > > > > 6a5e3da0c549bc0a75d2/src/core/hw/ossip/oss2/oss2DmaCmdBuffer.cpp# > > > > L308> > > > > > > > > > > Mesa also uses 0x3fff00 here: > > > > > > > > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/47f5d25f93fd36224c112ee2d48e511ae078f8c2/src/amd/common/sid.h#L390 > > > > <https://gitlab.freedesktop.org/mesa/mesa/-/blob/47f5d25f93fd3622 > > > > 4c112ee2d48e511ae078f8c2/src/amd/common/sid.h#L390> > > > > > > > > > > The limit in Mesa was added by this commit: > > > > > > > > > https://gitlab.freedesktop.org/mesa/mesa/-/commit/2c1f249f2b61be50222411bc0d41c095004232ed > > > > <https://gitlab.freedesktop.org/mesa/mesa/-/commit/2c1f249f2b61be > > > > 50222411bc0d41c095004232ed> > > > > > According to the commit message, Dave added this limit > > > > when hitting an > > > > > issue trying to use SDMA with buffers that are larger than > > > > this. > > > > > > > > > > For SDMA v5.2 and newer, a larger limit was added by Marek > > > > later: > > > > > > > > > https://gitlab.freedesktop.org/mesa/mesa/-/commit/a54bcb9429666fcbe38c04660cc4b3f8abbde259 > > > > <https://gitlab.freedesktop.org/mesa/mesa/-/commit/a54bcb9429666f > > > > cbe38c04660cc4b3f8abbde259> > > > > > Which confirms the same issue copying the last 256 bytes > > > > on these > > > > > versions, although in this case the kernel isn't > > > > technically wrong > > > > > because it uses a smaller overall maximum. > > > > > > > > Yeah, I mean that totally makes sense. When you want allow > > > > byte aligned copies as well then you have to use 0x3fff00 as > > > > maximum. > > > > > > > > The PAL code even has extra checks to speed up copies where > > > > src and dst are byte aligned, but size isn't: > > > > > > > > if (IsPow2Aligned(srcGpuAddr, sizeof(uint32)) && > > > > IsPow2Aligned(dstGpuAddr, sizeof(uint32)) && > > > > (*pBytesCopied >= sizeof(uint32))) > > > > .... > > > > > > > > So it totally makes sense to use the lower limit in Mesa but > > > > not for the kernel. > > > > > > > > Regards, > > > > Christian. > > > > > > >