On Mon, 2025-09-15 at 09:07 -0400, Alex Deucher wrote:
> On Sat, Sep 13, 2025 at 1:28 AM <timur.kris...@gmail.com> wrote:
> >
> > On Fri, 2025-09-12 at 15:38 -0400, Alex Deucher wrote:
> > > On Thu, Sep 11, 2025 at 2:18 PM Alex Deucher
> > > <alexdeuc...@gmail.com> wrote:
> > > >
> > > > On Thu, Sep 11, 2025 at 1:25 PM Alex Deucher
> > > > <alexander.deuc...@amd.com> wrote:
> > > > >
> > > > > SDMA 5.2.x has increased transfer limits.
> > > > >
> > > > > v2: fix harder, use shifts to make it more obvious
> > > > >
> > > > > Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
> > > > > ---
> > > > >  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 4 ++--
> > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > > > > index a8e39df29f343..bf227eadbe487 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > > > > @@ -2065,11 +2065,11 @@ static void sdma_v5_2_emit_fill_buffer(struct amdgpu_ib *ib,
> > > > >  }
> > > > >
> > > > >  static const struct amdgpu_buffer_funcs sdma_v5_2_buffer_funcs = {
> > > > > -	.copy_max_bytes = 0x400000,
> > > > > +	.copy_max_bytes = 1 << 30,
> > > > >  	.copy_num_dw = 7,
> > > > >  	.emit_copy_buffer = sdma_v5_2_emit_copy_buffer,
> > > > >
> > > > > -	.fill_max_bytes = 0x400000,
> > > > > +	.fill_max_bytes = 1 << 30,
> > > >
> > > > The hw docs and PAL differ here. I've asked the hw designers to
> > > > clarify.
> > >
> > > The HW team verified that the hardware supports the extended range
> > > for both copies and fills.
> > >
> > > Alex
> >
> > Hi Alex,
> >
> > This is still pretty confusing.
> > According to PAL, only SDMA v6 has the extended range for fills,
> > and it can do 4 bytes fewer.
> >
> > Are you sure that PAL is wrong about this?
>
> I can talk to the PAL team as well. I talked to the hardware
> designers and they verified that the hardware has the higher limit.
> It's the same underlying hardware so it makes sense that both copies
> and fills would have the same limit.

I am worried that they found some issues with it and that's why they
didn't enable it.

>
> >
> > For reference:
> > https://github.com/GPUOpen-Drivers/pal/blob/dev/src/core/hw/gfxip/sdma/gfx10/gfx10DmaCmdBuffer.cpp
> > https://github.com/GPUOpen-Drivers/pal/blob/dev/src/core/hw/gfxip/sdma/gfx12/gfx12DmaCmdBuffer.cpp
> >
> > MaxCopySize on GFX10: 1 << 22
> > MaxCopySize on GFX10.3+: 1 << 30
> >
> > MaxFillSize on GFX10-10.3: (1 << 22 - 1) & ~3
> > MaxFillSize on GFX11+: (1 << 30 - 1) & ~3
> > This makes sense because they program the count field in the packet
> > using the byte count minus four.
>
> They are setting up the packet for dword fill rather than byte fill
> so count becomes dword aligned:
>
> // Because we will set fillsize = 2, the low two bits of our
> "count" are ignored, but we still program
> // this in terms of bytes.

Yes. I thought we would prefer to use dword fill in the kernel as well,
isn't that the case? I thought dword fill is faster and everything that
the kernel fills would be already dword aligned. Am I missing
something?

Thanks,
Timur
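
For reference, a minimal sketch of the limit arithmetic discussed in this thread. It is neither PAL nor amdgpu code: the helper names max_dword_fill_bytes() and num_packets() are hypothetical, and the explicit parenthesization ((1 << bits) - 1) & ~3 is an assumption about what the MaxFillSize shorthand quoted above is meant to express. Written literally in C, "1 << 22 - 1" would parse as "1 << 21", because subtraction binds tighter than the shift.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not PAL or amdgpu code): the largest dword-aligned
 * byte count strictly below 1 << bits. The parentheses are deliberate:
 * "1 << bits - 1" without them would shift by (bits - 1) instead.
 */
static uint32_t max_dword_fill_bytes(unsigned int bits)
{
	assert(bits >= 2 && bits < 32);
	return ((UINT32_C(1) << bits) - 1) & ~UINT32_C(3);
}

/*
 * Hypothetical helper: how many packets are needed to cover `total` bytes
 * when a single packet can move at most `max_bytes` bytes (rounded up).
 */
static uint64_t num_packets(uint64_t total, uint32_t max_bytes)
{
	assert(max_bytes != 0);
	return (total + max_bytes - 1) / max_bytes;
}
```

Under these assumptions, max_dword_fill_bytes(22) is 0x3FFFFC and max_dword_fill_bytes(30) is 0x3FFFFFFC, i.e. four bytes below 1 << 22 and 1 << 30, which matches the "4 bytes fewer" remark above. Likewise num_packets(1ULL << 30, 0x400000) is 256, while num_packets(1ULL << 30, 1 << 30) is 1.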