On Mon, 2025-09-15 at 09:07 -0400, Alex Deucher wrote:
> On Sat, Sep 13, 2025 at 1:28 AM <timur.kris...@gmail.com> wrote:
> > 
> > On Fri, 2025-09-12 at 15:38 -0400, Alex Deucher wrote:
> > > On Thu, Sep 11, 2025 at 2:18 PM Alex Deucher
> > > <alexdeuc...@gmail.com>
> > > wrote:
> > > > 
> > > > On Thu, Sep 11, 2025 at 1:25 PM Alex Deucher
> > > > <alexander.deuc...@amd.com> wrote:
> > > > > 
> > > > > SDMA 5.2.x has increased transfer limits.
> > > > > 
> > > > > v2: fix harder, use shifts to make it more obvious
> > > > > 
> > > > > Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
> > > > > ---
> > > > >  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 4 ++--
> > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > > > > index a8e39df29f343..bf227eadbe487 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
> > > > > @@ -2065,11 +2065,11 @@ static void sdma_v5_2_emit_fill_buffer(struct amdgpu_ib *ib,
> > > > >  }
> > > > > 
> > > > >  static const struct amdgpu_buffer_funcs sdma_v5_2_buffer_funcs = {
> > > > > -       .copy_max_bytes = 0x400000,
> > > > > +       .copy_max_bytes = 1 << 30,
> > > > >         .copy_num_dw = 7,
> > > > >         .emit_copy_buffer = sdma_v5_2_emit_copy_buffer,
> > > > > 
> > > > > -       .fill_max_bytes = 0x400000,
> > > > > +       .fill_max_bytes = 1 << 30,
> > > > 
> > > > The hw docs and PAL differ here.  I've asked the hw designers to
> > > > clarify.
> > > 
> > > The HW team verified that the hardware supports the extended range
> > > for both copies and fills.
> > > 
> > > Alex
> > 
> > Hi Alex,
> > 
> > This is still pretty confusing.
> > According to PAL, only SDMA v6 has the extended range for fills, and
> > it can do 4 bytes fewer.
> > 
> > Are you sure that PAL is wrong about this?
> 
> I can talk to the PAL team as well.  I talked to the hardware
> designers and they verified that the hardware has the higher limit.
> It's the same underlying hardware so it makes sense that both copies
> and fills would have the same limit.

I am worried that they found some issues with it and that's why they
didn't enable it.

> 
> > 
> > For reference:
> > https://github.com/GPUOpen-Drivers/pal/blob/dev/src/core/hw/gfxip/sdma/gfx10/gfx10DmaCmdBuffer.cpp
> > https://github.com/GPUOpen-Drivers/pal/blob/dev/src/core/hw/gfxip/sdma/gfx12/gfx12DmaCmdBuffer.cpp
> > 
> > MaxCopySize on GFX10: 1 << 22
> > MaxCopySize on GFX10.3+: 1 << 30
> > 
> > MaxFillSize on GFX10-10.3: ((1 << 22) - 1) & ~3
> > MaxFillSize on GFX11+: ((1 << 30) - 1) & ~3
> > This makes sense because they program the count field in the packet
> > using the byte count minus four.
> 
> They are setting up the packet for dword fill rather than byte fill so
> count becomes dword aligned:
> 
>     // Because we will set fillsize = 2, the low two bits of our "count"
>     // are ignored, but we still program this in terms of bytes.

Yes. I thought we would prefer to use dword fill in the kernel as well,
isn't that the case? I thought dword fill is faster and everything that
the kernel fills would be already dword aligned. Am I missing
something?
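
To make the PAL fill limits quoted above concrete, here is a minimal sketch of the arithmetic. The helper name max_dword_aligned_bytes() is hypothetical and is not taken from PAL or amdgpu; it only assumes that the limit is the largest dword-aligned byte count that fits in a count field of the given width (22 or 30 bits).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from PAL or amdgpu): largest byte count that
 * fits in a count field of `field_bits` bits once the low two bits are
 * masked off for a dword fill. Assumes field_bits < 32.
 */
static inline uint32_t max_dword_aligned_bytes(unsigned int field_bits)
{
	/* Parenthesize the shift: in C, "1 << 22 - 1" parses as "1 << 21". */
	return ((UINT32_C(1) << field_bits) - 1) & ~UINT32_C(3);
}
```

With 22 bits this gives 0x3FFFFC and with 30 bits 0x3FFFFFFC, i.e. 4 bytes less than the 1 << 30 that the patch sets for fill_max_bytes.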

Thanks,
Timur
