On 08/10/2025 15:08, Alex Deucher wrote:
Applied the series. Thanks!
Thank you and fingers crossed I did not fat finger anything that your CI wouldn't find!
I wish something similar could be done for amdgpu_ring_write too, but that one is waiting on Christian to, AFAIR, become idle enough to untangle some ptr masking issues.
Regards, Tvrtko
On Thu, Sep 11, 2025 at 7:42 AM Tvrtko Ursulin <[email protected]> wrote:In short, this series mostly does a lot of replacing of this pattern: ib->ptr[ib->length_dw++] = SDMA_PKT_HEADER_OP(SDMA_OP_WRITE) | SDMA_PKT_HEADER_SUB_OP(SDMA_SUBOP_WRITE_LINEAR); ib->ptr[ib->length_dw++] = lower_32_bits(pe); ib->ptr[ib->length_dw++] = upper_32_bits(pe); ib->ptr[ib->length_dw++] = ndw - 1; for (; ndw > 0; ndw -= 2) { ib->ptr[ib->length_dw++] = lower_32_bits(value); ib->ptr[ib->length_dw++] = upper_32_bits(value); value += incr; } With this one: u32 *ptr = &ib->ptr[ib->length_dw]; *ptr++ = SDMA_PKT_HEADER_OP(SDMA_OP_WRITE) | SDMA_PKT_HEADER_SUB_OP(SDMA_SUBOP_WRITE_LINEAR); *ptr++ = lower_32_bits(pe); *ptr++ = upper_32_bits(pe); *ptr++ = ndw - 1; for (; ndw > 0; ndw -= 2) { *ptr++ = lower_32_bits(value); *ptr++ = upper_32_bits(value); value += incr; } ib->length_dw = ptr - ib->ptr; Latter avoids register reloads and length updates on every dword written, and on the overall makes the IB emission much more compact: add/remove: 0/1 grow/shrink: 10/58 up/down: 260/-6598 (-6338) Function old new delta sdma_v7_0_ring_pad_ib 99 127 +28 sdma_v6_0_ring_pad_ib 99 127 +28 sdma_v5_2_ring_pad_ib 99 127 +28 sdma_v5_0_ring_pad_ib 99 127 +28 sdma_v4_4_2_ring_pad_ib 99 127 +28 sdma_v4_0_ring_pad_ib 99 127 +28 sdma_v3_0_ring_pad_ib 99 127 +28 sdma_v2_4_ring_pad_ib 99 127 +28 cik_sdma_ring_pad_ib 99 127 +28 si_dma_ring_pad_ib 36 44 +8 amdgpu_ring_generic_pad_ib 56 52 -4 si_dma_emit_fill_buffer 108 71 -37 si_dma_vm_write_pte 158 115 -43 amdgpu_vcn_dec_sw_send_msg 810 767 -43 si_dma_vm_copy_pte 137 87 -50 si_dma_emit_copy_buffer 134 84 -50 sdma_v3_0_vm_write_pte 163 102 -61 sdma_v2_4_vm_write_pte 163 102 -61 cik_sdma_vm_write_pte 163 102 -61 sdma_v7_0_vm_write_pte 168 105 -63 sdma_v7_0_emit_fill_buffer 119 56 -63 sdma_v6_0_vm_write_pte 168 105 -63 sdma_v6_0_emit_fill_buffer 119 56 -63 sdma_v5_2_vm_write_pte 168 105 -63 sdma_v5_2_emit_fill_buffer 119 56 -63 sdma_v5_0_vm_write_pte 168 105 -63 sdma_v5_0_emit_fill_buffer 119 56 -63 sdma_v4_4_2_vm_write_pte 168 105 -63 sdma_v4_4_2_emit_fill_buffer 119 56 -63 sdma_v4_0_vm_write_pte 168 105 -63 sdma_v4_0_emit_fill_buffer 119 56 -63 sdma_v3_0_emit_fill_buffer 116 53 -63 sdma_v2_4_emit_fill_buffer 116 53 -63 cik_sdma_emit_fill_buffer 116 53 -63 sdma_v6_0_emit_copy_buffer 169 76 -93 sdma_v5_2_emit_copy_buffer 169 76 -93 sdma_v5_0_emit_copy_buffer 169 76 -93 sdma_v4_4_2_emit_copy_buffer 169 76 -93 sdma_v4_0_emit_copy_buffer 169 76 -93 sdma_v3_0_vm_copy_pte 158 64 -94 sdma_v3_0_emit_copy_buffer 155 61 -94 sdma_v2_4_vm_copy_pte 158 64 -94 sdma_v2_4_emit_copy_buffer 155 61 -94 cik_sdma_vm_copy_pte 158 64 -94 cik_sdma_emit_copy_buffer 155 61 -94 sdma_v6_0_vm_copy_pte 163 68 -95 sdma_v5_2_vm_copy_pte 163 68 -95 sdma_v5_0_vm_copy_pte 163 68 -95 sdma_v4_4_2_vm_copy_pte 163 68 -95 sdma_v4_0_vm_copy_pte 163 68 -95 sdma_v7_0_vm_copy_pte 183 75 -108 sdma_v7_0_emit_copy_buffer 317 202 -115 si_dma_vm_set_pte_pde 338 214 -124 amdgpu_vce_get_destroy_msg 784 652 -132 sdma_v7_0_vm_set_pte_pde 218 72 -146 sdma_v6_0_vm_set_pte_pde 218 72 -146 sdma_v5_2_vm_set_pte_pde 218 72 -146 sdma_v5_0_vm_set_pte_pde 218 72 -146 sdma_v4_4_2_vm_set_pte_pde 218 72 -146 sdma_v4_0_vm_set_pte_pde 218 72 -146 sdma_v3_0_vm_set_pte_pde 215 69 -146 sdma_v2_4_vm_set_pte_pde 215 69 -146 cik_sdma_vm_set_pte_pde 215 69 -146 amdgpu_vcn_unified_ring_ib_header 172 - -172 gfx_v9_4_2_run_shader.constprop 739 532 -207 uvd_v6_0_enc_ring_test_ib 1464 1162 -302 uvd_v7_0_enc_ring_test_ib 1464 1138 -326 amdgpu_vce_ring_test_ib 1357 936 -421 amdgpu_vcn_enc_ring_test_ib 2042 1524 -518 Total: Before=9262623, After=9256285, chg -0.07% * Notice how _pad_ib functions have grown. I think the compiler used the opportunity to unroll the loops. ** Series was only smoke tested on the Steam Deck. Tvrtko Ursulin (16): drm/amdgpu: Use memset32 for IB padding drm/amdgpu: More compact VCE IB emission drm/amdgpu: More compact VCN IB emission drm/amdgpu: More compact UVD 6 IB emission drm/amdgpu: More compact UVD 7 IB emission drm/amdgpu: More compact SI SDMA emission drm/amdgpu: More compact CIK SDMA IB emission drm/amdgpu: More compact GFX 9.4.2 IB emission drm/amdgpu: More compact SDMA 2.4 IB emission drm/amdgpu: More compact SDMA 3.0 IB emission drm/amdgpu: More compact SDMA 4.0 IB emission drm/amdgpu: More compact SDMA 4.4.2 IB emission drm/amdgpu: More compact SDMA 5.0 IB emission drm/amdgpu: More compact SDMA 5.2 IB emission drm/amdgpu: More compact SDMA 6.0 IB emission drm/amdgpu: More compact SDMA 7.0 IB emission drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 12 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c | 90 +++++++++-------- drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 101 ++++++++++--------- drivers/gpu/drm/amd/amdgpu/cik_sdma.c | 105 ++++++++++++-------- drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c | 46 ++++----- drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c | 108 ++++++++++++-------- drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c | 108 ++++++++++++-------- drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 109 ++++++++++++--------- drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 108 ++++++++++++-------- drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c | 106 ++++++++++++-------- drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 110 ++++++++++++--------- drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 110 ++++++++++++--------- drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 119 +++++++++++++---------- drivers/gpu/drm/amd/amdgpu/si_dma.c | 84 +++++++++------- drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c | 66 +++++++------ drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c | 66 +++++++------ 16 files changed, 849 insertions(+), 599 deletions(-) -- 2.48.0
