Wed, 17 Dec 2025 at 00:16, Georg-Johann Lay <[email protected]>:
>
> When a shift is performed by a shift-loop, then there are cases
> where the runtime can be improved. For example, uint32_t R22 >> 5
> is currently
>
>          ldi scratch, 5
>      1:  lsr r25
>          ror r24
>          ror r23
>          ror r22
>          dec scratch
>          brne 1b
>
> but can be done as:
>
>          andi r22,-32   ; Set lower 5 bits to 0.
>          ori r22,16     ; Set bit 4 to 1.
>          ;; Now r22 = 0b***10000
>      1:  lsr r25
>          ror r24
>          ror r23
>          ror r22
>          brcc 1b        ; Carry will be 0, 0, 0, 0, 1.
>
> This is count-1 cycles faster, where count is the shift offset.
> In the example above that's 4 cycles.
>
> Part 1 of the patch refactors the shift output function so
> it gets a shift rtx_code instead of an asm template.
>
> Part 2 implements the optimization itself.
>
> This is for trunk and passes without new regressions.
> Ok to apply?

Ok. Please apply.

Denis.
