https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125893

--- Comment #2 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by H.J. Lu <[email protected]>:

https://gcc.gnu.org/g:1f774d902f1ec9ae6a487e00ba49514d3b37057f

commit r17-1766-g1f774d902f1ec9ae6a487e00ba49514d3b37057f
Author: H.J. Lu <[email protected]>
Date:   Sun Jun 21 07:13:38 2026 +0800

    x86: Use previous scratch register in LCP stall peepholes

    Since length-changing prefix (LCP) stall peepholes are applied after
    register allocation, each peephole may be given a different scratch
    register.  For input:

    extern void bar (void);

    void
    foo (short *dst)
    {
      dst[0] = 3;
      asm volatile ("" : : : "memory");
      dst[2] = 3;
      bar ();
      dst[1] = 3;
      asm volatile ("" : : : "memory");
      dst[4] = 3;
    }

    with LCP stall peepholes, GCC generates:

            movl    $3, %eax
            pushq   %rbx
            movq    %rdi, %rbx
            movw    %ax, (%rdi)
            movl    $3, %edx
            movw    %dx, 4(%rdi)
            call    bar
            movl    $3, %ecx
            movw    %cx, 2(%rbx)
            movl    $3, %esi
            movw    %si, 8(%rbx)
            popq    %rbx

    which uses 4 different scratch registers.  Without LCP stall peepholes,
    GCC generates:

            pushq   %rbx
            movq    %rdi, %rbx
            movw    $3, (%rdi)
            movw    $3, 4(%rdi)
            call    bar
            movw    $3, 2(%rbx)
            movw    $3, 8(%rbx)
            popq    %rbx

    Add ix86_expand_lcp_stall_peephole to generate LCP stall peepholes with
    the previous scratch register:

    1. Scan backward for the previous scratch register definition with
    the same immediate operand in the same basic block.
    2. The previous scratch register is unusable if any instruction sets it
    between that definition and the current instruction.
    3. If a usable previous scratch register is found, ignore the allocated
    scratch register and use the previous scratch register.  Otherwise, use
    the allocated scratch register.

    so that the same scratch register can be reused if possible:

            movl    $3, %eax
            pushq   %rbx
            movq    %rdi, %rbx
            movw    %ax, (%rdi)
            movw    %ax, 4(%rdi)
            call    bar
            movl    $3, %ecx
            movw    %cx, 2(%rbx)
            movw    %cx, 8(%rbx)
            popq    %rbx
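
    The three-step scan above can be illustrated with a small stand-alone C
    model.  This is only a sketch under stated assumptions, not the code
    added to i386-expand.cc: the insn representation, the register
    numbering, the names pick_scratch and insn_sets_reg, and the rule that a
    call sets every register except %rbx are all hypothetical
    simplifications.  It also assumes that only the nearest earlier
    definition of the immediate is considered.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy register numbers; hypothetical, not GCC's hard register numbers.  */
enum { REG_AX, REG_DX, REG_CX, REG_SI, REG_BX };

/* Toy instruction kinds; GCC itself works on RTL, not on this struct.  */
enum kind { LOAD_IMM, STORE_REG, CALL };

struct insn
{
  enum kind kind;
  int reg;   /* Register written (LOAD_IMM) or read (STORE_REG).  */
  long imm;  /* Immediate loaded by LOAD_IMM; unused otherwise.  */
};

/* Return true if INSN may change REG.  Assumption: a call sets every
   register in this model except REG_BX.  */
static bool
insn_sets_reg (const struct insn *insn, int reg)
{
  switch (insn->kind)
    {
    case LOAD_IMM:
      return insn->reg == reg;
    case CALL:
      return reg != REG_BX;
    default:
      return false;
    }
}

/* Choose the scratch register for a store of IMM that is about to be
   emitted at index CUR of basic block BB.  ALLOCATED is the register the
   register allocator handed to this peephole.  */
int
pick_scratch (const struct insn *bb, size_t cur, long imm, int allocated)
{
  /* Step 1: scan backward for the nearest load of the same immediate.  */
  for (size_t i = cur; i-- > 0;)
    {
      if (bb[i].kind != LOAD_IMM || bb[i].imm != imm)
        continue;

      int prev = bb[i].reg;

      /* Step 2: PREV is unusable if anything in between sets it.  */
      for (size_t j = i + 1; j < cur; j++)
        if (insn_sets_reg (&bb[j], prev))
          return allocated;

      /* Step 3: reuse PREV and ignore ALLOCATED.  */
      return prev;
    }

  /* No earlier definition found: keep the allocated register.  */
  return allocated;
}

/* The instruction stream from the example above, reduced to this model.  */
const struct insn example_bb[] = {
  { LOAD_IMM, REG_AX, 3 },   /* movl $3, %eax */
  { STORE_REG, REG_AX, 0 },  /* movw %ax, (%rdi) */
  { STORE_REG, REG_AX, 0 },  /* movw %ax, 4(%rdi) */
  { CALL, -1, 0 },           /* call bar */
  { LOAD_IMM, REG_CX, 3 },   /* movl $3, %ecx */
  { STORE_REG, REG_CX, 0 },  /* movw %cx, 2(%rbx) */
};
```

    In this model, pick_scratch (example_bb, 2, 3, REG_DX) returns REG_AX.
    pick_scratch (example_bb, 4, 3, REG_CX) returns REG_CX because the call
    sets %eax.  pick_scratch (example_bb, 6, 3, REG_SI) returns REG_CX.
    These match the registers in the assembly above.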

    I backported this patch to GCC 16:

    1. When bootstrapping GCC 16 with only C and C++ enabled, this optimization
    triggers 54 times.  No regressions.
    2. When building glibc 2.44, this optimization triggers 33 times.  No
    regressions.
    3. When building Linux kernel 7.1.1, this optimization triggers 2099 times.
    Kernel boots correctly.

    Tested on Linux/x86-64 and Linux/i686.

    gcc/

            PR target/125893
            * config/i386/i386-expand.cc (ix86_expand_lcp_stall_peephole): New.
            * config/i386/i386-protos.h (ix86_expand_lcp_stall_peephole):
            Likewise.
            * config/i386/i386.md (TARGET_LCP_STALL peepholes): Call
            ix86_expand_lcp_stall_peephole.

    gcc/testsuite/

            PR target/125893
            * gcc.target/i386/pr125893-1.c: New test.
            * gcc.target/i386/pr125893-2.c: Likewise.
            * gcc.target/i386/pr125893-3.c: Likewise.
            * gcc.target/i386/pr125893-4.c: Likewise.
            * gcc.target/i386/pr125893-5.c: Likewise.
            * gcc.target/i386/pr125893-6.c: Likewise.
            * gcc.target/i386/pr125893-7.c: Likewise.
            * gcc.target/i386/pr125893-8.c: Likewise.
            * gcc.target/i386/pr125893-9.c: Likewise.
            * gcc.target/i386/pr125893-10.c: Likewise.
            * gcc.target/i386/pr125893-11.c: Likewise.

    Signed-off-by: H.J. Lu <[email protected]>
