https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122751

            Bug ID: 122751
           Summary: missed fusing of sign-extend and MLA
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following example

int foo (char *buf, int len) {

    int x;
    for (int i =0, y = 0; i < len; i++, y = i * 10) {
        x += (int) y * (buf[i] - '0');
    }
    return x;
}

compiled with -march=armv8-a+sve generates

.L3:
        ld1b    z28.s, p7/z, [x0, x2]
        add     x2, x2, x3
        sub     z28.h, z28.h, #48
        sxth    z28.s, p6/m, z28.s
        mla     z30.s, p7/m, z28.s, z29.s
        incw    z29.s, all, mul #10
        whilelo p7.s, w2, w1
        b.any   .L3
        uaddv   d31, p6, z30.s
        fmov    w0, s31
        ret

However, sign-extending the low 16 bits of every 32-bit lane is the same as
sign-extending every even 16-bit element of the vector.

As such, the sxth + mla above can be merged into smlalb in RTL.
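
The equivalence can be checked with a small scalar model. This is only a
sketch under stated assumptions: the vector length (LANES), the helper names
and the test inputs are hypothetical, and the multiplier y is assumed to fit
in a signed 16-bit value, since smlalb reads only the bottom 16 bits of each
lane of its multiplier.

```c
#include <stdint.h>

enum { LANES = 4 };  /* hypothetical vector length: four 32-bit lanes */

/* Sign-extend the low 16 bits of a 32-bit value, as sxth does per .s lane. */
static int32_t sext16(uint32_t v)
{
    return (int32_t)((v & 0xffffu) ^ 0x8000u) - 0x8000;
}

/* Model of "sxth z.s" followed by "mla z.s" on whole 32-bit lanes. */
static void sxth_mla(int32_t acc[LANES], const uint32_t a[LANES],
                     const int32_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        acc[i] += sext16(a[i]) * b[i];
}

/* Model of smlalb: multiply the even (bottom) 16-bit elements of both
   operands, widened to 32 bits, and accumulate. */
static void smlalb_model(int32_t acc[LANES], const uint32_t a[LANES],
                         const int32_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        acc[i] += sext16(a[i]) * sext16((uint32_t)b[i]);
}

/* Returns 1 if both models agree on inputs shaped like the loop above. */
int fusion_matches(void)
{
    const unsigned char buf[LANES] = { '0', '7', 0, 255 };
    const int32_t y[LANES] = { 0, 10, 20, 30 };  /* assumed to fit in 16 bits */
    uint32_t a[LANES];
    int32_t acc1[LANES] = { 0 }, acc2[LANES] = { 0 };

    for (int i = 0; i < LANES; i++) {
        /* ld1b zero-extends each byte into a 32-bit lane; "sub z.h" then
           subtracts 48 from both 16-bit halves, so the top half is 0xffd0. */
        uint32_t lo = ((uint32_t)buf[i] - 48u) & 0xffffu;
        uint32_t hi = (0u - 48u) & 0xffffu;
        a[i] = lo | (hi << 16);
    }

    sxth_mla(acc1, a, y);
    smlalb_model(acc2, a, y);

    for (int i = 0; i < LANES; i++)
        if (acc1[i] != acc2[i])
            return 0;
    return 1;
}
```

In this model the garbage left in the top half of each lane by the 16-bit
subtract is ignored by both sequences, which is why the explicit sxth is
redundant once the multiply only consumes the bottom elements.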


i.e. expected code is

.L3:
        ld1b    z28.s, p7/z, [x0, x2]
        add     x2, x2, x3
        sub     z28.h, z28.h, #48
        smlalb  z30.s, p7/m, z28.h, z29.h
        incw    z29.s, all, mul #10
        whilelo p7.s, w2, w1
        b.any   .L3
        uaddv   d31, p6, z30.s
        fmov    w0, s31
        ret

This becomes even more impactful when the loop is unrolled:

int foo (char *buf, int len) {

    int x;
#pragma GCC unroll 8
    for (int i =0, y = 0; i < len; i++, y = i * 10) {
        x += (int) y * (buf[i] - '0');
    }
    return x;
}


We generate a mess, because the predicates also need to be unpacked:

.L3:
        ld1b    z27.h, p7/z, [x0, x2]
        punpklo p5.h, p7.b
        sub     z27.h, z27.h, #48
        punpkhi p6.h, p7.b
        sunpklo z0.s, z27.h
        mov     z26.d, z29.d
        mla     z30.s, p5/m, z0.s, z29.s
        add     x2, x2, x3
        sunpkhi z27.s, z27.h
        incw    z26.s, all, mul #10
        whilelo p7.h, w2, w1
        add     z29.s, z29.s, z28.s
        mla     z30.s, p6/m, z27.s, z26.s
        b.any   .L3
        uaddv   d31, p4, z30.s
        fmov    w0

This should have just generated smlalb + smlalt.

We can also do that in RTL, though eventually we should teach the vectorizer
about top/bottom arithmetic.
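
As a rough illustration of why smlalb + smlalt could replace the unpack
sequence in the unrolled loop, the scalar model below compares the two
groupings. It is a sketch only: ELEMS and the function names are hypothetical,
and it assumes both operands are available as 16-bit elements, which is not
how the multiplier is held in the code GCC currently generates. Because the
accumulator is only consumed by the final reduction, pairing elements as
low/high halves or as even/odd elements gives the same sum.

```c
#include <stdint.h>

enum { ELEMS = 8 };  /* hypothetical: eight 16-bit elements per vector */

/* Model of sunpklo/sunpkhi + two mla, followed by the final reduction. */
int32_t unpack_mla_sum(const int16_t a[ELEMS], const int16_t b[ELEMS])
{
    int32_t acc[ELEMS / 2] = { 0 };
    int32_t sum = 0;

    for (int i = 0; i < ELEMS / 2; i++) {
        /* low half of the vector, widened */
        acc[i] += (int32_t)a[i] * b[i];
        /* high half of the vector, widened */
        acc[i] += (int32_t)a[i + ELEMS / 2] * b[i + ELEMS / 2];
    }
    for (int i = 0; i < ELEMS / 2; i++)
        sum += acc[i];
    return sum;
}

/* Model of smlalb + smlalt, followed by the same reduction. */
int32_t top_bottom_sum(const int16_t a[ELEMS], const int16_t b[ELEMS])
{
    int32_t acc[ELEMS / 2] = { 0 };
    int32_t sum = 0;

    for (int i = 0; i < ELEMS / 2; i++) {
        /* smlalb: even (bottom) elements */
        acc[i] += (int32_t)a[2 * i] * b[2 * i];
        /* smlalt: odd (top) elements */
        acc[i] += (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }
    for (int i = 0; i < ELEMS / 2; i++)
        sum += acc[i];
    return sum;
}
```

The individual accumulator lanes differ between the two versions; only the
reduced result matches, so the transformation relies on the reduction being
insensitive to lane order.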
