[Bug target/96933] New: inefficient code for char/short vec CTOR

linkw at gcc dot gnu.org Fri, 04 Sep 2020 02:31:16 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933


            Bug ID: 96933
           Summary: inefficient code for char/short vec CTOR
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

While investigating the vectorization cost for vec_construct, I found that the
code generated for vector construction is inefficient when DIRECT_MOVE support
is available.

The test case looks like:

vector unsigned char test_char(unsigned char f1, unsigned char f2,
                               unsigned char f3, unsigned char f4,
                               unsigned char f5, unsigned char f6,
                               unsigned char f7, unsigned char f8,
                               unsigned char f9, unsigned char f10,
                               unsigned char f11, unsigned char f12,
                               unsigned char f13, unsigned char f14,
                               unsigned char f15, unsigned char f16) {

  vector unsigned char v = {f1, f2,  f3,  f4,  f5,  f6,  f7,  f8,
                            f9, f10, f11, f12, f13, f14, f15, f16};
  return v;
}
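
The summary also mentions short vectors, but only the char case is shown here.
A minimal sketch of what an analogous short test might look like, assuming
GCC's generic vector_size attribute in place of the AltiVec "vector" keyword so
that it also compiles on non-Power targets. The names v8hu and test_short are
hypothetical and do not come from this report:

```c
/* Hypothetical typedef: 8 x unsigned short in one 16-byte vector
   (GCC/Clang generic vector extension, not the AltiVec keyword). */
typedef unsigned short v8hu __attribute__((vector_size(16)));

/* Hypothetical short counterpart of test_char: build a vector from
   eight scalar arguments with a vector constructor. */
v8hu test_short(unsigned short f1, unsigned short f2,
                unsigned short f3, unsigned short f4,
                unsigned short f5, unsigned short f6,
                unsigned short f7, unsigned short f8) {
  v8hu v = {f1, f2, f3, f4, f5, f6, f7, f8};
  return v;
}
```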

The code currently generated with -mcpu=power9:

0000000000000000 <test_char>:
   0:   e8 ff a1 fb     std     r29,-24(r1)
   4:   f0 ff c1 fb     std     r30,-16(r1)
   8:   f8 ff e1 fb     std     r31,-8(r1)
   c:   60 00 a1 8b     lbz     r29,96(r1)
  10:   68 00 c1 8b     lbz     r30,104(r1)
  14:   70 00 e1 8b     lbz     r31,112(r1)
  18:   d1 ff 81 98     stb     r4,-47(r1)
  1c:   d2 ff a1 98     stb     r5,-46(r1)
  20:   78 00 81 89     lbz     r12,120(r1)
  24:   80 00 01 88     lbz     r0,128(r1)
  28:   88 00 61 89     lbz     r11,136(r1)
  2c:   90 00 81 88     lbz     r4,144(r1)
  30:   98 00 a1 88     lbz     r5,152(r1)
  34:   d0 ff 61 98     stb     r3,-48(r1)
  38:   d3 ff c1 98     stb     r6,-45(r1)
  3c:   d4 ff e1 98     stb     r7,-44(r1)
  40:   d8 ff a1 9b     stb     r29,-40(r1)
  44:   d5 ff 01 99     stb     r8,-43(r1)
  48:   d6 ff 21 99     stb     r9,-42(r1)
  4c:   d7 ff 41 99     stb     r10,-41(r1)
  50:   d9 ff c1 9b     stb     r30,-39(r1)
  54:   da ff e1 9b     stb     r31,-38(r1)
  58:   db ff 81 99     stb     r12,-37(r1)
  5c:   dc ff 01 98     stb     r0,-36(r1)
  60:   dd ff 61 99     stb     r11,-35(r1)
  64:   de ff 81 98     stb     r4,-34(r1)
  68:   df ff a1 98     stb     r5,-33(r1)
  6c:   e8 ff a1 eb     ld      r29,-24(r1)
  70:   f0 ff c1 eb     ld      r30,-16(r1)
  74:   f8 ff e1 eb     ld      r31,-8(r1)
  78:   d9 ff 41 f4     lxv     vs34,-48(r1)
  7c:   20 00 80 4e     blr

It could be more efficient with direct moves and vector merges, for example:

   0:   67 01 43 7c     mtvsrd  vs34,r3
   4:   68 00 61 80     lwz     r3,104(r1)
   8:   60 00 61 81     lwz     r11,96(r1)
   c:   67 01 64 7c     mtvsrd  vs35,r4
  10:   70 00 81 80     lwz     r4,112(r1)
  14:   67 01 03 7d     mtvsrd  vs40,r3
  18:   78 00 61 80     lwz     r3,120(r1)
  1c:   67 01 85 7c     mtvsrd  vs36,r5
  20:   67 01 a6 7c     mtvsrd  vs37,r6
  24:   67 01 07 7c     mtvsrd  vs32,r7
  28:   67 01 28 7c     mtvsrd  vs33,r8
  2c:   67 01 24 7d     mtvsrd  vs41,r4
  30:   80 00 81 80     lwz     r4,128(r1)
  34:   0c 10 43 10     vmrghb  v2,v3,v2
  38:   67 01 63 7c     mtvsrd  vs35,r3
  3c:   88 00 61 80     lwz     r3,136(r1)
  40:   67 01 eb 7c     mtvsrd  vs39,r11
  44:   0c 20 85 10     vmrghb  v4,v5,v4
  48:   67 01 a4 7c     mtvsrd  vs37,r4
  4c:   90 00 81 80     lwz     r4,144(r1)
  50:   0c 00 01 10     vmrghb  v0,v1,v0
  54:   67 01 23 7c     mtvsrd  vs33,r3
  58:   98 00 61 80     lwz     r3,152(r1)
  5c:   67 01 c9 7c     mtvsrd  vs38,r9
  60:   0c 38 e8 10     vmrghb  v7,v8,v7
  64:   67 01 04 7d     mtvsrd  vs40,r4
  68:   0c 48 63 10     vmrghb  v3,v3,v9
  6c:   67 01 23 7d     mtvsrd  vs41,r3
  70:   0c 28 a1 10     vmrghb  v5,v1,v5
  74:   67 01 2a 7c     mtvsrd  vs33,r10
  78:   0c 40 09 11     vmrghb  v8,v9,v8
  7c:   0c 30 21 10     vmrghb  v1,v1,v6
  80:   4c 11 44 10     vmrglh  v2,v4,v2
  84:   4c 39 63 10     vmrglh  v3,v3,v7
  88:   4c 29 88 10     vmrglh  v4,v8,v5
  8c:   4c 01 a1 10     vmrglh  v5,v1,v0
  90:   8c 19 64 10     vmrglw  v3,v4,v3
  94:   8c 11 45 10     vmrglw  v2,v5,v2
  98:   57 13 43 f0     xxmrgld vs34,vs35,vs34
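
The sequence above forms a merge tree: 16 direct moves (mtvsrd), then 8 byte
merges, 4 halfword merges, 2 word merges and 1 doubleword merge. A rough scalar
model of that tree in portable C, which ignores the actual lane placement and
endianness of the Power instructions. The names vec16, move_to_lane0, merge and
build_vec16 are hypothetical and used only for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a 16-byte vector register. */
typedef struct { uint8_t b[16]; } vec16;

/* Models a direct move: the scalar lands in lane 0; the remaining
   lanes are don't-care and are zeroed here for determinism. */
static vec16 move_to_lane0(uint8_t x) {
  vec16 v;
  memset(v.b, 0, sizeof v.b);
  v.b[0] = x;
  return v;
}

/* Models one merge step: concatenate the first `width` bytes of a
   with the first `width` bytes of b. */
static vec16 merge(vec16 a, vec16 b, int width) {
  vec16 r;
  memset(r.b, 0, sizeof r.b);
  memcpy(r.b, a.b, (size_t)width);
  memcpy(r.b + width, b.b, (size_t)width);
  return r;
}

/* Builds the full vector with 16 moves and 8 + 4 + 2 + 1 = 15 merges. */
vec16 build_vec16(const uint8_t f[16]) {
  vec16 v[16];
  for (int i = 0; i < 16; i++)
    v[i] = move_to_lane0(f[i]);
  /* width doubles each level: byte, halfword, word, doubleword. */
  for (int width = 1, n = 16; n > 1; width *= 2, n /= 2)
    for (int i = 0; i < n / 2; i++)
      v[i] = merge(v[2 * i], v[2 * i + 1], width);
  return v[0];
}
```

Within each level the merges are independent of one another, so only the four
levels of the tree are serialized, and no values are stored to the stack and
reloaded.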
