https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933
Bug ID: 96933
Summary: inefficient code for char/short vec CTOR
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: linkw at gcc dot gnu.org
Target Milestone: ---

While investigating the vectorization cost for vec_construct, I found that
the code generated for vector construction is inefficient when DIRECT_MOVE
is supported.

The test case looks like:

    vector unsigned char test_char(unsigned char f1, unsigned char f2,
                                   unsigned char f3, unsigned char f4,
                                   unsigned char f5, unsigned char f6,
                                   unsigned char f7, unsigned char f8,
                                   unsigned char f9, unsigned char f10,
                                   unsigned char f11, unsigned char f12,
                                   unsigned char f13, unsigned char f14,
                                   unsigned char f15, unsigned char f16)
    {
      vector unsigned char v = {f1, f2, f3, f4, f5, f6, f7, f8,
                                f9, f10, f11, f12, f13, f14, f15, f16};
      return v;
    }

The code currently generated with -mcpu=power9:

    0000000000000000 <test_char>:
       0:   e8 ff a1 fb     std     r29,-24(r1)
       4:   f0 ff c1 fb     std     r30,-16(r1)
       8:   f8 ff e1 fb     std     r31,-8(r1)
       c:   60 00 a1 8b     lbz     r29,96(r1)
      10:   68 00 c1 8b     lbz     r30,104(r1)
      14:   70 00 e1 8b     lbz     r31,112(r1)
      18:   d1 ff 81 98     stb     r4,-47(r1)
      1c:   d2 ff a1 98     stb     r5,-46(r1)
      20:   78 00 81 89     lbz     r12,120(r1)
      24:   80 00 01 88     lbz     r0,128(r1)
      28:   88 00 61 89     lbz     r11,136(r1)
      2c:   90 00 81 88     lbz     r4,144(r1)
      30:   98 00 a1 88     lbz     r5,152(r1)
      34:   d0 ff 61 98     stb     r3,-48(r1)
      38:   d3 ff c1 98     stb     r6,-45(r1)
      3c:   d4 ff e1 98     stb     r7,-44(r1)
      40:   d8 ff a1 9b     stb     r29,-40(r1)
      44:   d5 ff 01 99     stb     r8,-43(r1)
      48:   d6 ff 21 99     stb     r9,-42(r1)
      4c:   d7 ff 41 99     stb     r10,-41(r1)
      50:   d9 ff c1 9b     stb     r30,-39(r1)
      54:   da ff e1 9b     stb     r31,-38(r1)
      58:   db ff 81 99     stb     r12,-37(r1)
      5c:   dc ff 01 98     stb     r0,-36(r1)
      60:   dd ff 61 99     stb     r11,-35(r1)
      64:   de ff 81 98     stb     r4,-34(r1)
      68:   df ff a1 98     stb     r5,-33(r1)
      6c:   e8 ff a1 eb     ld      r29,-24(r1)
      70:   f0 ff c1 eb     ld      r30,-16(r1)
      74:   f8 ff e1 eb     ld      r31,-8(r1)
      78:   d9 ff 41 f4     lxv     vs34,-48(r1)
      7c:   20 00 80 4e     blr

But it can be more efficient with direct move
and vector merge, such as:

       0:   67 01 43 7c     mtvsrd  vs34,r3
       4:   68 00 61 80     lwz     r3,104(r1)
       8:   60 00 61 81     lwz     r11,96(r1)
       c:   67 01 64 7c     mtvsrd  vs35,r4
      10:   70 00 81 80     lwz     r4,112(r1)
      14:   67 01 03 7d     mtvsrd  vs40,r3
      18:   78 00 61 80     lwz     r3,120(r1)
      1c:   67 01 85 7c     mtvsrd  vs36,r5
      20:   67 01 a6 7c     mtvsrd  vs37,r6
      24:   67 01 07 7c     mtvsrd  vs32,r7
      28:   67 01 28 7c     mtvsrd  vs33,r8
      2c:   67 01 24 7d     mtvsrd  vs41,r4
      30:   80 00 81 80     lwz     r4,128(r1)
      34:   0c 10 43 10     vmrghb  v2,v3,v2
      38:   67 01 63 7c     mtvsrd  vs35,r3
      3c:   88 00 61 80     lwz     r3,136(r1)
      40:   67 01 eb 7c     mtvsrd  vs39,r11
      44:   0c 20 85 10     vmrghb  v4,v5,v4
      48:   67 01 a4 7c     mtvsrd  vs37,r4
      4c:   90 00 81 80     lwz     r4,144(r1)
      50:   0c 00 01 10     vmrghb  v0,v1,v0
      54:   67 01 23 7c     mtvsrd  vs33,r3
      58:   98 00 61 80     lwz     r3,152(r1)
      5c:   67 01 c9 7c     mtvsrd  vs38,r9
      60:   0c 38 e8 10     vmrghb  v7,v8,v7
      64:   67 01 04 7d     mtvsrd  vs40,r4
      68:   0c 48 63 10     vmrghb  v3,v3,v9
      6c:   67 01 23 7d     mtvsrd  vs41,r3
      70:   0c 28 a1 10     vmrghb  v5,v1,v5
      74:   67 01 2a 7c     mtvsrd  vs33,r10
      78:   0c 40 09 11     vmrghb  v8,v9,v8
      7c:   0c 30 21 10     vmrghb  v1,v1,v6
      80:   4c 11 44 10     vmrglh  v2,v4,v2
      84:   4c 39 63 10     vmrglh  v3,v3,v7
      88:   4c 29 88 10     vmrglh  v4,v8,v5
      8c:   4c 01 a1 10     vmrglh  v5,v1,v0
      90:   8c 19 64 10     vmrglw  v3,v4,v3
      94:   8c 11 45 10     vmrglw  v2,v5,v2
      98:   57 13 43 f0     xxmrgld vs34,vs35,vs34
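
For readers who want the shape of the direct-move-plus-merge sequence without tracing registers, here is a rough, portable C model of it. It is only a sketch under simplifying assumptions: `vreg`, `merge_pair` and `build_by_merging` are hypothetical names introduced for illustration, and the model ignores the real lane placement and endianness rules of mtvsrd, vmrghb, vmrglh, vmrglw and xxmrgld. It only shows that 16 scalars can be combined by a four-level pairwise tree of 8 + 4 + 2 + 1 = 15 merges, matching the merge counts in the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NELTS 16

/* Simplified stand-in for a vector register (assumption: only the first
   `w` bytes are meaningful at a given level of the merge tree). */
typedef struct { unsigned char b[NELTS]; } vreg;

/* Hypothetical helper modelling one merge instruction: the w valid bytes
   of `a` are placed first, followed by the w valid bytes of `c`. */
static vreg merge_pair(vreg a, vreg c, size_t w)
{
    vreg r;
    memset(r.b, 0, sizeof r.b);
    memcpy(r.b, a.b, w);
    memcpy(r.b + w, c.b, w);
    return r;
}

/* Build a 16-byte vector from 16 scalars; returns the number of merges. */
static int build_by_merging(const unsigned char in[NELTS],
                            unsigned char out[NELTS])
{
    vreg regs[NELTS];
    size_t nregs = NELTS;
    int merges = 0;

    /* One "direct move" per scalar: each value lands in its own register. */
    for (size_t i = 0; i < NELTS; i++) {
        memset(regs[i].b, 0, sizeof regs[i].b);
        regs[i].b[0] = in[i];
    }

    /* Each level pairs adjacent registers and doubles the valid width:
       bytes -> halfwords -> words -> doublewords -> full vector. */
    for (size_t w = 1; nregs > 1; w *= 2, nregs /= 2) {
        for (size_t i = 0; i < nregs / 2; i++) {
            regs[i] = merge_pair(regs[2 * i], regs[2 * i + 1], w);
            merges++;
        }
    }

    assert(nregs == 1);
    memcpy(out, regs[0].b, NELTS);
    return merges;
}
```

Calling `build_by_merging` on any 16 input bytes should yield the same bytes in order and a merge count of 15. In the model, the merges within one level do not depend on each other.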