Hi,

On 2020/9/24 21:27, Richard Biener wrote:
> On Thu, Sep 24, 2020 at 10:21 AM xionghu luo <luo...@linux.ibm.com> wrote:
>
> I'll just comment that
>
>   xxperm 34,34,33
>   xxinsertw 34,0,12
>   xxperm 34,34,32
>
> doesn't look like a variable-position insert instruction but
> this is a variable whole-vector rotate plus an insert at index zero
> followed by a variable whole-vector rotate.  I'm not fluent in
> ppc assembly but
>
>   rlwinm 6,6,2,28,29
>   mtvsrwz 0,5
>   lvsr 1,0,6
>   lvsl 0,0,6
>
> possibly computes the shift masks for r33/r32?  though
> I do not see those registers mentioned...
Yes.  For V4SI:

  rlwinm 6,6,2,28,29    // r6 = idx * 4
  mtvsrwz 0,5           // vs0  <- r5 (0xfe)
  lvsr 1,0,6            // vs33 <- lvsr[r6]
  lvsl 0,0,6            // vs32 <- lvsl[r6]
  xxperm 34,34,33
  xxinsertw 34,0,12
  xxperm 34,34,32
  blr

Trace with input vs34 = 0x4000000030000000200000001 for each idx (idx*4
is the byte offset fed to lvsr/lvsl):

idx=0 (idx*4=0):   xxperm: 0x4000000030000000200000001
                   vs33: 0x101112131415161718191a1b1c1d1e1f
                   vs32: 0x102030405060708090a0b0c0d0e0f
idx=1 (idx*4=4):   xxperm: 0x1000000040000000300000002
                   vs33: 0xc0d0e0f101112131415161718191a1b
                   vs32: 0x405060708090a0b0c0d0e0f10111213
idx=2 (idx*4=8):   xxperm: 0x2000000010000000400000003
                   vs33: 0x8090a0b0c0d0e0f1011121314151617
                   vs32: 0x8090a0b0c0d0e0f1011121314151617
idx=3 (idx*4=12):  xxperm: 0x3000000020000000100000004
                   vs33: 0x405060708090a0b0c0d0e0f10111213
                   vs32: 0xc0d0e0f101112131415161718191a1b

Final vs34 after the second xxperm:

  idx=0: 0x40000000300000002000000fe
  idx=1: 0x400000003000000fe00000001
  idx=2: 0x4000000fe0000000200000001
  idx=3: 0xfe000000030000000200000001

"xxinsertw 34,0,12" always inserts the vs0[32:63] content into the fourth
word of the target vector, bits[96:127].  The second xxperm then rotates
the modified vector back.  All of these are register-based operations;
as Segher replied, power9 supports only fixed-position inserts, so we
need this trick to support a variable index instead of generating a
short-store/wide-load sequence.

> This might be a generic viable expansion strategy btw,
> which is why I asked before whether the CPU supports
> inserts at a variable position ... the building blocks are
> already there with vec_set at constant zero position
> plus vec_perm_const for the rotates.
>
> But well, I did ask this question.  Multiple times.
>
> ppc does _not_ have a VSX instruction
> like xxinsertw r34, r8, r12 where r8 denotes
> the vector element (or byte position or whatever).
>
> So I don't think vec_set with a variable index is the
> best approach.
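For clarity, the effect of the three-instruction sequence can be modeled
in scalar C.  This is only an illustrative sketch (the function name and
the modulo-based rotation are mine, not the actual rtl expansion): the
first rotation moves element idx into the one fixed slot xxinsertw can
target, the insert happens there, and the inverse rotation restores the
element order.

```c
#include <assert.h>

#define VLEN 4

/* Scalar model of xxperm + xxinsertw + xxperm for V4SI.
   Slot 0 here plays the role of the fixed word position that
   "xxinsertw 34,0,12" writes.  */
static void
vec_set_var (unsigned v[VLEN], unsigned idx, unsigned val)
{
  unsigned rot[VLEN];
  unsigned i;

  /* First xxperm (lvsr mask): rot[i] = v[(i + idx) % VLEN],
     so rot[0] == v[idx].  */
  for (i = 0; i < VLEN; i++)
    rot[i] = v[(i + idx) % VLEN];

  /* xxinsertw: insert at the single fixed position.  */
  rot[0] = val;

  /* Second xxperm (lvsl mask): the inverse rotation.  */
  for (i = 0; i < VLEN; i++)
    v[(i + idx) % VLEN] = rot[i];
}
```

Running it with v = {1,2,3,4}, idx = 1, val = 0xfe reproduces the
idx=1 row of the trace above (final words {1, 0xfe, 3, 4}).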
> Xionghu - you said even without the patch the stack
> storage is eventually elided but
>
>   addi 9,1,-16
>   rldic 6,6,2,60
>   stxv 34,-16(1)
>   stwx 5,9,6
>   lxv 34,-16(1)
>
> still shows stack(?) store/load with a bad STLF penalty.

Sorry if I didn't describe this clearly and caused a misunderstanding.
I meant that if many instructions are inserted (tested with an inserted
loop) between "stwx 5,9,6" and "lxv 34,-16(1)", the store-hit-load
performance issue could be hidden, but this is not the solution we want.

I also changed your test as below and built it for x86; it seems to
generate inefficient code too.  Maybe what my patch handles is a
different usage from your pasted case?

#define N 32
typedef int T;
typedef T V __attribute__ ((vector_size (N)));

V
setg3 (V v, int idx, T val)
{
  v[idx & 31] = val;
  return v;
}

-O2 -S -mavx -march=znver2:

setg3:
        push    rbp
        and     edi, 31
        mov     rbp, rsp
        and     rsp, -32
        vmovdqa YMMWORD PTR [rsp-32], ymm0
        mov     DWORD PTR [rsp-32+rdi*4], esi
        vmovdqa ymm0, YMMWORD PTR [rsp-32]
        leave
        ret

While when idx is constant:

setg3:
        vpinsrd     xmm1, xmm0, esi, 3
        vinserti128 ymm0, ymm0, xmm1, 0x0
        ret

And ARM with -O2 -S -march=armv8.2-a+sve (N changed to 16):

setg3:
        sub     sp, sp, #16
        and     x0, x0, 15
        str     q0, [sp]
        str     w1, [sp, x0, lsl 2]
        ldr     q0, [sp]
        add     sp, sp, 16
        ret

While when idx is constant:

setg3:
        ins     v0.s[3], w1
        ret

Though I've no idea how to optimize this on x86 and ARM with vector
instructions to avoid the short store followed by a wide load on the
stack.

Thanks,
Xionghu