Hi,

On 2020/9/24 21:27, Richard Biener wrote:
> On Thu, Sep 24, 2020 at 10:21 AM xionghu luo <luo...@linux.ibm.com> wrote:
> 
> I'll just comment that
> 
>          xxperm 34,34,33
>          xxinsertw 34,0,12
>          xxperm 34,34,32
> 
> doesn't look like a variable-position insert instruction but
> this is a variable whole-vector rotate plus an insert at index zero
> followed by a variable whole-vector rotate.  I'm not fluent in
> ppc assembly but
> 
>          rlwinm 6,6,2,28,29
>          mtvsrwz 0,5
>          lvsr 1,0,6
>          lvsl 0,0,6
> 
> possibly computes the shift masks for r33/r32?  though
> I do not see those registers mentioned...

For V4SI:
       rlwinm 6,6,2,28,29      // r6*4
       mtvsrwz 0,5             // vs0   <- r5  (0xfe)
       lvsr 1,0,6              // vs33  <- lvsr[r6]
       lvsl 0,0,6              // vs32  <- lvsl[r6] 
       xxperm 34,34,33       
       xxinsertw 34,0,12
       xxperm 34,34,32
       blr


idx = idx * 4;

idx  idx*4  input vector                  result of first xxperm
0    0      0x4000000030000000200000001   xxperm:0x4000000030000000200000001
            vs33:0x101112131415161718191a1b1c1d1e1f  vs32:0x102030405060708090a0b0c0d0e0f
1    4      0x4000000030000000200000001   xxperm:0x1000000040000000300000002
            vs33:0xc0d0e0f101112131415161718191a1b   vs32:0x405060708090a0b0c0d0e0f10111213
2    8      0x4000000030000000200000001   xxperm:0x2000000010000000400000003
            vs33:0x8090a0b0c0d0e0f1011121314151617   vs32:0x8090a0b0c0d0e0f1011121314151617
3    12     0x4000000030000000200000001   xxperm:0x3000000020000000100000004
            vs33:0x405060708090a0b0c0d0e0f10111213   vs32:0xc0d0e0f101112131415161718191a1b

vs34:
 0x40000000300000002000000fe
 0x400000003000000fe00000001
 0x4000000fe0000000200000001
0xfe000000030000000200000001
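
The vs33/vs32 values above follow a simple pattern, which the sketch below tries to reproduce.  It is only a scalar model under my own assumptions, not what the hardware or the patch does internally: the helper names byte_offset, lvsl_bytes and lvsr_bytes are hypothetical, and the bytes are stored in the order they are printed in the table (leading zeros are dropped there).

```c
#include <stdint.h>

/* Hypothetical model of "rlwinm 6,6,2,28,29": idx * 4, keeping only
   the bits worth 4 and 8, so the result is 0, 4, 8 or 12.  */
unsigned byte_offset(unsigned idx)
{
    return (idx << 2) & 0xc;
}

/* Hypothetical model of the lvsl control vector for shift sh:
   bytes sh, sh+1, ..., sh+15.  */
void lvsl_bytes(unsigned sh, uint8_t out[16])
{
    for (unsigned i = 0; i < 16; i++)
        out[i] = (uint8_t)(sh + i);
}

/* Hypothetical model of the lvsr control vector for shift sh:
   bytes 16-sh, 17-sh, ..., 31-sh.  */
void lvsr_bytes(unsigned sh, uint8_t out[16])
{
    for (unsigned i = 0; i < 16; i++)
        out[i] = (uint8_t)(16 - sh + i);
}
```

For idx = 1 this gives sh = 4, lvsr bytes 0x0c..0x1b and lvsl bytes 0x04..0x13, which matches the second row of the table.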


"xxinsertw 34,0,12" always inserts the contents of vs0[32:63] into the fourth word
of the target vector, bits[96:127].  The second xxperm then rotates the modified
vector back.
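
To make the rotate / fixed-position insert / rotate-back idea concrete, here is a minimal scalar sketch in C.  It is only an element-level model under my own assumptions, not the actual expansion in the patch: the names rotate_elems and vec_set_var are hypothetical, and element 0 stands for the word that xxinsertw writes in the dumps above.

```c
#include <stdint.h>

#define NELT 4

/* Hypothetical helper: rotate so that out[i] = in[(i + n) % NELT].  */
static void rotate_elems(const uint32_t in[NELT], uint32_t out[NELT],
                         unsigned n)
{
    for (unsigned i = 0; i < NELT; i++)
        out[i] = in[(i + n) % NELT];
}

/* Hypothetical scalar model of the three-instruction sequence.  */
void vec_set_var(uint32_t v[NELT], unsigned idx, uint32_t val)
{
    uint32_t tmp[NELT];

    idx &= NELT - 1;                   /* keep idx in range, like rlwinm */
    rotate_elems(v, tmp, idx);         /* first xxperm: v[idx] moves to slot 0 */
    tmp[0] = val;                      /* xxinsertw: insert at the fixed slot */
    rotate_elems(tmp, v, (NELT - idx) % NELT);  /* second xxperm: rotate back */
}
```

With v = {1, 2, 3, 4}, idx = 1 and val = 0xfe this leaves v = {1, 0xfe, 3, 4}, which corresponds to the second vs34 value listed above.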

All of these instructions operate on registers only.  As Segher replied, Power9
supports only fixed-position inserts, so we need a trick like this to support
variable positions instead of generating a short store followed by a wide load.


> 
> This might be a generic viable expansion strategy btw,
> which is why I asked before whether the CPU supports
> inserts at a variable position ...  the building blocks are
> already there with vec_set at constant zero position
> plus vec_perm_const for the rotates.
> 
> But well, I did ask this question.  Multiple times.
> 
> ppc does _not_ have a VSX instruction
> like xxinsertw r34, r8, r12 where r8 denotes
> the vector element (or byte position or whatever).
> 
> So I don't think vec_set with a variable index is the
> best approach.
> Xionghu - you said even without the patch the stack
> storage is eventually elided but
> 
>          addi 9,1,-16
>          rldic 6,6,2,60
>          stxv 34,-16(1)
>          stwx 5,9,6
>          lxv 34,-16(1)
> 
> still shows stack(?) store/load with a bad STLF penalty.


Sorry if I didn't describe this clearly or misunderstood you.  I meant that if
many instructions are inserted (tested with a loop) between "stwx 5,9,6" and
"lxv 34,-16(1)", the store-hit-load performance penalty can be avoided, but that
is not the solution we want.

I also changed your test as below and built it for x86; it seems to generate
inefficient code there as well?  Maybe what my patch does is a different use case
from the one you pasted?

#define N 32
typedef int T;
typedef T V __attribute__((vector_size(N)));
V setg3 (V v, int idx, T val)
{
    v[idx&31] = val;
    return v;
}

-O2 -S -mavx -march=znver2:

setg3:
        push    rbp
        and     edi, 31
        mov     rbp, rsp
        and     rsp, -32
        vmovdqa YMMWORD PTR [rsp-32], ymm0
        mov     DWORD PTR [rsp-32+rdi*4], esi
        vmovdqa ymm0, YMMWORD PTR [rsp-32]
        leave
        ret


When idx is constant:

setg3:
        vpinsrd xmm1, xmm0, esi, 3
        vinserti128     ymm0, ymm0, xmm1, 0x0
        ret

And ARM with -O2 -S -march=armv8.2-a+sve (N changed to 16):

setg3:
        sub     sp, sp, #16
        and     x0, x0, 15
        str     q0, [sp]
        str     w1, [sp, x0, lsl 2]
        ldr     q0, [sp]
        add     sp, sp, 16
        ret

When idx is constant:

setg3:
        ins     v0.s[3], w1
        ret


However, I have no idea how to optimize this on x86 and ARM with vector
instructions so as to avoid a short store followed by a wide load on the stack.


Thanks,
Xionghu
