Hi, On 2020/9/1 21:07, Richard Biener wrote: > On Tue, Sep 1, 2020 at 10:11 AM luoxhu via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: >> >> Hi, >> >> On 2020/9/1 01:04, Segher Boessenkool wrote: >>> Hi! >>> >>> On Mon, Aug 31, 2020 at 04:06:47AM -0500, Xiong Hu Luo wrote: >>>> vec_insert accepts 3 arguments, arg0 is input vector, arg1 is the value >>>> to be insert, arg2 is the place to insert arg1 to arg0. This patch adds >>>> __builtin_vec_insert_v4si[v4sf,v2di,v2df,v8hi,v16qi] for vec_insert to >>>> not expand too early in gimple stage if arg2 is variable, to avoid generate >>>> store hit load instructions. >>>> >>>> For Power9 V4SI: >>>> addi 9,1,-16 >>>> rldic 6,6,2,60 >>>> stxv 34,-16(1) >>>> stwx 5,9,6 >>>> lxv 34,-16(1) >>>> => >>>> addis 9,2,.LC0@toc@ha >>>> addi 9,9,.LC0@toc@l >>>> mtvsrwz 33,5 >>>> lxv 32,0(9) >>>> sradi 9,6,2 >>>> addze 9,9 >>>> sldi 9,9,2 >>>> subf 9,9,6 >>>> subfic 9,9,3 >>>> sldi 9,9,2 >>>> subfic 9,9,20 >>>> lvsl 13,0,9 >>>> xxperm 33,33,45 >>>> xxperm 32,32,45 >>>> xxsel 34,34,33,32 >>> >>> For v a V4SI, x a SI, j some int, what do we generate for >>> v[j&3] = x; >>> ? >>> This should be exactly the same as we generate for >>> vec_insert(x, v, j); >>> (the builtin does a modulo 4 automatically). >> >> No, even with my patch "stxv 34,-16(1);stwx 5,9,6;lxv 34,-16(1)" generated >> currently. >> Is it feasible and acceptable to expand some kind of pattern in expander >> directly without >> builtin transition? >> >> I borrowed some of implementation from vec_extract. For vec_extract, the >> issue also exists: >> >> source: gimple: >> expand: asm: >> 1) i = vec_extract (v, n); => i = __builtin_vec_ext_v4si (v, n); => >> {r120:SI=unspec[r118:V4SI,r119:DI] 134;...} => slwi 9,6,2 vextuwrx >> 3,9,2 >> 2) i = vec_extract (v, 3); => i = __builtin_vec_ext_v4si (v, 3); => >> {r120:SI=vec_select(r118:V4SI,parallel)...} => li 9,12 vextuwrx >> 3,9,2 >> 3) i = v[n%4]; => _1 = n & 3; i = VIEW_CONVERT_EXPR<int[4]>(v)[_1]; => >> ... => stxv 34,-16(1);addi 9,1,-16; rldic 5,5,2,60; lwax >> 3,9,5 >> 4) i = v[3]; => i = BIT_FIELD_REF <v, 32, 96>; => >> {r120:SI=vec_select(r118:V4SI,parallel)...} => li 9,12; vextuwrx 3,9,2 > > Why are 1) and 2) handled differently than 3)/4)?
1) and 2) are calling builtin function vec_extract, it is defined to __builtin_vec_extract and will be resolved to ALTIVEC_BUILTIN_VEC_EXTRACT by resolve_overloaded_builtin, to generate a call __builtin_vec_ext_v4si to be expanded only in RTL. 3) is access variable v as array type with opcode VIEW_CONVERT_EXPR, I guess we should also generate builtin call instead of calling convert_vector_to_array_for_subscript to generate VIEW_CONVERT_EXPR expression for such kind of usage. 4) is translated to BIT_FIELD_REF with constant bitstart and bitsize, variable v could also be accessed by register instead of stack, so optabs could match the rs6000_expand_vector_insert to generate expected instruction through extract_bit_field. > >> Case 3) also couldn't handle the similar usage, and case 4) doesn't generate >> builtin as expected, >> it just expand to vec_select by coincidence. So does this mean both >> vec_insert and vec_extract >> and all other similar vector builtins should use IFN as suggested by Richard >> Biener, to match the >> pattern in gimple and expand both constant and variable index in expander? >> Will this also be >> beneficial for other targets except power? Or we should do that gradually >> after this patch >> approved as it seems another independent issue? Thanks:) > > If the code generated for 3)/4) isn't optimal you have to figure why > by tracing the RTL > expansion code and looking for missing optabs. > > Consider the amount of backend code you need to write if ever using > those in constexpr > context ... It seems too complicated to expand the "i = VIEW_CONVERT_EXPR<int[4]>(v)[_1];" or "VIEW_CONVERT_EXPR<int[4]>(v1)[_1] = i_6(D);" to rs6000_expand_vector_insert/rs6000_expand_vector_extract in RTL, as: 1) Vector v is stored to stack with array type; need extra load and store operation. 2) Requires amount of code to decompose VIEW_CONVERT_EXPR to extract the vector and index then call rs6000_expand_vector_insert/rs6000_expand_vector_extract. which means replace following instructions #9~#12 to new instruction sequences: 1: NOTE_INSN_DELETED 6: NOTE_INSN_BASIC_BLOCK 2 2: r119:V4SI=%2:V4SI 3: r120:DI=%5:DI 4: r121:DI=%6:DI 5: NOTE_INSN_FUNCTION_BEG 8: [r112:DI]=r119:V4SI 9: r122:DI=r121:DI&0x3 10: r123:DI=r122:DI<<0x2 11: r124:DI=r112:DI+r123:DI 12: [r124:DI]=r120:DI#0 13: r118:V4SI=[r112:DI] 17: %2:V4SI=r118:V4SI 18: use %2:V4SI => 1: NOTE_INSN_DELETED 6: NOTE_INSN_BASIC_BLOCK 2 2: r119:V4SI=%2:V4SI 3: r120:DI=%5:DI 4: r121:DI=%6:DI 5: NOTE_INSN_FUNCTION_BEG 8: [r112:DI]=r119:V4SI r130:V4SI=[r112:DI] rs6000_expand_vector_insert (r130, r121:DI&0x3, r120:DI#0) [r112:DI]=r130:V4SI 13: r118:V4SI=[r112:DI] 17: %2:V4SI=r118:V4SI 18: use %2:V4SI so maybe bypass convert_vector_to_array_for_subscript for special circumstance like "i = v[n%4]" or "v[n&3]=i" to generate vec_extract or vec_insert builtin call a relative simpler method? Xionghu