On 11/17/2017 10:30 PM, Peter Bergner wrote: > On POWER8 in little endian mode, the lxvd2x and stxvd2x instructions we use > for loading and storing vectors do not perform a byte swap as part of their > operation. This means we have to explicitly byte swap a vector before we > store to memory and byte swap it after we load it from memory. These swaps > are explicit in our RTL and we decorate our load/stores with swaps as well. > A store operation would then look something like: > > (insn 9 8 10 (set (reg:V4SI 126 [ bD.2787 ]) > (vec_select:V4SI (reg/v:V4SI 124 [ bD.2787 ]) > (parallel [ > (const_int 2 [0x2]) > (const_int 3 [0x3]) > (const_int 0 [0]) > (const_int 1 [0x1]) > ]))) "pr69493-6.c":13 -1 > (nil)) > > (insn 10 9 0 (set (mem/c:V4SI (plus:DI (reg/f:DI 116 virtual-stack-vars) > (const_int 16 [0x10])) [1 MEM[(struct *)&D.2792 + 16B]+0 S16 > A128]) > (vec_select:V4SI (reg:V4SI 126 [ bD.2787 ]) > (parallel [ > (const_int 2 [0x2]) > (const_int 3 [0x3]) > (const_int 0 [0]) > (const_int 1 [0x1]) > ]))) "pr69493-6.c":13 -1 > (nil)) > > There is similar code for loads. Correctness wise, this is all fine and > dandy, > but the swaps are inhibiting optimization of simple test cases. For example, > the following test case passes in two vector arguments and returns a struct > that holds those vectors. > > typedef struct > { > __vector int vx0; > __vector int vx1; > } vec_t; > > vec_t > test_big_double (__vector int a, __vector int b) > { > vec_t result; > result.vx0 = a; > result.vx1 = b; > return result; > } > > For our ELFv2 ABI, the vectors are passed in via registers vs34 and vs35 > and the struct is returned in registers, also vs34 and vs35. Ideally, this > should be one large nop and the only instruction generated should be the > blr return insn...and that is what we get on POWER8 BE when compiling with > -mabi=elfv2 (elfv1 doesn't return structs via regs). > > However, on POWER8 LE, we generate the horrible and useless code: > > addi 8,1,-96 > li 10,32 > xxpermdi 34,34,34,2 > xxpermdi 35,35,35,2 > li 9,48 > stxvd2x 34,8,10 > stxvd2x 35,8,9 > lxvd2x 34,8,10 > lxvd2x 35,8,9 > xxpermdi 34,34,34,2 > xxpermdi 35,35,35,2 > blr > > Looking at what happens on BE (-O2 -mcpu=power8 -mabi=elfv2), we start with > the following gimple just before expand: > > test_big_double (__vector signed intD.1461 aD.2786, __vector signed intD.1461 > bD.2787) > { > __vector signed intD.1461 a_2(D) = aD.2786; > __vector signed intD.1461 b_3(D) = bD.2787; > struct vec_tD.2785 D.2792; > > ;; basic block 2, loop depth 0 > ;; pred: ENTRY > # .MEM_4 = VDEF <.MEM_1(D)> > MEM[(struct *)&D.2792] = a_2(D); > # .MEM_5 = VDEF <.MEM_4> > MEM[(struct *)&D.2792 + 16B] = b_3(D); > # VUSE <.MEM_5> > return D.2792; > ;; succ: EXIT > > } > > This gets expanded to (only showing the rtl for the first vector field): > > (insn 2 5 3 2 (set (reg/v:V4SI 123 [ aD.2713 ]) > (reg:V4SI 79 2 [ aD.2713 ])) "pr69493-6.c":9 1121 {*vsx_movv4si_64bit} > (nil)) > (insn 21 4 7 2 (set (reg:DI 127) > (const_int 32 [0x20])) "pr69493-6.c":13 -1 > (nil)) > (insn 7 21 22 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 111 sfp) > (reg:DI 127)) [1 MEM[(struct *)&D.2719]+0 S16 A128]) > (reg/v:V4SI 123 [ aD.2713 ])) "pr69493-6.c":13 1121 > {*vsx_movv4si_64bit} > (nil)) > (insn 23 8 9 2 (set (reg:DI 129) > (const_int 32 [0x20])) "pr69493-6.c":13 -1 > (nil)) > (insn 9 23 24 2 (set (reg:V4SI 125) > (mem/c:V4SI (plus:DI (reg/f:DI 111 sfp) > (reg:DI 129)) [1 D.2719+0 S16 A128])) "pr69493-6.c":13 1121 > {*vsx_movv4si_64bit} > (nil)) > (insn 11 10 12 2 (set (reg:V4SI 121 [ <retval> ]) > (reg:V4SI 125)) "pr69493-6.c":13 1121 {*vsx_movv4si_64bit} > (nil)) > (insn 16 12 17 2 (set (reg:V4SI 79 2) > (reg:V4SI 121 [ <retval> ])) "pr69493-6.c":14 1121 > {*vsx_movv4si_64bit} > (nil)) > (insn 18 17 19 2 (use (reg:V4SI 79 2)) "pr69493-6.c":14 -1 > (nil)) > > CSE1 comes along and replaces the MEM in insn 9 with pseudo 123 since they > are equivalent. This makes the store in insn 7 dead and DSE deletes it later. > This leaves us with simple reg copies which all get cleaned up leaving us > with a simple return. CSE sees the equivalence between pseudo 123 and the > MEM via the assignment in insn 7. However, on LE, we get the following > expanded rtl (again, only showing the rtl for the first vector field): > > (insn 2 5 3 2 (set (reg/v:V4SI 123 [ aD.2786 ]) > (reg:V4SI 79 2 [ aD.2786 ])) "pr69493-6.c":9 1047 {*vsx_movv4si_64bit} > (nil)) > > (insn 7 4 25 2 (set (reg:V4SI 125 [ aD.2786 ]) > (vec_select:V4SI (reg/v:V4SI 123 [ aD.2786 ]) > (parallel [ > (const_int 2 [0x2]) > (const_int 3 [0x3]) > (const_int 0 [0]) > (const_int 1 [0x1]) > ]))) "pr69493-6.c":13 1218 {*vsx_xxpermdi4_le_v4si} > (nil)) > (insn 25 7 8 2 (set (reg:DI 131) > (const_int 32 [0x20])) "pr69493-6.c":13 -1 > (nil)) > (insn 8 25 9 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 111 sfp) > (reg:DI 131)) [1 MEM[(struct *)&D.2792]+0 S16 A128]) > (vec_select:V4SI (reg:V4SI 125 [ aD.2786 ]) > (parallel [ > (const_int 2 [0x2]) > (const_int 3 [0x3]) > (const_int 0 [0]) > (const_int 1 [0x1]) > ]))) "pr69493-6.c":13 1230 {*vsx_stxvd2x4_le_v4si} > (nil)) > > > (insn 27 10 11 2 (set (reg:DI 133) > (const_int 32 [0x20])) "pr69493-6.c":13 -1 > (nil)) > (insn 11 27 12 2 (set (reg:V4SI 128) > (vec_select:V4SI (mem/c:V4SI (plus:DI (reg/f:DI 111 sfp) > (reg:DI 133)) [1 D.2792+0 S16 A128]) > (parallel [ > (const_int 2 [0x2]) > (const_int 3 [0x3]) > (const_int 0 [0]) > (const_int 1 [0x1]) > ]))) "pr69493-6.c":13 1224 {*vsx_lxvd2x4_le_v4si} > (nil)) > (insn 12 11 28 2 (set (reg:V4SI 127) > (vec_select:V4SI (reg:V4SI 128) > (parallel [ > (const_int 2 [0x2]) > (const_int 3 [0x3]) > (const_int 0 [0]) > (const_int 1 [0x1]) > ]))) "pr69493-6.c":13 1218 {*vsx_xxpermdi4_le_v4si} > (nil)) > ... > > In this case, we have pseudo 125 is equivalent to a byte swapped pseudo 123, > pseudo 123 is equivalent to a byte swapped MEM and MEM is equivalent to a > byte swapped pseudo 125. Or in pseudo code, we have the following > equivalences: > > equiv buckets after insn 7: > *) 125 equiv SWAP(123), > equiv buckets after insn 8: > *) 125 equiv SWAP(123), > *) MEM equiv SWAP(125) > > When we scan insn 11, we do not see MEM being equivalent to pseudo 123, > so we don't do the replacement. > > However, if we look closer, we have MEM is equiv to SWAP(125), which is > the same as SWAP(SWAP(123)), which is 123, since SWAP is an involutory > operation! CSE doesn't see this, because we don't have the right > equivalences recorded into the equivalence table. > > My idea on how to "fix" this (which I'd like some comments on), is to check > when we insert an equivalence between A and B into the table to is check > whether B is an expression with an involutory operation OP(INNER_B) (where OP > is a byte swap in our case) and if so, then I also insert an equivalence > between INNER_B and OP(A). If we look at how that handles the test case > above, > we get: > > equiv buckets after insn 7: > *) 125 equiv SWAP(123), > *) 123 equiv SWAP(125) > > Now, when we scan insn 8, we add the normal equivalence MEM equiv SWAP(125), > and we see that SWAP(125) already is equivalent to 123, so we add MEM to > that equivalence bucket. We then add the involutory equivalence of > 125 equiv SWAP(MEM), which also already exists, leading us to the following > equivalence buckets after insn 8: > > equiv buckets after insn 8: > *) 125 equiv SWAP(123) equiv SWAP(MEM) > *) 123 equiv SWAP(125) equiv MEM > > Now, when we scan insn 11, we see that MEM is equivalent to pseudo 123 > and we make the replacement, which leads us to optimize the function > similarly to how BE did, which results in just a "blr" insn being generated. > > So it seems like I'm done! ...unfortunately no. :-( If I change the vectors > from V4SI to V2DF like the test case below: > > typedef struct > { > __vector double vx0; > __vector double vx1; > } vec_t; > > vec_t > test_big_double (__vector double a, __vector double b) > { > vec_t result; > result.vx0 = a; > result.vx1 = b; > return result; > } > > ...then I again have problems. In this case, it's due to exp_equiv_p() not > noticing when an expression is equivalent to itself, causing us to not > insert one of our new involutory equivalences into an preexisting equivalence > bucket, but into a new equivalence bucket, which causes us to not notice > the equivalence we need to make the replacement. I debugged this down to > what I think is a bug in exp_equiv_p(). It uses the test: > > if (REG_IN_TABLE(i) != REG_TICK(i)) > return 0; > > to try and determine whether a register/pseudo has just been set after we've > already processed an earlier reference to that register/pseudo. If so, we > reject matching it by returning 0. It seems to me, that we should only reject > these equivalences when we've actually seen an earlier register/pseudo > reference, > meaning when REG_IN_TABLE(i) != -1. Changing the above test to: > > if (REG_IN_TABLE (i) != -1 && REG_IN_TABLE (i) != REG_TICK (i)) > return 0; > > ...fixes that problem and again, we can optimize this new test case to just > a "blr" insn. > > There are still some cases I cannot optimize yet, let the above 2nd test case, > but instead if passing in vectors, I pass in a struct and assign its fields > one by one, rather than doing a struct assignment, like so: > > typedef struct > { > __vector double vx0; > __vector double vx1; > } vec_t; > > vec_t > test_big_double (vec_t a) > { > vec_t result; > result.vx0 = a.vx0; > result.vx1 = a.vx1; > return result; > } > > Then I get the same bad code as the original test case. Again, if we change > the type from double to int, I get optimal code. > > Question for people, am I attacking this problem the correct way? Do people > see a problem with me adding extra equivalences to the equivalence table? > If so, do you have a different suggestion on how to attack this problem? > > I have attached the patch I am using below. The "important" changes are > to cse.c. The change to the rs6000 files are basically just a way to telling > cse.c that the vec_select it's seeing is a involutory byte swap. I did > have to make one change to rs6000_rtx_costs() to modify the cost of the > byte swap, because a byte swap of even a reg was many many time higher > than a simple bare mem operand. You might wander a bit and see if/how cse handles other similar circumstances. For example (not (not (x)) (neg (neg x)) and (bswap (bswap (x))
THe last would seem particularly interesting -- as a hack see if you can generate a bswap instead of vec_select at expansion time, then see if CSE is able to fix it up. Or perhaps set it as a REG_EQUAL note. Again, it's a hack, but you just want to see if CSE can see the equivalence if it's in a more common form. Though I'm not sure if BSWAP is handled significantly better than an equivalence VEC_SELECT. If it is, trace how it happens. Similarly see if anything useful gets recorded for the other similar operators. There may be a simplification missing somewhere that's preventing cse from seeing the equivalences. It's been a long time since I worked regularly in this code. Jeff