Re: [RFC][PATCH. CSE. rs6000] Update CSE to handle involutory operations

Jeff Law Fri, 17 Nov 2017 22:17:39 -0800

On 11/17/2017 10:30 PM, Peter Bergner wrote:
> On POWER8 in little endian mode, the lxvd2x and stxvd2x instructions we use
> for loading and storing vectors do not perform a byte swap as part of their
> operation.  This means we have to explicitly byte swap a vector before we
> store to memory and byte swap it after we load it from memory.  These swaps
> are explicit in our RTL and we decorate our load/stores with swaps as well.
> A store operation would then look something like:
> 
> (insn 9 8 10 (set (reg:V4SI 126 [ bD.2787 ])
>         (vec_select:V4SI (reg/v:V4SI 124 [ bD.2787 ])
>             (parallel [
>                     (const_int 2 [0x2])
>                     (const_int 3 [0x3])
>                     (const_int 0 [0])
>                     (const_int 1 [0x1])
>                 ]))) "pr69493-6.c":13 -1
>      (nil))
> 
> (insn 10 9 0 (set (mem/c:V4SI (plus:DI (reg/f:DI 116 virtual-stack-vars)
>                 (const_int 16 [0x10])) [1 MEM[(struct  *)&D.2792 + 16B]+0 S16 
> A128])
>         (vec_select:V4SI (reg:V4SI 126 [ bD.2787 ])
>             (parallel [
>                     (const_int 2 [0x2])
>                     (const_int 3 [0x3])
>                     (const_int 0 [0])
>                     (const_int 1 [0x1])
>                 ]))) "pr69493-6.c":13 -1
>      (nil))
> 
> There is similar code for loads.  Correctness wise, this is all fine and 
> dandy,
> but the swaps are inhibiting optimization of simple test cases.  For example,
> the following test case passes in two vector arguments and returns a struct
> that holds those vectors.
> 
> typedef struct
> {
>   __vector int vx0;
>   __vector int vx1;
> } vec_t;
> 
> vec_t
> test_big_double (__vector int a, __vector int b)
> {
>   vec_t result;
>   result.vx0 = a;
>   result.vx1 = b;
>   return result;
> }
> 
> For our ELFv2 ABI, the vectors are passed in via registers vs34 and vs35
> and the struct is returned in registers, also vs34 and vs35.  Ideally, this
> should be one large nop and the only instruction generated should be the
> blr return insn...and that is what we get on POWER8 BE when compiling with
> -mabi=elfv2 (elfv1 doesn't return structs via regs).
> 
> However, on POWER8 LE, we generate the horrible and useless code:
> 
>       addi 8,1,-96
>       li 10,32
>       xxpermdi 34,34,34,2
>       xxpermdi 35,35,35,2
>       li 9,48
>       stxvd2x 34,8,10
>       stxvd2x 35,8,9
>       lxvd2x 34,8,10
>       lxvd2x 35,8,9
>       xxpermdi 34,34,34,2
>       xxpermdi 35,35,35,2
>       blr
> 
> Looking at what happens on BE (-O2 -mcpu=power8 -mabi=elfv2), we start with
> the following gimple just before expand:
> 
> test_big_double (__vector signed intD.1461 aD.2786, __vector signed intD.1461 
> bD.2787)
> {
>   __vector signed intD.1461 a_2(D) = aD.2786;
>   __vector signed intD.1461 b_3(D) = bD.2787;
>   struct vec_tD.2785 D.2792;
> 
> ;;   basic block 2, loop depth 0
> ;;    pred:       ENTRY
>   # .MEM_4 = VDEF <.MEM_1(D)>
>   MEM[(struct  *)&D.2792] = a_2(D);
>   # .MEM_5 = VDEF <.MEM_4>
>   MEM[(struct  *)&D.2792 + 16B] = b_3(D);
>   # VUSE <.MEM_5>
>   return D.2792;
> ;;    succ:       EXIT
> 
> }
> 
> This gets expanded to (only showing the rtl for the first vector field):
> 
> (insn 2 5 3 2 (set (reg/v:V4SI 123 [ aD.2713 ])
>         (reg:V4SI 79 2 [ aD.2713 ])) "pr69493-6.c":9 1121 {*vsx_movv4si_64bit}
>      (nil))
> (insn 21 4 7 2 (set (reg:DI 127)
>         (const_int 32 [0x20])) "pr69493-6.c":13 -1
>      (nil))
> (insn 7 21 22 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 111 sfp)
>                 (reg:DI 127)) [1 MEM[(struct  *)&D.2719]+0 S16 A128])
>         (reg/v:V4SI 123 [ aD.2713 ])) "pr69493-6.c":13 1121 
> {*vsx_movv4si_64bit}
>      (nil))
> (insn 23 8 9 2 (set (reg:DI 129)
>         (const_int 32 [0x20])) "pr69493-6.c":13 -1
>      (nil))
> (insn 9 23 24 2 (set (reg:V4SI 125)
>         (mem/c:V4SI (plus:DI (reg/f:DI 111 sfp)
>                 (reg:DI 129)) [1 D.2719+0 S16 A128])) "pr69493-6.c":13 1121 
> {*vsx_movv4si_64bit}
>      (nil))
> (insn 11 10 12 2 (set (reg:V4SI 121 [ <retval> ])
>         (reg:V4SI 125)) "pr69493-6.c":13 1121 {*vsx_movv4si_64bit}
>      (nil))
> (insn 16 12 17 2 (set (reg:V4SI 79 2)
>         (reg:V4SI 121 [ <retval> ])) "pr69493-6.c":14 1121 
> {*vsx_movv4si_64bit}
>      (nil))
> (insn 18 17 19 2 (use (reg:V4SI 79 2)) "pr69493-6.c":14 -1
>      (nil))
> 
> CSE1 comes along and replaces the MEM in insn 9 with pseudo 123 since they
> are equivalent.  This makes the store in insn 7 dead and DSE deletes it later.
> This leaves us with simple reg copies which all get cleaned up leaving us
> with a simple return.  CSE sees the equivalence between pseudo 123 and the
> MEM via the assignment in insn 7.  However, on LE, we get the following
> expanded rtl (again, only showing the rtl for the first vector field):
> 
> (insn 2 5 3 2 (set (reg/v:V4SI 123 [ aD.2786 ])
>         (reg:V4SI 79 2 [ aD.2786 ])) "pr69493-6.c":9 1047 {*vsx_movv4si_64bit}
>      (nil))
> 
> (insn 7 4 25 2 (set (reg:V4SI 125 [ aD.2786 ])
>         (vec_select:V4SI (reg/v:V4SI 123 [ aD.2786 ])
>             (parallel [
>                     (const_int 2 [0x2])
>                     (const_int 3 [0x3])
>                     (const_int 0 [0])
>                     (const_int 1 [0x1])
>                 ]))) "pr69493-6.c":13 1218 {*vsx_xxpermdi4_le_v4si}
>      (nil))
> (insn 25 7 8 2 (set (reg:DI 131)
>         (const_int 32 [0x20])) "pr69493-6.c":13 -1
>      (nil))
> (insn 8 25 9 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 111 sfp)
>                 (reg:DI 131)) [1 MEM[(struct  *)&D.2792]+0 S16 A128])
>         (vec_select:V4SI (reg:V4SI 125 [ aD.2786 ])
>             (parallel [
>                     (const_int 2 [0x2])
>                     (const_int 3 [0x3])
>                     (const_int 0 [0])
>                     (const_int 1 [0x1])
>                 ]))) "pr69493-6.c":13 1230 {*vsx_stxvd2x4_le_v4si}
>      (nil))
> 
> 
> (insn 27 10 11 2 (set (reg:DI 133)
>         (const_int 32 [0x20])) "pr69493-6.c":13 -1
>      (nil))
> (insn 11 27 12 2 (set (reg:V4SI 128)
>         (vec_select:V4SI (mem/c:V4SI (plus:DI (reg/f:DI 111 sfp)
>                     (reg:DI 133)) [1 D.2792+0 S16 A128])
>             (parallel [
>                     (const_int 2 [0x2])
>                     (const_int 3 [0x3])
>                     (const_int 0 [0])
>                     (const_int 1 [0x1])
>                 ]))) "pr69493-6.c":13 1224 {*vsx_lxvd2x4_le_v4si}
>      (nil))
> (insn 12 11 28 2 (set (reg:V4SI 127)
>         (vec_select:V4SI (reg:V4SI 128)
>             (parallel [
>                     (const_int 2 [0x2])
>                     (const_int 3 [0x3])
>                     (const_int 0 [0])
>                     (const_int 1 [0x1])
>                 ]))) "pr69493-6.c":13 1218 {*vsx_xxpermdi4_le_v4si}
>      (nil))
> ...
> 
> In this case, we have pseudo 125 is equivalent to a byte swapped pseudo 123,
> pseudo 123 is equivalent to a byte swapped MEM and MEM is equivalent to a
> byte swapped pseudo 125.  Or in pseudo code, we have the following 
> equivalences:
> 
> equiv buckets after insn 7:
>     *) 125 equiv SWAP(123),
> equiv buckets after insn 8:
>     *) 125 equiv SWAP(123),
>     *) MEM equiv SWAP(125)
> 
> When we scan insn 11, we do not see MEM being equivalent to pseudo 123,
> so we don't do the replacement.
> 
> However, if we look closer, we have MEM is equiv to SWAP(125), which is
> the same as SWAP(SWAP(123)), which is 123, since SWAP is an involutory
> operation!  CSE doesn't see this, because we don't have the right
> equivalences recorded into the equivalence table.
> 
> My idea on how to "fix" this (which I'd like some comments on), is to check
> when we insert an equivalence between A and B into the table to is check
> whether B is an expression with an involutory operation OP(INNER_B) (where OP
> is a byte swap in our case) and if so, then I also insert an equivalence
> between INNER_B and OP(A).  If we look at how that handles the test case 
> above,
> we get:
> 
> equiv buckets after insn 7:
>     *) 125 equiv SWAP(123),
>     *) 123 equiv SWAP(125)
> 
> Now, when we scan insn 8, we add the normal equivalence MEM equiv SWAP(125),
> and we see that SWAP(125) already is equivalent to 123, so we add MEM to
> that equivalence bucket.  We then add the involutory equivalence of
> 125 equiv SWAP(MEM), which also already exists, leading us to the following
> equivalence buckets after insn 8:
> 
> equiv buckets after insn 8:
>     *) 125 equiv SWAP(123) equiv SWAP(MEM)
>     *) 123 equiv SWAP(125) equiv MEM
> 
> Now, when we scan insn 11, we see that MEM is equivalent to pseudo 123
> and we make the replacement, which leads us to optimize the function
> similarly to how BE did, which results in just a "blr" insn being generated.
> 
> So it seems like I'm done! ...unfortunately no. :-(  If I change the vectors
> from V4SI to V2DF like the test case below:
> 
> typedef struct
> {
>   __vector double vx0;
>   __vector double vx1;
> } vec_t;
> 
> vec_t
> test_big_double (__vector double a, __vector double b)
> {
>   vec_t result;
>   result.vx0 = a;
>   result.vx1 = b;
>   return result;
> }
> 
> ...then I again have problems.  In this case, it's due to exp_equiv_p() not
> noticing when an expression is equivalent to itself, causing us to not
> insert one of our new involutory equivalences into an preexisting equivalence
> bucket, but into a new equivalence bucket, which causes us to not notice
> the equivalence we need to make the replacement.  I debugged this down to
> what I think is a bug in exp_equiv_p().  It uses the test:
> 
>   if (REG_IN_TABLE(i) != REG_TICK(i))
>     return 0;
> 
> to try and determine whether a register/pseudo has just been set after we've
> already processed an earlier reference to that register/pseudo.  If so, we
> reject matching it by returning 0. It seems to me, that we should only reject
> these equivalences when we've actually seen an earlier register/pseudo 
> reference,
> meaning when REG_IN_TABLE(i) != -1.  Changing the above test to:
> 
>   if (REG_IN_TABLE (i) != -1 && REG_IN_TABLE (i) != REG_TICK (i))
>     return 0;
> 
> ...fixes that problem and again, we can optimize this new test case to just
> a "blr" insn.
> 
> There are still some cases I cannot optimize yet, let the above 2nd test case,
> but instead if passing in vectors, I pass in a struct and assign its fields
> one by one, rather than doing a struct assignment, like so:
> 
> typedef struct
> {
>   __vector double vx0;
>   __vector double vx1;
> } vec_t;
> 
> vec_t
> test_big_double (vec_t a)
> {
>   vec_t result;
>   result.vx0 = a.vx0;
>   result.vx1 = a.vx1;
>   return result;
> }
> 
> Then I get the same bad code as the original test case.  Again, if we change
> the type from double to int, I get optimal code.
> 
> Question for people, am I attacking this problem the correct way?  Do people
> see a problem with me adding extra equivalences to the equivalence table?
> If so, do you have a different suggestion on how to attack this problem?
> 
> I have attached the patch I am using below.  The "important" changes are
> to cse.c.  The change to the rs6000 files are basically just a way to telling
> cse.c that the vec_select it's seeing is a involutory byte swap.  I did
> have to make one change to rs6000_rtx_costs() to modify the cost of the
> byte swap, because a byte swap of even a reg was many many time higher
> than a simple bare mem operand.
You might wander a bit and see if/how cse handles other similar
circumstances.  For example (not (not (x))  (neg (neg x)) and (bswap
(bswap (x))


THe last would seem particularly interesting -- as a hack see if you can
generate a bswap instead of vec_select at expansion time, then see if
CSE is able to fix it up.  Or perhaps set it as a REG_EQUAL note.
Again, it's a hack, but you just want to see if CSE can see the
equivalence if it's in a more common form.  Though I'm not sure if BSWAP
is handled significantly better than an equivalence VEC_SELECT.


If it is, trace how it happens.  Similarly see if anything useful gets
recorded for the other similar operators.

There may be a simplification missing somewhere that's preventing cse
from seeing the equivalences.  It's been a long time since I worked
regularly in this code.


Jeff

Re: [RFC][PATCH. CSE. rs6000] Update CSE to handle involutory operations

Reply via email to