Ping.

> On 30 Nov 2017, at 11:03, Alan Hayward <alan.hayw...@arm.com> wrote:
> 
> 
>> On 27 Nov 2017, at 17:29, Jeff Law <l...@redhat.com> wrote:
>> 
>> On 11/23/2017 04:11 AM, Alan Hayward wrote:
>>> 
>>>> On 22 Nov 2017, at 17:33, Jeff Law <l...@redhat.com> wrote:
>>>> 
>>>> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>>>>> 
>>>>>> On 21 Nov 2017, at 03:13, Jeff Law <l...@redhat.com> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  I'd
>>>>>>>> totally forgotten about it.  And in fact it seems to come pretty close
>>>>>>>> to what you need…
>>>>>>> 
>>>>>>> Yes, some of the code is similar to the way
>>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>>>>> CLOBBER expr code served as a starting point for writing the patch.
>>>>>>> The main difference here is that _PART_CLOBBERED applies around all
>>>>>>> calls and is not tied to a specific instruction; it’s part of the
>>>>>>> calling ABI. Whereas clobber_high is explicitly tied to an
>>>>>>> expression (tls_desc). That meant there wasn’t really any
>>>>>>> opportunity to reuse any existing code.
>>>>>> Understood.  Though your first patch mentions that you're trying to
>>>>>> describe partial preservation "around TLS calls". Presumably those are
>>>>>> represented as normal insns, not call_insn.
>>>>>> 
>>>>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>>>>> to itself using whatever mode is wide enough to cover the neon part of
>>>>>> the register.
>>>>>> 
>>>>>> That should tell the generic parts of the compiler that you're just
>>>>>> clobbering the upper part and at least in theory you can implement in
>>>>>> the aarch64 backend and the rest of the compiler should "just work"
>>>>>> because that's the existing semantics of a subreg store.
>>>>>> 
>>>>>> The only worry would be if a pass tried to get overly smart and
>>>>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>>>>> wrong given the semantics of a partial store.
>>>>>> 
>>>>> 
>>>>> So, instead of using clobber_high(reg X), we would use set(reg X, reg X).
>>>>> It’s something we considered, and then dismissed.
>>>>> 
>>>>> The problem is that you are now using SET semantics on those
>>>>> registers, and that would make the register live across the function,
>>>>> which might not be the case. Whereas clobber semantics will just make
>>>>> the register dead - which is exactly what we want (but only
>>>>> conditionally).
>>>> ?!?  A set of the subreg is the *exact* semantics you want.  It says the
>>>> low part is preserved while the upper part is clobbered across the TLS
>>>> insns.
>>>> 
>>>> jeff
>>> 
>>> Consider where the TLS call is inside a loop. The compiler would
>>> normally want to hoist that out of the loop. By adding a set(x,x) into
>>> the parallel of the tls_desc we are now making x live across the loop:
>>> x is dependent on the value from the previous iteration, and the
>>> tls_desc can no longer be hoisted.
>> Hmm.  I think I see the problem you're trying to point out.  Let me
>> restate it and see if you agree.
>> 
>> The low subreg set does clearly indicate the upper part of the SVE
>> register is clobbered.  The problem is from a liveness standpoint the
>> compiler is considering the low part live, even though it's a self-set.
>> 
> 
> Yes.
> 
>> In fact, if that is the case, then a single TLS call (independent of a
>> loop) would make the low part of the register globally live.  This
>> should be testable.  Include one of these low part self sets on the
>> existing TLS calls and compile a little test function and let's look at
>> the liveness data.
>> 
> 
> 
> Ok, so I’ve put 
> (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
> (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
> etc up to V31 in the UNSPEC_TLSDESC pattern in aarch64.md, using up all the 
> neon registers.
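
As a sketch, the pattern with those self-sets added looks something like
this (illustrative RTL only - the operand details and the rest of the
parallel are paraphrased, not copied from the actual aarch64.md pattern):

```lisp
;; Hypothetical sketch of UNSPEC_TLSDESC with the extra self-sets.
;; Only the (set (reg:TI Vn) (reg:TI Vn)) lines reflect the change
;; under discussion; the surrounding parallel is paraphrased.
(parallel
  [(set (reg:DI R0_REGNUM)
        (unspec:DI [(match_operand 0 "" "")] UNSPEC_TLSDESC))
   (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))  ; low 128 bits preserved,
   (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))  ; upper part clobbered
   ;; ... and so on up to V31_REGNUM ...
   (clobber (reg:DI LR_REGNUM))])
```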
> 
> In an ideal world, this change would do nothing as neon reg size is TI.
> 
> First, I compiled sve_tls_preserve_1.c (Test1 below).
> 
> Without the SET changes, gcc does not back up any neon registers before
> the tlsdesc call (good).
> With the SET changes, gcc backs up all the neon registers (not ideal).
> 
> Without the SET changes, dump logs give:
> 
> cse1:
> ;;  regs ever live     0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 66 [cc]
> 
> cprop1:
> ;; lr  in      29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; lr  use     29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; lr  def     0 [x0] 30 [x30] 32 [v0] 66 [cc] 79 80 81 82 84 85 86 87 88 89
> ;; live  in    29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; live  gen   0 [x0] 32 [v0] 78 79 80 81 82 83 84 85 86 87 88 89
> ;; live  kill  30 [x30] 66 [cc]
> ;; lr  out     29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> ;; live  out   29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> 
> reload:
> ;;  regs ever live     0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 
> 34 [v2] 35 [v3] 36 [v4] 66 [cc]
> 
> With the set changes:
> 
> cse1:
> ;;  regs ever live     0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 
> [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 
> [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 
> 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 
> [v30] 63 [v31] 66 [cc]
> 
> cprop1:
> ;; lr  in      29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 
> [v31] 64 [sfp] 65 [ap]
> ;; lr  use     29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 
> [v31] 64 [sfp] 65 [ap]
> ;; lr  def     0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 
> [v31] 66 [cc] 79 80 81 82 84 85 86 87 88 89
> ;; live  in    29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 
> [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
> ;; live  gen   0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 
> 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 
> 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 
> [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 78 79 80 
> 81 82 83 84 85 86 87 88 89
> ;; live  kill  30 [x30] 66 [cc]
> ;; lr  out     29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> ;; live  out   29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> 
> reload:
> ;;  regs ever live     0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 
> 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 
> [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 
> 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 
> [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
> 
> 
> Is there any extra dump data you think would be useful?
> 
> 
> For completeness, I also ran through some more tests…
> 
> 
> 
> Second, I tried putting the tls call inside a simple for loop (Test2 below).
> 
> Both before and after the SET changes, gcc hoisted the tlsdesc call out
> of the loop.
> This has already happened before the expand pass.
> (And as before, with the SET changes, gcc backs up all the neon
> registers before the tlsdesc call.)
> 
> 
> Third, I made the tls variable an array, and accessed elements from it
> in the loop (Test3 below).
> 
> Both before and after the SET changes, gcc fails to hoist the tlsdesc
> call.
> That’s a missed optimisation opportunity - I’m not sure how tricky or
> worthwhile that’d be to fix. (At the tree level the tls load looks just
> like a MEM base-plus-index access; I guess you’d need something specific
> in an rtl pass to do the hoist.)
> Anyway, we now have a tlsdesc call inside a loop, which is what I wanted.
> Without the SET changes, nothing is backed up whilst inside the loop.
> With the SET changes, gcc tries to back up the live neon registers, and
> ICEs with "error: unable to find a register to spill" in lra-assigns.c.
> 
> 
> Fourth, I took the original test and made two tls accesses in a row
> with an asm volatile in-between (Test4 below).
> 
> With the SET changes, gcc ICEs in the same way.
> 
> Looking at the dump files: without the SET changes, the dump logs for
> the final test are similar to those of the very first test:
> 
> ;;  regs ever live     0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 
> 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 
> [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 
> 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 
> [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
> 
> ;; lr  in      29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; lr  use     29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; lr  def     0 [x0] 30 [x30] 32 [v0] 66 [cc] 81 84 85 86 88 89 90 91 95 96 
> 97 98 99 100
> ;; live  in    29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; live  gen   0 [x0] 32 [v0] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
> ;; live  kill  30 [x30] 66 [cc]
> ;; lr  out     29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> ;; live  out   29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> 
> 
> With the SET changes:
> 
> ;;  regs ever live     0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 31 [sp] 32 [v0] 
> 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 66 [cc]
> 
> ;; lr  in      29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 
> [v31] 64 [sfp] 65 [ap]
> ;; lr  use     29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 
> [v31] 64 [sfp] 65 [ap]
> ;; lr  def     0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 
> [v31] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
> ;; live  in    29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 
> [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
> ;; live  gen   0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 
> 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 
> 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 
> [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 81 84 85 
> 86 88 89 90 91 92 95 96 97 98 99 100
> ;; live  kill  30 [x30] 66 [cc]
> ;; lr  out     29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> ;; live  out   29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> 
> 
> 
> 
> Test1:
> 
> __thread v4si tx;
> 
> v4si foo (v4si a, v4si b, v4si c)
> {
>  v4si y;
> 
>  y = a + tx + b + c;
> 
>  return y + 7;
> }
> 
> 
> Test2:
> 
> 
> __thread v4si tx;
> 
> v4si foo (v4si a, v4si *b, v4si *c)
> {
>  v4si y = {0, 0, 0, 0};
> 
>  for(int i=0; i<MAX; i++)
>    y += a + tx + b[i] + c[i];
> 
>  return y + 7;
> }
> 
> 
> Test3:
> 
> __thread v4si tx[MAX];
> 
> v4si foo (v4si a, v4si *b, v4si *c)
> {
>  v4si y = {0, 0, 0, 0};
> 
>  for(int i=0; i<MAX; i++)
>    y += a + tx[i] + b[i] + c[i];
> 
>  return y + 7;
> }
> 
> 
> Test4:
> 
> __thread v4si tx;
> __thread v4si ty;
> 
> v4si foo (v4si a, v4si b, v4si c)
> {
>  v4si y;
> 
>  y = a + tx + b + c;
> 
>  asm volatile ("" : : : "memory");
> 
>  y += a + ty + b + c;
> 
>  return y + 7;
> }
> 
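A note for anyone wanting to run the tests: as posted they omit the
vector typedef and MAX. Scaffolding along these lines (my guesses at the
definitions, not necessarily what was used originally) makes Test1
compile stand-alone:

```c
/* Illustrative scaffolding for the tests above; the typedef and the
   MAX value are assumptions, not taken from the original testcases.  */
typedef int v4si __attribute__ ((vector_size (16)));
#define MAX 100

__thread v4si tx;

/* Test1: one TLS access; the tlsdesc call materialises the address
   of tx, and the question is which neon registers get saved around it.  */
v4si foo (v4si a, v4si b, v4si c)
{
  v4si y;

  y = a + tx + b + c;

  return y + 7;
}
```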
> 
>> 
>> Now it could be the case that various local analysis could sub-optimally
>> handle things.  You mention LICM.  I know our original LICM did have a
>> problem in that if it saw a use of a hard reg in a loop without seeing a
>> set of that hard reg it considered the register varying within the loop.
>> I have no idea if we carried that forward when the loop code was
>> rewritten (when I looked at this it was circa 1992).
>> 
>> 
>>> 
>>> Or consider a stream of code containing two tls_desc calls (ok, the 
>>> compiler might
>>> optimise one of the tls calls away, but this approach should be reusable 
>>> for other exprs).
>>> Between the two set(x,x)’s x is considered live so the register allocator 
>>> can’t use that
>>> register.
>>> Given that we are applying this to all the neon registers, the register 
>>> allocator now throws
>>> an ICE because it can’t find any free hard neon registers to use.
>> Given your statements it sounds like the liveness infrastructure is
>> making those neon regs globally live when it sees the low part subreg
>> self-set.  Let's confirm that one way or the other and see where it
>> takes us.
>> 
>> Jeff
> 
