Ping.
> On 30 Nov 2017, at 11:03, Alan Hayward <alan.hayw...@arm.com> wrote:
>
>
>> On 27 Nov 2017, at 17:29, Jeff Law <l...@redhat.com> wrote:
>>
>> On 11/23/2017 04:11 AM, Alan Hayward wrote:
>>>
>>>> On 22 Nov 2017, at 17:33, Jeff Law <l...@redhat.com> wrote:
>>>>
>>>> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>>>>>
>>>>>> On 21 Nov 2017, at 03:13, Jeff Law <l...@redhat.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED. I'd
>>>>>>>> totally forgotten about it. And in fact it seems to come pretty close
>>>>>>>> to what you need…
>>>>>>>
>>>>>>> Yes, some of the code is similar to the way
>>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>>>>> CLOBBER expr code served as a starting point for writing the patch. The
>>>>>>> main difference
>>>>>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a
>>>>>>> specific Instruction,
>>>>>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied
>>>>>>> to an expression (tls_desc).
>>>>>>> It meant there wasn’t really any opportunity to resume any existing
>>>>>>> code.
>>>>>> Understood. Though your first patch mentions that you're trying to
>>>>>> describe partial preservation "around TLS calls". Presumably those are
>>>>>> represented as normal insns, not call_insn.
>>>>>>
>>>>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>>>>> to itself using whatever mode is wide enough to cover the neon part of
>>>>>> the register.
>>>>>>
>>>>>> That should tell the generic parts of the compiler that you're just
>>>>>> clobbering the upper part and at least in theory you can implement in
>>>>>> the aarch64 backend and the rest of the compiler should "just work"
>>>>>> because that's the existing semantics of a subreg store.
>>>>>>
>>>>>> The only worry would be if a pass tried to get overly smart and
>>>>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>>>>> wrong given the semantics of a partial store.
>>>>>>
>>>>>
>>>>> So, the instead of using clobber_high(reg X), to use set(reg X, reg X).
>>>>> It’s something we considered, and then dismissed.
>>>>>
>>>>> The problem then is you are now using SET semantics on those registers,
>>>>> and it
>>>>> would make the register live around the function, which might not be the
>>>>> case.
>>>>> Whereas clobber semantics will just make the register dead - which is
>>>>> exactly
>>>>> what we want (but only conditionally).
>>>> ?!? A set of the subreg is the *exact* semantics you want. It says the
>>>> low part is preserved while the upper part is clobbered across the TLS
>>>> insns.
>>>>
>>>> jeff
>>>
>>> Consider where the TLS call is inside a loop. The compiler would normally
>>> want
>>> to hoist that out of the loop. By adding a set(x,x) into the parallel of
>>> the tls_desc we
>>> are now making x live across the loop, x is dependant on the value from the
>>> previous
>>> iteration, and the tls_desc can no longer be hoisted.
>> Hmm. I think I see the problem you're trying to point out. Let me
>> restate it and see if you agree.
>>
>> The low subreg set does clearly indicate the upper part of the SVE
>> register is clobbered. The problem is from a liveness standpoint the
>> compiler is considering the low part live, even though it's a self-set.
>>
>
> Yes.
>
>> In fact, if that is the case, then a single TLS call (independent of a
>> loop) would make the low part of the register globally live. This
>> should be testable. Include one of these low part self sets on the
>> existing TLS calls and compile a little test function and let's look at
>> the liveness data.
>>
>
>
> Ok, so I’ve put
> (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
> (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
> etc up to V31 in the UNSPEC_TLSDESC pattern in aarch64.md, using up all the
> neon registers.
>
> In an ideal world, this change would do nothing as neon reg size is TI.
>
> First I compiled up sve_tls_preserve_1.c (Test1 below)
>
> Without the SET changes, gcc does not backup any neon registers before the
> tlsdesccall (good).
> With the SET changes, gcc backs up all the neon registers (not ideal).
>
> Without the SET changes, dump logs give:
>
> cse1:
> ;; regs ever live 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 66 [cc]
>
> cprop1:
> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; lr def 0 [x0] 30 [x30] 32 [v0] 66 [cc] 79 80 81 82 84 85 86 87 88 89
> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; live gen 0 [x0] 32 [v0] 78 79 80 81 82 83 84 85 86 87 88 89
> ;; live kill 30 [x30] 66 [cc]
> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>
> reload:
> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1]
> 34 [v2] 35 [v3] 36 [v4] 66 [cc]
>
> With the set changes:
>
> cse1:
> ;; regs ever live 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36
> [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45
> [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21]
> 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62
> [v30] 63 [v31] 66 [cc]
>
> cprop1:
> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22]
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63
> [v31] 64 [sfp] 65 [ap]
> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22]
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63
> [v31] 64 [sfp] 65 [ap]
> ;; lr def 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22]
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63
> [v31] 66 [cc] 79 80 81 82 84 85 86 87 88 89
> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37
> [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
> ;; live gen 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6]
> 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15]
> 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56
> [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 78 79 80
> 81 82 83 84 85 86 87 88 89
> ;; live kill 30 [x30] 66 [cc]
> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>
> reload:
> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1]
> 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43
> [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19]
> 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60
> [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>
>
> Is there any extra dump data you think would be useful?
>
>
> For completeness, I also ran through some more tests…..
>
>
>
> Second I tried putting the tls call inside a simple for loop (Test2 below)
>
> Both before and after the SET changes, gcc hoisted the tlsdesccall call out
> of the loop.
> This has already happened before the expand pass.
> (and as before, with the SET changes, gcc backs up all the neon registers
> before the tlsdesccall).
>
>
> Third, I made the tls variable an array, and accessed elements from it in the
> loop. (Test3 below)
>
> Both before and after the SET changes, gcc fails to hoist the tlsdesccall
> call.
> That’s a missed optimisation chance - I’m not sure how tricky or worthwhile
> that’d be to fix (at the
> tree level the tls load looks just like a MEM base plus index access. I guess
> you’d need something
> specific in an rtl pass to be the hoist).
> Anyway, we now have a tlsdec inside a loop, which is what I wanted.
> Without the SET changes, nothing is backed up whilst inside the loop.
> With the SET changes, gcc tries to back up the live neon registers, and ICEs
> with
> "error: unable to find a register to spill” in lra-assigns.c.
>
>
> Forth, I took the original test and made two tls accesses in a row with an
> asm volatile in-between (Test4 below).
>
> With the SET changes, gcc ICEs in the same way.
> Looking at the dump files,
>
> Without the SET changes, dump logs for the final test are similar to the very
> first test:
>
> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1]
> 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43
> [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19]
> 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60
> [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>
> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; lr def 0 [x0] 30 [x30] 32 [v0] 66 [cc] 81 84 85 86 88 89 90 91 95 96
> 97 98 99 100
> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
> ;; live gen 0 [x0] 32 [v0] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
> ;; live kill 30 [x30] 66 [cc]
> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>
>
> With the SET changes:
>
> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 31 [sp] 32 [v0]
> 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 66 [cc]
>
> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22]
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63
> [v31] 64 [sfp] 65 [ap]
> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22]
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63
> [v31] 64 [sfp] 65 [ap]
> ;; lr def 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37
> [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46
> [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22]
> 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63
> [v31] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37
> [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
> ;; live gen 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6]
> 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15]
> 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56
> [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 81 84 85
> 86 88 89 90 91 92 95 96 97 98 99 100
> ;; live kill 30 [x30] 66 [cc]
> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>
>
>
>
> Test1:
>
> __thread v4si tx;
>
> v4si foo (v4si a, v4si b, v4si c)
> {
> v4si y;
>
> y = a + tx + b + c;
>
> return y + 7;
> }
>
>
> Test2:
>
>
> __thread v4si tx;
>
> v4si foo (v4si a, v4si *b, v4si *c)
> {
> v4si y;
>
> for(int i=0; i<MAX; i++)
> y += a + tx + b[i] + c[i];
>
> return y + 7;
> }
>
>
> Test3:
>
> __thread v4si tx[MAX];
>
> v4si foo (v4si a, v4si *b, v4si *c)
> {
> v4si y;
>
> for(int i=0; i<MAX; i++)
> y += a + tx[i] + b[i] + c[i];
>
> return y + 7;
> }
>
>
> Test4:
>
> __thread v4si tx;
> __thread v4si ty;
>
> v4si foo (v4si a, v4si b, v4si c)
> {
> v4si y;
>
> y = a + tx + b + c;
>
> asm volatile ("" : : : "memory");
>
> y += a + ty + b + c;
>
> return y + 7;
> }
>
>
>>
>> Now it could be the case that various local analysis could sub-optimally
>> handle things. You mention LICM. I know our original LICM did have a
>> problem in that if it saw a use of a hard reg in a loop without seeing a
>> set of that hard reg it considered the register varying within the loop.
>> I have no idea if we carried that forward when the loop code was
>> rewritten (when I looked at this it was circa 1992).
>>
>>
>>>
>>> Or consider a stream of code containing two tls_desc calls (ok, the
>>> compiler might
>>> optimise one of the tls calls away, but this approach should be reusable
>>> for other exprs).
>>> Between the two set(x,x)’s x is considered live so the register allocator
>>> can’t use that
>>> register.
>>> Given that we are applying this to all the neon registers, the register
>>> allocator now throws
>>> an ICE because it can’t find any free hard neon registers to use.
>> Given your statements it sounds like the liveness infrastructure is
>> making those neon regs globally live when it sees the low part subreg
>> self-set. Let's confirm that one way or the other and see where it
>> takes us.
>>
>> Jeff
>