On Sat, Jun 27, 2020 at 4:52 PM Marc Glisse <marc.gli...@inria.fr> wrote:
>
> On Fri, 26 Jun 2020, Jeff Law via Gcc-patches wrote:
>
> > In theory yes, but there are cases where paths converge (like you've
> > shown) where you may have evaluated to a constant on the paths, but
> > it's not a constant at the convergence point.  You have to be very
> > careful using b_c_p like this and it's been a regular source of
> > kernel bugs.
> >
> >
> > I'd recommend looking at the .ssa dump and walking forward from there
> > if the .ssa dump looks correct.
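> >
> > (Concretely, one way to get those dumps; the file name is
> > illustrative and the dump numbers vary by GCC version:
> >
> >     gcc -O2 -fdump-tree-all test.c
> >
> > which writes per-pass dumps such as test.c.105t.mergephi2.)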
>
> Here is the last dump before thread1 (105t.mergephi2). I don't see
> anything incorrect in it.
>
> ledtrig_cpu (_Bool is_active)
> {
>    int old;
>    int iftmp.0_1;
>    int _5;
>
>    <bb 2> [local count: 1073741824]:
>    if (is_active_2(D) != 0)
>      goto <bb 4>; [50.00%]
>    else
>      goto <bb 3>; [50.00%]
>
>    <bb 3> [local count: 536870913]:
>
>    <bb 4> [local count: 1073741824]:
>    # iftmp.0_1 = PHI <1(2), -1(3)>
>    _5 = __builtin_constant_p (iftmp.0_1);
>    if (_5 != 0)
>      goto <bb 5>; [50.00%]
>    else
>      goto <bb 8>; [50.00%]
>
>    <bb 5> [local count: 536870913]:
>    if (iftmp.0_1 >= -128)
>      goto <bb 6>; [50.00%]
>    else
>      goto <bb 8>; [50.00%]
>
>    <bb 6> [local count: 268435456]:
>    if (iftmp.0_1 <= 127)
>      goto <bb 7>; [34.00%]
>    else
>      goto <bb 8>; [66.00%]
>
>    <bb 7> [local count: 91268056]:
>    __asm__ __volatile__("asi %0,%1
> " : "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "i" iftmp.0_1, "Q" 
> MEM[(int *)&num_active_cpus] : "memory", "cc");
>    goto <bb 9>; [100.00%]
>
>    <bb 8> [local count: 982473769]:
>    __asm__ __volatile__("laa %0,%2,%1
> " : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "d" 
> iftmp.0_1, "Q" MEM[(int *)&num_active_cpus] : "memory", "cc");
>
>    <bb 9> [local count: 1073741824]:
>    return;
>
> }
>
> There is a single _b_c_p, the immediate asm argument is exactly the
> argument of _b_c_p, and it is in the branch protected by _b_c_p.
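>
> (For reference, a reduced C sketch of the kind of source that produces
> this dump; the shape matches the dump above, but the helper body and
> the exact constraints are reconstructions, not the real kernel code:
>
>    static int num_active_cpus;
>
>    /* Pick the s390 "asi" immediate form only when val is a
>       compile-time constant that fits a signed 8-bit immediate;
>       otherwise fall back to the "laa" register form.  */
>    static inline void atomic_add(int val, int *ptr)
>    {
>            int old;
>
>            if (__builtin_constant_p(val) && val >= -128 && val <= 127)
>                    asm volatile("asi %0,%1\n"
>                                 : [ptr] "+Q" (*ptr)
>                                 : [val] "i" (val)
>                                 : "memory", "cc");
>            else
>                    asm volatile("laa %0,%2,%1\n"
>                                 : [old] "=d" (old), [ptr] "+Q" (*ptr)
>                                 : [val] "d" (val)
>                                 : "memory", "cc");
>    }
>
>    void ledtrig_cpu(_Bool is_active)
>    {
>            atomic_add(is_active ? 1 : -1, &num_active_cpus);
>    }
>
> The "i" constraint is what makes a non-constant operand a problem in
> the threaded dump below.)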
>
> Now the thread1 dump, for comparison
>
> ledtrig_cpu (_Bool is_active)
> {
>    int old;
>    int iftmp.0_4;
>    int iftmp.0_6;
>    int _7;
>    int _12;
>    int iftmp.0_13;
>    int iftmp.0_14;
>
>    <bb 2> [local count: 1073741824]:
>    if (is_active_2(D) != 0)
>      goto <bb 3>; [50.00%]
>    else
>      goto <bb 4>; [50.00%]
>
>    <bb 3> [local count: 536870912]:
>    # iftmp.0_6 = PHI <1(2)>
>    _7 = __builtin_constant_p (iftmp.0_6);
>    if (_7 != 0)
>      goto <bb 6>; [50.00%]
>    else
>      goto <bb 8>; [50.00%]
>
>    <bb 4> [local count: 536870912]:
>    # iftmp.0_4 = PHI <-1(2)>
>    _12 = __builtin_constant_p (iftmp.0_4);
>    if (_12 != 0)
>      goto <bb 5>; [50.00%]
>    else
>      goto <bb 8>; [50.00%]
>
>    <bb 5> [local count: 268435456]:
>    if (iftmp.0_4 >= -128)
>      goto <bb 7>; [20.00%]
>    else
>      goto <bb 8>; [80.00%]
>
>    <bb 6> [local count: 214748364]:
>    if (iftmp.0_6 <= 127)
>      goto <bb 7>; [12.00%]
>    else
>      goto <bb 8>; [88.00%]
>
>    <bb 7> [local count: 91268056]:
>    # iftmp.0_13 = PHI <iftmp.0_6(6), iftmp.0_4(5)>
>    __asm__ __volatile__("asi %0,%1
> " : "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "i" iftmp.0_13, "Q" 
> MEM[(int *)&num_active_cpus] : "memory", "cc");
>    goto <bb 9>; [100.00%]
>
>    <bb 8> [local count: 982473769]:
>    # iftmp.0_14 = PHI <iftmp.0_4(5), iftmp.0_4(4), iftmp.0_6(6), iftmp.0_6(3)>
>    __asm__ __volatile__("laa %0,%2,%1
> " : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)&num_active_cpus] : "val" "d" 
> iftmp.0_14, "Q" MEM[(int *)&num_active_cpus] : "memory", "cc");
>
>    <bb 9> [local count: 1073741824]:
>    return;
>
> }
>
> Thread1 decides to separate the paths is_active and !is_active
> (surprisingly, for one path it optimizes out the comparison <= 127 and
> for the other the comparison >= -128, while it could have optimized out
> both comparisons on both paths). And it decides to converge after the
> comparisons, but before the asm.
>
> What the pass did does seem to hurt. It looks like, if we duplicate
> _b_c_p, we may need to duplicate far enough to include all the blocks
> dominated by _b_c_p==true (including the asm, here). Otherwise, any
> _b_c_p can be optimized to true, because for a boolean b,
>
>    b  is the same as  b ? true : false,
>
> so __builtin_constant_p(b ? true : false) would be the same as
> b ? __builtin_constant_p(true) : __builtin_constant_p(false), i.e. true.
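>
> (A minimal illustration of that hazard, separate from the testcase;
> the function name is made up:
>
>    int not_really_constant(_Bool b)
>    {
>            int v = b ? 1 : -1;     /* PHI <1(then), -1(else)> */
>            /* If threading duplicates this call onto both arms, each
>               copy sees a literal and may fold to 1, even though v is
>               not a compile-time constant at the merge point.  */
>            return __builtin_constant_p(v);
>    }
>
> That per-path folding is exactly the
> b ? _b_c_p(true) : _b_c_p(false) rewrite above.)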
>
> It is too bad we don't have any optimization pass using ranges between
> IPA and thread1; that would have gotten rid of the comparisons, and
> hence of the temptation to thread. Adding always_inline on atomic_add
> (or flatten on the caller) does help: EVRP removes the comparisons.
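>
> (That is, on the sketch above, either of the following; both
> attributes are standard GCC ones:
>
>    static inline __attribute__((always_inline))
>    void atomic_add(int val, int *ptr) { /* as in the sketch */ }
>
>    /* or, on the caller: */
>    __attribute__((flatten))
>    void ledtrig_cpu(_Bool is_active) { /* as above */ }
>
> so that the constants 1/-1 are visible to EVRP, which then deletes the
> range checks before thread1 gets a chance to run.)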
>
> Do you see a way forward without changing what thread1 does or declaring
> the testcase as unsupported?

Most of the cases I've seen involve transforms that make _b_c_p constant
on one path and then introduce a new PHI merging two _b_c_p values,
which is then tested in a condition that does not simplify.  I'm not sure
how to fend that off (yeah, it's nearly always jump threading doing
this...), but certainly the easiest way would be to simply disallow jump
threading from duplicating _b_c_p calls.
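
A sketch of what that check could look like, assuming a helper invoked
wherever the threader validates a candidate path (the helper and its
call site are illustrative; gimple_call_builtin_p and
BUILT_IN_CONSTANT_P are existing GCC predicates):

  /* Return true if BB contains a __builtin_constant_p call, so that a
     threading path duplicating BB can be rejected.  */
  static bool
  block_contains_builtin_constant_p (basic_block bb)
  {
    for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
         !gsi_end_p (gsi); gsi_next (&gsi))
      if (gimple_call_builtin_p (gsi_stmt (gsi), BUILT_IN_CONSTANT_P))
        return true;
    return false;
  }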

Or fold _b_c_p even earlier (though I definitely saw early backwards
jump threading mess up such a case).

Richard.

> --
> Marc Glisse
