https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95654

--- Comment #13 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #11)
> My guess at this point, is that duplicating the block with VOTE_ANY has the
> effect that the JIT compiler doesn't recognize control flow divergence
> before XCHG_IDX, and fails to insert the proper barrier.

Turns out, it's not that complicated.

Before ftracer we have:
...
  <bb 4> [local count: 268435456]:
  _30 = _18 + _27;
  _31 = _18 + _28;
  _46 = .GOMP_SIMT_ENTER_ALLOC (0, 1);
  _47 = .GOMP_SIMT_LANE ();
  _48 = (int) _47;
  _49 = _30 + _48;
  if (_31 > _49)
    goto <bb 8>; [87.50%]
  else
    goto <bb 5>; [12.50%]

  <bb 8> [local count: 117440512]:
  ...
  goto <bb 5>; [100.00%]

  <bb 5> [local count: 134217728]:
  # _54 = PHI <_50(D)(4), _67(8)>
  # _34 = PHI <_49(4), _71(8)>
  _55 = _34 == 63;
  _56 = (int) _55;
  _57 = .GOMP_SIMT_VOTE_ANY (_56);
  if (_57 != 0)
    goto <bb 7>; [50.00%]
  else
    goto <bb 6>; [50.00%]

  <bb 7> [local count: 67108864]:
  _58 = .GOMP_SIMT_LAST_LANE (_56);
  _60 = .GOMP_SIMT_XCHG_IDX (_54, _58);
  _61 = _60 + 1;
  goto <bb 6>; [100.00%]

  <bb 6> [local count: 268435456]:
  # d1_6 = PHI <_61(7), d1_29(D)(5)>
  *_46 ={v} {CLOBBER};
  .GOMP_SIMT_EXIT (_46);
  if (_31 == 32)
    goto <bb 11>; [34.00%]
  else
    goto <bb 9>; [66.00%]
...

At bb4 entry, we have unified control flow (that is, all threads in the warp
execute the same code in lockstep).

That's no longer the case at bb5/bb8.  In team 0, threads 0..15 execute the
loop body (bb8), and threads 16..31 don't.  In team 1, it's the opposite.

However, at bb5 the control flow from bb4 and bb8 joins, so control flow is
once again unified.

Then VOTE_ANY is executed in bb5, with team 1 subsequently going to the block
with XCHG_IDX (bb7), and team 0 skipping straight to bb6.

After ftracer, we have:
...
  <bb 5> [local count: 16777216]:
  # _54 = PHI <_50(D)(4)>
  # _34 = PHI <_49(4)>
  _55 = _34 == 63;
  _56 = (int) _55;
  _57 = .GOMP_SIMT_VOTE_ANY (_56);
  if (_57 != 0)
    goto <bb 7>; [50.00%]
  else
    goto <bb 6>; [50.00%]

  <bb 8> [local count: 117440512]:
  ...
  _80 = _71 == 63;
  _81 = (int) _80;
  _82 = .GOMP_SIMT_VOTE_ANY (_81);
  if (_82 != 0)
    goto <bb 7>; [50.00%]
  else
    goto <bb 6>; [50.00%]
...

Now control flow is no longer unified at bb5, and consequently it is not
unified in bb7 either when XCHG_IDX is executed.  And that's the root cause of
the failure we're seeing.
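
To make the effect concrete, here is a minimal host-side C model of a
warp-wide vote that only sees the lanes which actually reach it.  This is a
sketch, not GCC or PTX code: model_vote_any, model_set_pred and the choice of
lane 31 as the one lane with a true predicate are assumptions made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WARP_SIZE 32

/* Hypothetical model of a warp-wide "vote any": only lanes whose bit is set
   in ACTIVE take part, and all of them receive the same result.  */
static bool
model_vote_any (const bool pred[WARP_SIZE], uint32_t active)
{
  assert (active != 0);  /* At least one lane has to execute the vote.  */
  for (int lane = 0; lane < WARP_SIZE; lane++)
    if (((active >> lane) & 1u) && pred[lane])
      return true;
  return false;
}

/* Hypothetical helper: make TRUE_LANE the only lane whose predicate holds
   (an assumption standing in for the "_34 == 63" test).  */
static void
model_set_pred (bool pred[WARP_SIZE], int true_lane)
{
  for (int lane = 0; lane < WARP_SIZE; lane++)
    pred[lane] = (lane == true_lane);
}
```

Under that assumption, a vote over the full mask 0xffffffff (the single
VOTE_ANY before ftracer) is true for every lane, so the whole warp enters bb7
together.  Splitting the vote into masks 0x0000ffff and 0xffff0000 (one per
duplicated VOTE_ANY) gives false for the first group and true for the second,
so only half the warp would reach XCHG_IDX.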

So, one way to handle this is to consider VOTE_ANY as a "join" to the "fork" of
ENTER_ALLOC (which means: don't duplicate, unless you duplicate the pair).

But, after reading this:
...
/* Allocate per-lane storage and begin non-uniform execution region.  */

static void
expand_GOMP_SIMT_ENTER_ALLOC (internal_fn, gcall *stmt)
...
and this:
...
/* Deallocate per-lane storage and leave non-uniform execution region.  */

static void
expand_GOMP_SIMT_EXIT (internal_fn, gcall *stmt)
...
it seems that spot is already taken.

So I wonder, isn't the problem that we do the lastprivate stuff before
SIMT_EXIT? [ Of course, after fixing that we might run into SIMT_EXIT being
duplicated by ftracer. But there at least the description of the internal-fn
would make it clear why we don't want to duplicate it. ]
