On 04/13/12 03:46, Richard Guenther wrote:
On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez <al...@redhat.com> wrote:
Richard. Thanks so much for reviewing and providing an alternative
approach, which AFAICT provides superior results.
A similar effect could be obtained by keeping a flag whether we entered the
path that stored the value before the transform. Thus, do
lsm = g2; // technically not needed, if you also handle loads
flag = false;
for (...)
{
if (g1)
{
if (flag)
g2 = lsm;
}
else
{
lsm = 0;
flag = true;
}
}
if (flag)
g2 = lsm;
I have implemented this in the attached patch, and it seems to be
generating better code than my other if-changed approach. It also
avoids generating unnecessary loads that may trap.
For the original testcase:
int g_1 = 1;
int g_2 = 0;
int func_1(void)
{
int l;
for (l = 0; l < 1234; l++)
{
if (g_1)
return l;
else
g_2 = 0;
}
return 999;
}
After all optimizations are done, we are now generating the following
for the flags approach:
new:
func_1:
movl g_1(%rip), %edx
xorl %eax, %eax
testl %edx, %edx
je .L5
rep
ret
.L5:
movl $0, g_2(%rip)
movw $999, %ax
ret
This is significantly better than what unmodified trunk generates:
func_1:
movl g_1(%rip), %edx
movl g_2(%rip), %eax
testl %edx, %edx
je .L5
movl %eax, g_2(%rip)
xorl %eax, %eax
ret
.L5:
movl $0, g_2(%rip)
movl $999, %eax
ret
And in other less contrived cases, we generate correct code that avoids
writes on code paths that did not originally have them. For example, in
Hans' register promotion example:
int count;
struct obj {
int data;
struct obj *next;
} *q;
void func()
{
struct obj *p;
for (p = q; p; p = p->next)
if (p->data > 0)
count++;
}
Under the new memory model we should avoid promoting "count" to a
register and unilaterally writing to it upon exiting the loop.
With the attached patch, we now generate:
func:
.LFB0:
movq q(%rip), %rax
xorl %ecx, %ecx
movl count(%rip), %edx
testq %rax, %rax
je .L1
.L9:
movl (%rax), %esi
testl %esi, %esi
jle .L3
addl $1, %edx
movl $1, %ecx
.L3:
movq 8(%rax), %rax
testq %rax, %rax
jne .L9
testb %cl, %cl
jne .L12
.L1:
rep
ret
.L12:
movl %edx, count(%rip)
ret
Which is as expected.
I don't understand your suggestion of:
> lsm = g2; // technically not needed, if you also handle loads
that would allow to extend this to cover the cases where the access may
eventually trap (if you omit the load before the loop).
Can't I just remove the lsm=g2? What's this about also handling loads?
Not sure which is more efficient (you can eventually combine flag variables
for different store locations, combining the if-changed compare is not so easy
I guess).
So far, I see better code generated with the flags approach, so...
1. No pass can figure out the equality (or inequality) of g_2_lsm and g_2.
In the PR, Richard mentions that both FRE/PRE can figure this out, but they
are not run after store motion. DOM should also be able to, but apparently
does not :(. Tips?
Well. Schedule FRE after loop opts ...
DOM is not prepared to handle loads/stores with differing VOPs - it was never
updated really after the alias-improvements merge.
2. The GIMPLE_CONDs I create in this patch are currently causing problems
with complex floats/doubles. do_jump is complaining that it can't compare
them, so I assume it is too late (in tree-ssa-loop-im.c) to compare complex
values since complex lowering has already happened (??). Is there a more
canonical way of creating a GIMPLE_COND that may contain complex floats at
this stage?
As you are handling redundant stores you want a bitwise comparison anyway,
but yes, complex compares need to be lowered. But as said, you want a
bitwise comparison, not a value-comparison. You can try using
I can ignore all this. I have implemented both alternatives, with a
target hook as suggested, but I'm thinking of removing it altogether and
just leaving the flags approach.
Few comments on the patch itself.
+ new_bb = split_edge (ex);
+ then_bb = create_empty_bb (new_bb);
+ if (current_loops && new_bb->loop_father)
+ add_bb_to_loop (then_bb, new_bb->loop_father);
+ gsi = gsi_start_bb (new_bb);
+ t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
Hmm, why do you need to re-load the value? Don't you have both
the value as it was loaded before the loop and the new value (the lsm tmp reg)
already? Why not simply remember the SSA name used? Ah, because
store motion also uses make_rename_temp. If you don't want to change
that (which would be good but inconvenient because of the required PHI
insertions), I'd change the code that performs the load in the pre-header to do
SSA_NAME = ref->mem;
rename-lsm-tmp = SSA_NAME;
so you can remember SSA_NAME and re-use it here. That probably solves
your optimization issue, too.
Fixed.
+ for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi);
gsi_next (&gsi))
+ {
+ gimple phi = gsi_stmt (gsi);
+ unsigned i;
+
+ for (i = 0; i < gimple_phi_num_args (phi); i++)
+ if (gimple_phi_arg_edge (phi, i)->src == new_bb)
+ {
+ tree arg = gimple_phi_arg_def (phi, i);
+ add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
+ update_stmt (phi);
+ }
+ }
Hmm. This is of course doing unnecessary work in case there are multiple
moved stores. In fact, splitting the exit edge is only necessary if there
is more than one predecessor. Otherwise you can simply split the exit block
after inserting the new conditional after its labels.
Is this still the case with the current patch? If so, I'm a bit
confused as to what you want here.
+ if (opts->x_flag_tm)
+ {
+ if (opts->x_flag_non_call_exceptions)
+ sorry ("transactional memory is not supported with non-call exceptions");
+
+ set_param_value ("allow-load-data-races", 0,
+ opts->x_param_values, opts_set->x_param_values);
+ set_param_value ("allow-store-data-races", 0,
+ opts->x_param_values, opts_set->x_param_values);
Unless the user explicitly set those params? Which means using params
wasn't a very good idea ... ideally you'd diagnose an error when the user
would mix -fgnu-tm and -fallow-store-data-races. So - can we remove
allow-load-data-races (we are never going to implement that) and make
allow-store-data-races a proper -f flag please?
Sorry, that was not meant for public consumption. I have rewritten this
to take the param value as is, and in the case of flag_tm, to restrict
the optimization only when the loop is inside an actual transaction
(see the changes to trans-mem.c, because I know you'll hate
bb_in_transaction() :-)).
+ }
And that should be a separate patch
I can certainly reimplement --param=allow-{load,store}-data-races as
proper -f flags. I don't care either way. But I remember Ian Taylor
having a different opinion. If y'all agree, I can implement whatever.
It would be nice if you could think about LIM/SM for possibly trapping
loads/stores while you are doing this work.
We are no longer adding additional trapping loads (as with the
if-changed approach). Is this something else you'd like me to
concentrate on?
Please take a look at the attached patch. I tested the flags
alternative on a full bootstrap, and there's only one regression which I
will look at, but I'd like to know if the current WIP is aligned with
what you had in mind.
Thanks again.
Aldy
* trans-mem.c (bb_in_transaction_1): New.
(bb_in_transaction): New.
* gimple.h (bb_in_transaction): Protoize.
* Makefile.in (tree-ssa-loop-im.o): Depend on TARGET_H.
* doc/tm.texi: Regenerate.
* tree-ssa-loop-im.c (execute_sm_if_changed): New.
(execute_sm_if_changed_flag_set): New.
(execute_sm): Guard store motion with a conditional; do not
generate data races unless requested.
* targhooks.c (default_can_compare_bitwise_p): New.
* targhooks.h (default_can_compare_bitwise_p): Protoize.
* target.def (can_compare_bitwise_p): New hook.
* opts.c (finish_options): Do not allow store or load data races
with -fgnu-tm.
Index: trans-mem.c
===================================================================
--- trans-mem.c (revision 186108)
+++ trans-mem.c (working copy)
@@ -135,6 +135,47 @@
*/
+/* Helper function for bb_in_transaction. Returns 1 if the first
+ statement in BB is in a transaction, 0 if it is not, and -1 if BB
+ has no statements. */
+
+static int
+bb_in_transaction_1 (basic_block bb)
+{
+ gimple t;
+
+ t = gimple_seq_first_stmt (bb_seq (bb));
+ if (t)
+ return gimple_in_transaction (t);
+ return -1;
+}
+
+/* Return true if BB is in a transaction. This can only be called
+ after transaction bits have been calculated with
+ compute_transaction_bits(). */
+
+bool
+bb_in_transaction (basic_block bb)
+{
+ edge e;
+ edge_iterator ei;
+ int res;
+
+ res = bb_in_transaction_1 (bb);
+ if (res != -1)
+ return (bool) res;
+
+ /* ?? Perhaps we should cache this somewhere in the BB, or are
+ multiple levels of empty BBs common? ?? */
+ FOR_EACH_EDGE (e, ei, bb->preds)
+ {
+ int res = bb_in_transaction_1 (e->src);
+ if (res != -1)
+ return (bool) res;
+ }
+ return false;
+}
+
/* Return the attributes we want to examine for X, or NULL if it's not
something we examine. We look at function types, but allow pointers
to function types and function decls and peek through. */
Index: tree-ssa-loop-im.c
===================================================================
--- tree-ssa-loop-im.c (revision 186108)
+++ tree-ssa-loop-im.c (working copy)
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.
#include "tree-affine.h"
#include "pointer-set.h"
#include "tree-ssa-propagate.h"
+#include "target.h"
/* TODO: Support for predicated code motion. I.e.
@@ -52,7 +53,7 @@ along with GCC; see the file COPYING3.
}
}
- Where COND and INV are is invariants, but evaluating INV may trap or be
+ Where COND and INV are invariants, but evaluating INV may trap or be
invalid from some other reason if !COND. This may be transformed to
if (cond)
@@ -1956,6 +1957,161 @@ get_lsm_tmp_name (tree ref, unsigned n)
return lsm_tmp_name;
}
+/* Helper function for execute_sm. Emit code to store TMP_VAR into
+ MEM along edge EX.
+
+ The store is only done if MEM has changed. We do this so no
+ changes to MEM occur on code paths that did not originally store
+ into it.
+
+ The common case for execute_sm will transform:
+
+ for (...) {
+ if (foo)
+ stuff;
+ else
+ MEM = TMP_VAR;
+ }
+
+ into:
+
+ lsm = MEM;
+ for (...) {
+ if (foo)
+ stuff;
+ else
+ lsm = TMP_VAR;
+ }
+ MEM = lsm;
+
+ This function will either generate the following when FLAG is present:
+
+ lsm = MEM;
+ flag = false;
+ ...
+ if (foo)
+ stuff;
+ else {
+ lsm = TMP_VAR;
+ flag = true;
+ }
+ if (flag) <--
+ MEM = lsm; <--
+
+ or the following when FLAG is not set:
+
+ lsm = MEM;
+ for (...) {
+ if (foo)
+ stuff;
+ else
+ lsm = TMP_VAR;
+ }
+ if (lsm != MEM) <--
+ MEM = lsm; <--
+*/
+
+static void
+execute_sm_if_changed (edge ex, tree mem, tree orig_mem_ssa, tree tmp_var,
+ tree flag)
+{
+ basic_block new_bb, then_bb, old_dest = ex->dest;
+ bool single_exit_p = single_pred_p (ex->dest);
+ edge then_old_edge;
+ gimple_stmt_iterator gsi;
+ gimple stmt;
+
+ if (single_exit_p)
+ ex = split_block_after_labels (old_dest);
+
+ old_dest = ex->dest;
+ new_bb = split_edge (ex);
+ then_bb = create_empty_bb (new_bb);
+ if (current_loops && new_bb->loop_father)
+ add_bb_to_loop (then_bb, new_bb->loop_father);
+
+ gsi = gsi_start_bb (new_bb);
+ if (flag)
+ stmt = gimple_build_cond (NE_EXPR, flag, boolean_false_node,
+ NULL_TREE, NULL_TREE);
+ else
+ stmt = gimple_build_cond (NE_EXPR, orig_mem_ssa, tmp_var,
+ NULL_TREE, NULL_TREE);
+ gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+ gsi = gsi_start_bb (then_bb);
+ /* Insert actual store. */
+ stmt = gimple_build_assign (unshare_expr (mem), tmp_var);
+ gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+ make_edge (new_bb, then_bb, EDGE_TRUE_VALUE);
+ make_edge (new_bb, old_dest, EDGE_FALSE_VALUE);
+ then_old_edge = make_edge (then_bb, old_dest, EDGE_FALLTHRU);
+ set_immediate_dominator (CDI_DOMINATORS, then_bb, new_bb);
+
+ if (!single_exit_p)
+ for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi); gsi_next (&gsi))
+ {
+ gimple phi = gsi_stmt (gsi);
+ unsigned i;
+
+ for (i = 0; i < gimple_phi_num_args (phi); i++)
+ if (gimple_phi_arg_edge (phi, i)->src == new_bb)
+ {
+ tree arg = gimple_phi_arg_def (phi, i);
+ add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
+ update_stmt (phi);
+ }
+ }
+ /* Remove the original fall through edge. This was the
+ single_succ_edge (new_bb). */
+ EDGE_SUCC (new_bb, 0)->flags &= ~EDGE_FALLTHRU;
+}
+
+/* Helper function for execute_sm. On every path that sets REF, set
+ an appropriate flag indicating the store. */
+
+static tree
+execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
+{
+ unsigned i;
+ mem_ref_loc_p loc;
+ tree flag;
+ VEC (mem_ref_loc_p, heap) *locs = NULL;
+ char *str = get_lsm_tmp_name (ref->mem, ~0);
+
+ lsm_tmp_name_add ("_flag");
+ flag = make_rename_temp (boolean_type_node, str);
+ get_all_locs_in_loop (loop, ref, &locs);
+ FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+ {
+ gimple_stmt_iterator gsi;
+ gimple stmt;
+
+ gsi = gsi_start_bb (gimple_bb (loc->stmt));
+ for (; !gsi_end_p (gsi); gsi_next (&gsi))
+ if (gsi_stmt (gsi) == loc->stmt)
+ {
+ stmt = gimple_build_assign (flag, boolean_true_node);
+ gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+ break;
+ }
+ }
+ VEC_free (mem_ref_loc_p, heap, locs);
+ return flag;
+}
+
+/* Store motion types for execute_sm below. */
+enum store_type
+ {
+ /* Traditional single-thread. */
+ STORE_SINGLE_THREAD,
+ /* Store only if value changed. */
+ STORE_IF_CHANGED,
+ /* Store only if flag changed. */
+ STORE_IF_CHANGED_FLAG
+ };
+
/* Executes store motion of memory reference REF from LOOP.
Exits from the LOOP are stored in EXITS. The initialization of the
temporary variable is put to the preheader of the loop, and assignments
@@ -1964,12 +2120,13 @@ get_lsm_tmp_name (tree ref, unsigned n)
static void
execute_sm (struct loop *loop, VEC (edge, heap) *exits, mem_ref_p ref)
{
- tree tmp_var;
+ tree tmp_var, mem_ssa, store_flag;
unsigned i;
- gimple load, store;
+ gimple load;
struct fmt_data fmt_data;
- edge ex;
+ edge ex, latch_edge;
struct lim_aux_data *lim_data;
+ enum store_type store;
if (dump_file && (dump_flags & TDF_DETAILS))
{
@@ -1978,6 +2135,8 @@ execute_sm (struct loop *loop, VEC (edge
fprintf (dump_file, " from loop %d\n", loop->num);
}
+ mem_ssa = create_tmp_reg (TREE_TYPE (unshare_expr (ref->mem)), NULL);
+ mem_ssa = make_ssa_name (mem_ssa, NULL);
tmp_var = make_rename_temp (TREE_TYPE (ref->mem),
get_lsm_tmp_name (ref->mem, ~0));
@@ -1985,22 +2144,66 @@ execute_sm (struct loop *loop, VEC (edge
fmt_data.orig_loop = loop;
for_each_index (&ref->mem, force_move_till, &fmt_data);
+ if ((flag_tm && bb_in_transaction (loop_preheader_edge (loop)->src))
+ || !PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES))
+ {
+ if (targetm.can_compare_bitwise_p (TYPE_MODE (TREE_TYPE (ref->mem))))
+ store = STORE_IF_CHANGED;
+ else
+ store = STORE_IF_CHANGED_FLAG;
+ }
+ else
+ store = STORE_SINGLE_THREAD;
+
+#if 1
+ /* ?? testing ?? */
+ store = STORE_IF_CHANGED_FLAG;
+#endif
+
+ if (store == STORE_IF_CHANGED_FLAG)
+ store_flag = execute_sm_if_changed_flag_set (loop, ref);
+
rewrite_mem_refs (loop, ref, tmp_var);
- /* Emit the load & stores. */
- load = gimple_build_assign (tmp_var, unshare_expr (ref->mem));
+
+ /* Emit the load code into the latch, so that we are sure it will
+ be processed after all dependencies. */
+ latch_edge = loop_latch_edge (loop);
+ load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
lim_data = init_lim_data (load);
lim_data->max_loop = loop;
lim_data->tgt_loop = loop;
+ gsi_insert_on_edge (latch_edge, load);
+ load = gimple_build_assign (tmp_var, mem_ssa);
+ lim_data = init_lim_data (load);
+ lim_data->max_loop = loop;
+ lim_data->tgt_loop = loop;
+ gsi_insert_on_edge (latch_edge, load);
+ if (store == STORE_IF_CHANGED_FLAG)
+ {
+ load = gimple_build_assign (store_flag, boolean_false_node);
+ lim_data = init_lim_data (load);
+ lim_data->max_loop = loop;
+ lim_data->tgt_loop = loop;
+ gsi_insert_on_edge (latch_edge, load);
+ }
- /* Put this into the latch, so that we are sure it will be processed after
- all dependencies. */
- gsi_insert_on_edge (loop_latch_edge (loop), load);
-
+ /* Sink the store to every exit from the loop. */
FOR_EACH_VEC_ELT (edge, exits, i, ex)
{
- store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
- gsi_insert_on_edge (ex, store);
+ if (store == STORE_SINGLE_THREAD)
+ {
+ gimple store;
+ store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
+ gsi_insert_on_edge (ex, store);
+ }
+ else if (store == STORE_IF_CHANGED)
+ execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var, NULL);
+ else if (store == STORE_IF_CHANGED_FLAG)
+ execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
+ store_flag);
+ else
+ gcc_unreachable ();
}
}
Index: target.def
===================================================================
--- target.def (revision 186108)
+++ target.def (working copy)
@@ -2667,6 +2667,13 @@ DEFHOOK
enum unwind_info_type, (void),
default_debug_unwind_info)
+DEFHOOK
+(can_compare_bitwise_p,
+ "This hook should return true if the target can efficiently compare two\
+ registers in the given mode for bit equality.",
+ bool, (enum machine_mode),
+ default_can_compare_bitwise_p)
+
DEFHOOKPOD
(atomic_test_and_set_trueval,
"This value should be set if the result written by\
Index: gimple.h
===================================================================
--- gimple.h (revision 186108)
+++ gimple.h (working copy)
@@ -1122,6 +1122,7 @@ extern tree omp_reduction_init (tree, tr
/* In trans-mem.c. */
extern void diagnose_tm_safe_errors (tree);
extern void compute_transaction_bits (void);
+extern bool bb_in_transaction (basic_block);
/* In tree-nested.c. */
extern void lower_nested_functions (tree);
Index: Makefile.in
===================================================================
--- Makefile.in (revision 186108)
+++ Makefile.in (working copy)
@@ -2532,7 +2532,7 @@ tree-ssa-loop-im.o : tree-ssa-loop-im.c
$(PARAMS_H) output.h $(DIAGNOSTIC_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
$(TREE_DUMP_H) $(TREE_PASS_H) $(FLAGS_H) $(BASIC_BLOCK_H) \
pointer-set.h tree-affine.h tree-ssa-propagate.h gimple-pretty-print.h \
- tree-pretty-print.h
+ tree-pretty-print.h $(TARGET_H)
tree-ssa-math-opts.o : tree-ssa-math-opts.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
$(TM_H) $(FLAGS_H) $(TREE_H) $(TREE_FLOW_H) $(TIMEVAR_H) \
$(TREE_PASS_H) alloc-pool.h $(BASIC_BLOCK_H) $(TARGET_H) \
Index: doc/tm.texi
===================================================================
--- doc/tm.texi (revision 186108)
+++ doc/tm.texi (working copy)
@@ -9471,6 +9471,10 @@ how you define @code{DWARF2_FRAME_INFO}.
@end defmac
@deftypefn {Target Hook} {enum unwind_info_type} TARGET_DEBUG_UNWIND_INFO (void)
+
+@deftypefn {Target Hook} bool TARGET_CAN_COMPARE_BITWISE_P (enum @var{machine_mode})
+This hook should return true if the target can efficiently compare two registers in the given mode for bit equality.
+@end deftypefn
This hook defines the mechanism that will be used for describing frame
unwind information to the debugger. Normally the hook will return
@code{UI_DWARF2} if DWARF 2 debug information is enabled, and
Index: targhooks.c
===================================================================
--- targhooks.c (revision 186108)
+++ targhooks.c (working copy)
@@ -1331,6 +1331,15 @@ default_debug_unwind_info (void)
return UI_NONE;
}
+/* This hook returns true if the target can efficiently compare
+ two registers in the given mode for bit equality. */
+
+bool
+default_can_compare_bitwise_p (enum machine_mode mode)
+{
+ return SCALAR_INT_MODE_P (mode);
+}
+
/* To be used by targets where reg_raw_mode doesn't return the right
mode for registers used in apply_builtin_return and apply_builtin_arg. */
Index: targhooks.h
===================================================================
--- targhooks.h (revision 186108)
+++ targhooks.h (working copy)
@@ -168,6 +168,8 @@ extern unsigned char default_class_max_n
extern enum unwind_info_type default_debug_unwind_info (void);
+extern bool default_can_compare_bitwise_p (enum machine_mode);
+
extern int default_label_align_after_barrier_max_skip (rtx);
extern int default_loop_align_max_skip (rtx);
extern int default_label_align_max_skip (rtx);