Re: Fix PR51298, libgomp barrier failure

2011-11-29 Thread Alan Modra
On Tue, Nov 29, 2011 at 04:15:36PM +1030, Alan Modra wrote:
 On Mon, Nov 28, 2011 at 05:42:15PM -0800, Richard Henderson wrote:
  On 11/28/2011 06:02 AM, Alan Modra wrote:
   -  unsigned int ret = bar-generation  ~3;
   -  /* Do we need any barrier here or is __sync_add_and_fetch acting
   - as the needed LoadLoad barrier already?  */
   -  ret += __sync_add_and_fetch (bar-awaited, -1) == 0;
   +  unsigned int ret = __atomic_load_4 (bar-generation, 
   MEMMODEL_ACQUIRE)  ~3;
   +  ret += __atomic_add_fetch (bar-awaited, -1, MEMMODEL_ACQ_REL) == 0;
  
  Given that the read from bar-generation is ACQ, we don't need a duplicate 
  barrier from the REL on the atomic add.  I believe both can be 
  MEMMODEL_ACQUIRE
  both in order to force the ordering of these two memops, as well as force 
  these
  to happen before anything subsequent.
 
 I tried with MEMMODEL_ACQUIRE and ran force-parallel-6.exe, the test
 that seems most sensitive to barrier problems, many times, and it hangs
 occasionally in futex_wait called via gomp_team_barrier_wait_end.
 
 I believe that threads can't be allowed to exit from
 gomp_{,team_}barrier_wait without hitting a release barrier, and
 perhaps from gomp_barrier_wait_last too.  gomp_barrier_wait_start is a
 convenient point to insert the barrier, and a minimal change from the
 old code using __sync_add_and_fetch.  I can add a comment. ;-)

Committed rev 181833.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Fix PR51298, libgomp barrier failure

2011-11-28 Thread Jakub Jelinek
On Mon, Nov 28, 2011 at 10:12:09AM +1030, Alan Modra wrote:
 --- libgomp/config/linux/bar.c(revision 181718)
 +++ libgomp/config/linux/bar.c(working copy)
 @@ -36,18 +36,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
if (__builtin_expect ((state  1) != 0, 0))
  {
/* Next time we'll be awaiting TOTAL threads again.  */
 -  bar-awaited = bar-total;
 -  atomic_write_barrier ();
 -  bar-generation += 4;
 +  __atomic_store_4 (bar-awaited, bar-total, MEMMODEL_RELEASE);
 +  __atomic_add_fetch (bar-generation, 4, MEMMODEL_ACQ_REL);
futex_wake ((int *) bar-generation, INT_MAX);
  }

The above is unfortunate, bar-generation should be only modified from a
single thread at this point, but the __atomic_add_fetch will enforce there
an atomic instruction on it rather than just some kind of barrier.

Jakub


Re: Fix PR51298, libgomp barrier failure

2011-11-28 Thread Alan Modra
On Mon, Nov 28, 2011 at 09:13:27AM +0100, Jakub Jelinek wrote:
 On Mon, Nov 28, 2011 at 10:12:09AM +1030, Alan Modra wrote:
  --- libgomp/config/linux/bar.c  (revision 181718)
  +++ libgomp/config/linux/bar.c  (working copy)
  @@ -36,18 +36,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
 if (__builtin_expect ((state  1) != 0, 0))
   {
 /* Next time we'll be awaiting TOTAL threads again.  */
  -  bar-awaited = bar-total;
  -  atomic_write_barrier ();
  -  bar-generation += 4;
  +  __atomic_store_4 (bar-awaited, bar-total, MEMMODEL_RELEASE);
  +  __atomic_add_fetch (bar-generation, 4, MEMMODEL_ACQ_REL);
 futex_wake ((int *) bar-generation, INT_MAX);
   }
 
 The above is unfortunate, bar-generation should be only modified from a
 single thread at this point, but the __atomic_add_fetch will enforce there
 an atomic instruction on it rather than just some kind of barrier.

I will try without the atomic add and see what happens.

-- 
Alan Modra
Australia Development Lab, IBM


Re: Fix PR51298, libgomp barrier failure

2011-11-28 Thread Alan Modra
On Mon, Nov 28, 2011 at 08:09:01PM +1030, Alan Modra wrote:
 On Mon, Nov 28, 2011 at 09:13:27AM +0100, Jakub Jelinek wrote:
  On Mon, Nov 28, 2011 at 10:12:09AM +1030, Alan Modra wrote:
   --- libgomp/config/linux/bar.c(revision 181718)
   +++ libgomp/config/linux/bar.c(working copy)
   @@ -36,18 +36,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
  if (__builtin_expect ((state  1) != 0, 0))
{
  /* Next time we'll be awaiting TOTAL threads again.  */
   -  bar-awaited = bar-total;
   -  atomic_write_barrier ();
   -  bar-generation += 4;
   +  __atomic_store_4 (bar-awaited, bar-total, MEMMODEL_RELEASE);
   +  __atomic_add_fetch (bar-generation, 4, MEMMODEL_ACQ_REL);
  futex_wake ((int *) bar-generation, INT_MAX);
}
  
  The above is unfortunate, bar-generation should be only modified from a
  single thread at this point, but the __atomic_add_fetch will enforce there
  an atomic instruction on it rather than just some kind of barrier.
 
 I will try without the atomic add and see what happens.

Seems to be fine.  I took out the extra write barriers too, so we now
just have two MEMMODEL_RELEASE atomic store barriers replacing the two
old atomic_write_barriers, and a number of new acquire barriers.

PR libgomp/51298
* config/linux/bar.h: Use atomic rather than sync builtins.
* config/linux/bar.c: Likewise.  Add missing acquire
synchronisation on generation field.
* task.c (gomp_barrier_handle_tasks): Regain lock so as to not
double unlock.

Index: libgomp/task.c
===
--- libgomp/task.c  (revision 181718)
+++ libgomp/task.c  (working copy)
@@ -273,6 +273,7 @@ gomp_barrier_handle_tasks (gomp_barrier_
  gomp_team_barrier_done (team-barrier, state);
  gomp_mutex_unlock (team-task_lock);
  gomp_team_barrier_wake (team-barrier, 0);
+ gomp_mutex_lock (team-task_lock);
}
}
 }
Index: libgomp/config/linux/bar.h
===
--- libgomp/config/linux/bar.h  (revision 181770)
+++ libgomp/config/linux/bar.h  (working copy)
@@ -50,7 +50,7 @@ static inline void gomp_barrier_init (go
 
 static inline void gomp_barrier_reinit (gomp_barrier_t *bar, unsigned count)
 {
-  __sync_fetch_and_add (bar-awaited, count - bar-total);
+  __atomic_add_fetch (bar-awaited, count - bar-total, MEMMODEL_ACQ_REL);
   bar-total = count;
 }
 
@@ -69,10 +69,8 @@ extern void gomp_team_barrier_wake (gomp
 static inline gomp_barrier_state_t
 gomp_barrier_wait_start (gomp_barrier_t *bar)
 {
-  unsigned int ret = bar-generation  ~3;
-  /* Do we need any barrier here or is __sync_add_and_fetch acting
- as the needed LoadLoad barrier already?  */
-  ret += __sync_add_and_fetch (bar-awaited, -1) == 0;
+  unsigned int ret = __atomic_load_4 (bar-generation, MEMMODEL_ACQUIRE)  ~3;
+  ret += __atomic_add_fetch (bar-awaited, -1, MEMMODEL_ACQ_REL) == 0;
   return ret;
 }
 
Index: libgomp/config/linux/bar.c
===
--- libgomp/config/linux/bar.c  (revision 181770)
+++ libgomp/config/linux/bar.c  (working copy)
@@ -37,17 +37,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
 {
   /* Next time we'll be awaiting TOTAL threads again.  */
   bar-awaited = bar-total;
-  atomic_write_barrier ();
-  bar-generation += 4;
+  __atomic_store_4 (bar-generation, bar-generation + 4,
+   MEMMODEL_RELEASE);
   futex_wake ((int *) bar-generation, INT_MAX);
 }
   else
 {
-  unsigned int generation = state;
-
   do
-   do_wait ((int *) bar-generation, generation);
-  while (bar-generation == generation);
+   do_wait ((int *) bar-generation, state);
+  while (__atomic_load_4 (bar-generation, MEMMODEL_ACQUIRE) == state);
 }
 }
 
@@ -81,15 +79,15 @@ gomp_team_barrier_wake (gomp_barrier_t *
 void
 gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
 {
-  unsigned int generation;
+  unsigned int generation, gen;
 
   if (__builtin_expect ((state  1) != 0, 0))
 {
   /* Next time we'll be awaiting TOTAL threads again.  */
   struct gomp_thread *thr = gomp_thread ();
   struct gomp_team *team = thr-ts.team;
+
   bar-awaited = bar-total;
-  atomic_write_barrier ();
   if (__builtin_expect (team-task_count, 0))
{
  gomp_barrier_handle_tasks (state);
@@ -97,7 +95,7 @@ gomp_team_barrier_wait_end (gomp_barrier
}
   else
{
- bar-generation = state + 3;
+ __atomic_store_4 (bar-generation, state + 3, MEMMODEL_RELEASE);
  futex_wake ((int *) bar-generation, INT_MAX);
  return;
}
@@ -107,12 +105,16 @@ gomp_team_barrier_wait_end (gomp_barrier
   do
 {
   do_wait ((int *) bar-generation, generation);
-  if 

Re: Fix PR51298, libgomp barrier failure

2011-11-28 Thread Richard Henderson
On 11/28/2011 06:02 AM, Alan Modra wrote:
 -  unsigned int ret = bar-generation  ~3;
 -  /* Do we need any barrier here or is __sync_add_and_fetch acting
 - as the needed LoadLoad barrier already?  */
 -  ret += __sync_add_and_fetch (bar-awaited, -1) == 0;
 +  unsigned int ret = __atomic_load_4 (bar-generation, MEMMODEL_ACQUIRE)  
 ~3;
 +  ret += __atomic_add_fetch (bar-awaited, -1, MEMMODEL_ACQ_REL) == 0;

Given that the read from bar-generation is ACQ, we don't need a duplicate 
barrier from the REL on the atomic add.  I believe both can be MEMMODEL_ACQUIRE
both in order to force the ordering of these two memops, as well as force these
to happen before anything subsequent.

The s/_4/_n/ change needs doing.

Otherwise ok.


r~


Re: Fix PR51298, libgomp barrier failure

2011-11-28 Thread Alan Modra
On Mon, Nov 28, 2011 at 05:42:15PM -0800, Richard Henderson wrote:
 On 11/28/2011 06:02 AM, Alan Modra wrote:
  -  unsigned int ret = bar-generation  ~3;
  -  /* Do we need any barrier here or is __sync_add_and_fetch acting
  - as the needed LoadLoad barrier already?  */
  -  ret += __sync_add_and_fetch (bar-awaited, -1) == 0;
  +  unsigned int ret = __atomic_load_4 (bar-generation, MEMMODEL_ACQUIRE) 
   ~3;
  +  ret += __atomic_add_fetch (bar-awaited, -1, MEMMODEL_ACQ_REL) == 0;
 
 Given that the read from bar-generation is ACQ, we don't need a duplicate 
 barrier from the REL on the atomic add.  I believe both can be 
 MEMMODEL_ACQUIRE
 both in order to force the ordering of these two memops, as well as force 
 these
 to happen before anything subsequent.

I tried with MEMMODEL_ACQUIRE and ran force-parallel-6.exe, the test
that seems most sensitive to barrier problems, many times, and it hangs
occasionally in futex_wait called via gomp_team_barrier_wait_end.

I believe that threads can't be allowed to exit from
gomp_{,team_}barrier_wait without hitting a release barrier, and
perhaps from gomp_barrier_wait_last too.  gomp_barrier_wait_start is a
convenient point to insert the barrier, and a minimal change from the
old code using __sync_add_and_fetch.  I can add a comment. ;-)

-- 
Alan Modra
Australia Development Lab, IBM


Fix PR51298, libgomp barrier failure

2011-11-27 Thread Alan Modra
This patch cures the remaining libgomp failures I see on power7.  I'll
admit to approaching this fix with a big hammer at first, liberally
using MEMMODEL_SEQ_CST barriers all over bar.h and bar.c, then
gradually reducing the number of places and strictness of the
barriers.  This may still have a few unnecessary barriers, but I'm not
confident in removing any more.

What led me to believe that barriers were missing in the current code
was the nature of the test failures in the first place, and that the
existing code uses atomic_write_barrier in two places between writing
awaited and generation, but does not have corresponding read
barriers.  Without read barriers processors are allowed to
speculatively read generation ahead of time, leading to use of a
stale value.

PR libgomp/51298
* config/linux/bar.h: Use atomic rather than sync builtins.
* config/linux/bar.c: Likewise.  Add missing acquire and release
synchronisation on generation field.
* task.c (gomp_barrier_handle_tasks): Regain lock so as to not
double unlock.

Index: libgomp/task.c
===
--- libgomp/task.c  (revision 181718)
+++ libgomp/task.c  (working copy)
@@ -273,6 +273,7 @@ gomp_barrier_handle_tasks (gomp_barrier_
  gomp_team_barrier_done (team-barrier, state);
  gomp_mutex_unlock (team-task_lock);
  gomp_team_barrier_wake (team-barrier, 0);
+ gomp_mutex_lock (team-task_lock);
}
}
 }
Index: libgomp/config/linux/bar.h
===
--- libgomp/config/linux/bar.h  (revision 181718)
+++ libgomp/config/linux/bar.h  (working copy)
@@ -50,7 +50,7 @@ static inline void gomp_barrier_init (go
 
 static inline void gomp_barrier_reinit (gomp_barrier_t *bar, unsigned count)
 {
-  __sync_fetch_and_add (bar-awaited, count - bar-total);
+  __atomic_add_fetch (bar-awaited, count - bar-total, MEMMODEL_ACQ_REL);
   bar-total = count;
 }
 
@@ -69,10 +69,8 @@ extern void gomp_team_barrier_wake (gomp
 static inline gomp_barrier_state_t
 gomp_barrier_wait_start (gomp_barrier_t *bar)
 {
-  unsigned int ret = bar-generation  ~3;
-  /* Do we need any barrier here or is __sync_add_and_fetch acting
- as the needed LoadLoad barrier already?  */
-  ret += __sync_add_and_fetch (bar-awaited, -1) == 0;
+  unsigned int ret = __atomic_load_4 (bar-generation, MEMMODEL_ACQUIRE)  ~3;
+  ret += __atomic_add_fetch (bar-awaited, -1, MEMMODEL_ACQ_REL) == 0;
   return ret;
 }
 
Index: libgomp/config/linux/bar.c
===
--- libgomp/config/linux/bar.c  (revision 181718)
+++ libgomp/config/linux/bar.c  (working copy)
@@ -36,18 +36,15 @@ gomp_barrier_wait_end (gomp_barrier_t *b
   if (__builtin_expect ((state  1) != 0, 0))
 {
   /* Next time we'll be awaiting TOTAL threads again.  */
-  bar-awaited = bar-total;
-  atomic_write_barrier ();
-  bar-generation += 4;
+  __atomic_store_4 (bar-awaited, bar-total, MEMMODEL_RELEASE);
+  __atomic_add_fetch (bar-generation, 4, MEMMODEL_ACQ_REL);
   futex_wake ((int *) bar-generation, INT_MAX);
 }
   else
 {
-  unsigned int generation = state;
-
   do
-   do_wait ((int *) bar-generation, generation);
-  while (bar-generation == generation);
+   do_wait ((int *) bar-generation, state);
+  while (__atomic_load_4 (bar-generation, MEMMODEL_ACQUIRE) == state);
 }
 }
 
@@ -81,15 +78,15 @@ gomp_team_barrier_wake (gomp_barrier_t *
 void
 gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
 {
-  unsigned int generation;
+  unsigned int generation, gen;
 
   if (__builtin_expect ((state  1) != 0, 0))
 {
   /* Next time we'll be awaiting TOTAL threads again.  */
   struct gomp_thread *thr = gomp_thread ();
   struct gomp_team *team = thr-ts.team;
-  bar-awaited = bar-total;
-  atomic_write_barrier ();
+
+  __atomic_store_4 (bar-awaited, bar-total, MEMMODEL_RELEASE);
   if (__builtin_expect (team-task_count, 0))
{
  gomp_barrier_handle_tasks (state);
@@ -97,7 +94,7 @@ gomp_team_barrier_wait_end (gomp_barrier
}
   else
{
- bar-generation = state + 3;
+ __atomic_store_4 (bar-generation, state + 3, MEMMODEL_RELEASE);
  futex_wake ((int *) bar-generation, INT_MAX);
  return;
}
@@ -107,12 +104,14 @@ gomp_team_barrier_wait_end (gomp_barrier
   do
 {
   do_wait ((int *) bar-generation, generation);
-  if (__builtin_expect (bar-generation  1, 0))
+  gen = __atomic_load_4 (bar-generation, MEMMODEL_ACQUIRE);
+  if (__builtin_expect (gen  1, 0))
gomp_barrier_handle_tasks (state);
-  if ((bar-generation  2))
+  gen = __atomic_load_4 (bar-generation, MEMMODEL_ACQUIRE);
+  if ((gen  2) != 0)
generation |= 2;
 }
-  while