[Bug libgomp/113627] New: Detached tasks released without call to omp_fulfill_event

2024-01-26 Thread schuchart at icl dot utk.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113627

            Bug ID: 113627
           Summary: Detached tasks released without call to omp_fulfill_event
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: schuchart at icl dot utk.edu
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Created attachment 57236
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57236&action=edit
Pre-processed reproducer

We saw a problem in a benchmark OpenMP application that executes a loop in
which two tasks are created per iteration. The two tasks of an iteration are
chained through a dependency on an array element, and the first task is
detached. We found that the second (dependent) task is executed once the first
task's body has run, even though the event has not been fulfilled.

I'm attaching the preprocessed sources of a reproducer (that's as small as I
could get, apologies if it's still too complex). If the execution is correct
the program will hang because none of the events are fulfilled. If the
execution is incorrect an assert will trigger because the second task is
executed and the array value is not set properly (it is set by an outside
entity in our benchmark before the event is released). It is important to note
that the issue occurs only with more than 64 iterations when running on a
single thread. Starting from 65 iterations the dependent task is executed
without the event being fulfilled. If OMP_NUM_THREADS is set to 2 the crossover
is 128/129 iterations.
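
The two crossover points reported above (64/65 on one thread, 128/129 on two)
are consistent with a limit of 64 iterations per thread. A minimal sketch of
that arithmetic follows. It assumes the limit scales linearly with the thread
count, which is an extrapolation from these two data points only. The function
name and the constant are hypothetical and are not taken from libgomp.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: the crossover seen in the report scales linearly with the
   number of threads (64 iterations per thread).  Only 1 and 2 threads
   were actually measured. */
#define OBSERVED_ITERATIONS_PER_THREAD 64

/* Returns true if, going by the observations in the report, the dependent
   task would be expected to run before its event is fulfilled. */
static bool premature_run_expected(int iterations, int num_threads)
{
    return iterations > OBSERVED_ITERATIONS_PER_THREAD * num_threads;
}
```

For instance, premature_run_expected(65, 1) and premature_run_expected(129, 2)
are true, while 64 iterations on one thread and 128 on two are not flagged.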

To build the example:

$ gcc -g -O0 -fopenmp example_detach.c -o example

To run the example (will hang due to the event not being fulfilled):

$ OMP_NUM_THREADS=1 ./example -t 64

To run the example and trigger the assert because the dependent task is
executed prematurely:

$ OMP_NUM_THREADS=1 ./example -t 65


I'm running on an AMD Epyc Rome machine on a GNU/Linux system. I see this
behavior with a system-wide gcc 12.2.0 installed through spack and a gcc 13.2.0
I built myself using this configure:

$ ../configure --prefix=$INSTALLDIR --enable-languages=c --disable-multilib \
    --with-pic --disable-bootstrap

Please let me know if I can provide anything else.

[Bug jit/66594] jitted code should use -mtune=native

2023-05-24 Thread schuchart at icl dot utk.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66594

Joseph  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |schuchart at icl dot utk.edu

--- Comment #10 from Joseph  ---
The lack of target-specific optimizations is biting us quite a bit and manually
specifying an architecture is not really an option, unless we duplicate the
detection mechanism of GCC, which is not ideal. I am not familiar with the GCC
code base and from the discussion below it's not clear what would be needed to
advance this. If someone could provide some hints on what is missing and
how/where it could be implemented we could probably take a stab at it. 

Would it be sufficient to add a macro to the header of the targets (as
suggested here https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66594#c6) that
provide host_detect_local_cpu and ignore the ones that do not provide it? Or
would it be better to hard-code calls for the architectures that provide them,
like in the referenced patch but with architecture-specific pre-processor
guards? We mostly care about i386 and arm/aarch64 but covering all available
bases would be necessary, I guess.
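
To make the first alternative concrete, here is a rough sketch in plain C. It
is not GCC source: SKETCH_HAVE_HOST_DETECT, sketch_host_detect and
sketch_tune_flag are placeholder names, and the stub returns a fixed string
instead of querying the CPU. The architecture test at the top only stands in
for what the individual target headers would define.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the target headers: a target that has a detection routine
   defines a feature macro; other targets leave it undefined. */
#if defined(__i386__) || defined(__x86_64__) || \
    defined(__arm__) || defined(__aarch64__)
#define SKETCH_HAVE_HOST_DETECT 1
/* Hypothetical placeholder for the target's detection routine. */
static const char *sketch_host_detect(void)
{
    return "-mtune=detected";
}
#endif

/* Shared code tests only the feature macro, never the architecture. */
static const char *sketch_tune_flag(void)
{
#ifdef SKETCH_HAVE_HOST_DETECT
    return sketch_host_detect();
#else
    return NULL; /* no detection available: keep the default tuning */
#endif
}
```

The second alternative would instead move the architecture test into
sketch_tune_flag itself, so every newly supported target means another edit to
the shared code.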

[Bug target/55690] On some targets thread_fence is not a compiler barrier when memmodel != MEMMODEL_SEQ_CST

2022-03-14 Thread schuchart at icl dot utk.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55690

--- Comment #2 from Joseph  ---
Created attachment 52626
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52626&action=edit
Reproducer

I created a reproducer (see attached file or online:
https://godbolt.org/z/n76K3Ejds). 

Note that the acquire fence does not prevent GCC 7 from loading l->b ahead of
the loop. With GCC 8 and later l->b is loaded inside the loop (as it should
be).
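
For readers without access to the attachment, the pattern in question can be
sketched roughly as follows. This is not the attached reproducer: only the
pointer name l and the field b come from the comment above, while the struct
layout, the ready flag and the function name are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical layout: only the field name b appears in the report. */
struct item {
    atomic_int ready; /* assumed flag, set by a producer */
    int b;            /* plain field that must not be read too early */
};

/* Spin until the flag is set, then return b.  The acquire fence has to act
   as a compiler barrier as well, so that b is loaded only after the flag
   has been observed and is not hoisted ahead of the loop. */
static int wait_and_read(struct item *l)
{
    for (;;) {
        int ready = atomic_load_explicit(&l->ready, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        if (ready)
            return l->b; /* expected to be loaded here, inside the loop */
    }
}
```

If a compiler moved the load of l->b in front of the loop, the function could
return a value read before the producer had set the flag.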