[Bug c++/82629] OpenMP 4.5 Target Region mangling problem

2017-10-27 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82629

--- Comment #4 from Thorsten Kurth  ---
Hello Richard,

Was the test case received?

Best Regards
Thorsten Kurth

[Bug c++/82629] OpenMP 4.5 Target Region mangling problem

2017-10-20 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82629

--- Comment #3 from Thorsten Kurth  ---
One more thing:

In the test case I sent, please change $(XPPFLAGS) in the main.x target rule to
$(CXXFLAGS) so that -fopenmp is also used at link time. This does not solve the
problem, but it makes the Makefile more correct ($(XPPFLAGS) was a remnant of
something I tried out earlier). Sorry for that.

Best Regards
Thorsten Kurth

[Bug c++/82629] OpenMP 4.5 Target Region mangling problem

2017-10-20 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82629

--- Comment #2 from Thorsten Kurth  ---
Created attachment 42420
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42420&action=edit
This is the test case demonstrating the problem.

Linking this code will produce:

-bash-4.2$ make main.x
g++ -O2 -std=c++11 -fopenmp -foffload=nvptx-none -c aclass.cpp -o aclass.o
g++ -O2 -std=c++11 -fopenmp -foffload=nvptx-none -c bclass.cpp -o bclass.o
g++  aclass.o bclass.o -o main.x 
lto1: fatal error: aclass.o: section _ZN6master4copyERKS_$_omp_fn$1 is missing
compilation terminated.
mkoffload: fatal error: powerpc64le-unknown-linux-gnu-accel-nvptx-none-gcc
returned 1 exit status
compilation terminated.
lto-wrapper: fatal error:
/autofs/nccs-svm1_sw/summitdev/gcc/7.1.1-20170802/bin/../libexec/gcc/powerpc64le-unknown-linux-gnu/7.1.1//accel/nvptx-none/mkoffload
returned 1 exit status
compilation terminated.
/usr/bin/ld: lto-wrapper failed
/usr/bin/sha1sum: main.x: No such file or directory
collect2: error: ld returned 1 exit status
make: *** [main.x] Error 1

But looking at the object in question shows:

-bash-4.2$ nm aclass.o
 U .TOC.
 d .offload_func_table
 d .offload_var_table
 U GOMP_parallel
 U GOMP_target_enter_exit_data
 U GOMP_target_ext
 U GOMP_teams
0350 T _ZN6aclass4copyERKS_
0250 T _ZN6aclass8allocateERKj
0130 t _ZN6master4copyERKS_._omp_fn.0
 t _ZN6master4copyERKS_._omp_fn.1
 d _ZZN6master10deallocateEvE18.omp_data_kinds.20
 b _ZZN6master10deallocateEvE18.omp_data_sizes.19
0002 d _ZZN6master4copyERKS_E18.omp_data_kinds.11
0008 d _ZZN6master4copyERKS_E18.omp_data_sizes.10
 U _ZdaPv
 U _Znam
 U __cxa_throw_bad_array_new_length
0001 C __gnu_lto_v1
 U omp_get_num_teams
 U omp_get_num_threads
 U omp_get_team_num
 U omp_get_thread_num

The function is actually there: nm lists it as _ZN6master4copyERKS_._omp_fn.1,
while lto1 complains about a section named _ZN6master4copyERKS_$_omp_fn$1, i.e.
with '$' instead of '.' as separators.

Best Regards
Thorsten Kurth

[Bug c++/81896] omp target enter data not recognized

2017-10-20 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81896

--- Comment #2 from Thorsten Kurth  ---
Hello,

another data point: when I introduce a dummy variable, it works; for example,
aliasing data to a local tmp and then using tmp in the clause. I suspect this
fails for the same reason that one cannot arbitrarily put class member
variables into OpenMP clauses.
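A minimal sketch of that workaround, assuming a class with a pointer member
double* data as in the attached test case (the names here are illustrative,
not the exact attachment):

class holder {                       // illustrative stand-in for the attached class
    double* data = nullptr;
public:
    void allocate(const unsigned int& size) {
        data = new double[size];
        double* tmp = data;          // local alias: a plain local pointer is accepted in the clause
        #pragma omp target enter data map(alloc: tmp[0:size])
    }
    void deallocate() {
        double* tmp = data;
        #pragma omp target exit data map(release: tmp[:0])
        delete[] data;
        data = nullptr;
    }
};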

Best Regards
Thorsten Kurth

[Bug libgomp/80859] Performance Problems with OpenMP 4.5 support

2017-10-20 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #28 from Thorsten Kurth  ---
Hello,

can someone please give me an update on this bug?

Best Regards
Thorsten Kurth

[Bug c++/81896] omp target enter data not recognized

2017-10-20 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81896

--- Comment #1 from Thorsten Kurth  ---
Hello,

is this report actually being worked on? It has been in UNCONFIRMED state for
quite a while now.

Best Regards
Thorsten Kurth

[Bug c++/82629] New: OpenMP 4.5 Target Region mangling problem

2017-10-19 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82629

Bug ID: 82629
   Summary: OpenMP 4.5 Target Region mangling problem
   Product: gcc
   Version: 7.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thorstenkurth at me dot com
  Target Milestone: ---

Dear Sir/Madam,

I am running into linking issues with gcc (GCC) 7.1.1 20170718 and OpenMP 4.5
target offloading. I am compiling a mixed Fortran/C++ code in which target
regions can appear in both kinds of source files. The final linking stage fails
with the following error message:

mpic++  -g -O3 -std=c++11  -fopenmp -foffload=nvptx-none
-DCG_USE_OLD_CONVERGENCE_CRITERIA -DBL_OMP_FABS -DNDEBUG -DBL_USE_MPI
-DBL_USE_OMP -DBL_GCC_VERSION='7.1.1' -DBL_GCC_MAJOR_VERSION=7
-DBL_GCC_MINOR_VERSION=1 -DBL_SPACEDIM=3 -DBL_FORT_USE_UNDERSCORE -DBL_Linux
-DMG_USE_FBOXLIB -DBL_USE_F_BASELIB -DBL_USE_FORTRAN_MPI -DUSE_F90_SOLVERS -I.
-I../../Src/C_BoundaryLib -I../../Src/LinearSolvers/C_CellMG
-I../../Src/LinearSolvers/C_CellMG4 -I../../Src/C_BaseLib
-I../../Src/C_BoundaryLib -I../../Src/C_BaseLib
-I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4
-I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG
-I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG
-I../../Src/F_BaseLib -I../../Src/F_BaseLib -L.  -o main3d.gnu.MPI.OMP.ex
o/3d.gnu.MPI.OMP.EXE/main.o o/3d.gnu.MPI.OMP.EXE/writePlotFile.o
o/3d.gnu.MPI.OMP.EXE/FabSet.o o/3d.gnu.MPI.OMP.EXE/BndryRegister.o
o/3d.gnu.MPI.OMP.EXE/Mask.o o/3d.gnu.MPI.OMP.EXE/MultiMask.o
o/3d.gnu.MPI.OMP.EXE/BndryData.o o/3d.gnu.MPI.OMP.EXE/InterpBndryData.o
o/3d.gnu.MPI.OMP.EXE/MacBndry.o o/3d.gnu.MPI.OMP.EXE/ABecLaplacian.o
o/3d.gnu.MPI.OMP.EXE/CGSolver.o o/3d.gnu.MPI.OMP.EXE/LinOp.o
o/3d.gnu.MPI.OMP.EXE/Laplacian.o o/3d.gnu.MPI.OMP.EXE/MultiGrid.o
o/3d.gnu.MPI.OMP.EXE/ABec2.o o/3d.gnu.MPI.OMP.EXE/ABec4.o
o/3d.gnu.MPI.OMP.EXE/BoxLib.o o/3d.gnu.MPI.OMP.EXE/ParmParse.o
o/3d.gnu.MPI.OMP.EXE/Utility.o o/3d.gnu.MPI.OMP.EXE/UseCount.o
o/3d.gnu.MPI.OMP.EXE/DistributionMapping.o
o/3d.gnu.MPI.OMP.EXE/ParallelDescriptor.o o/3d.gnu.MPI.OMP.EXE/VisMF.o
o/3d.gnu.MPI.OMP.EXE/Arena.o o/3d.gnu.MPI.OMP.EXE/BArena.o
o/3d.gnu.MPI.OMP.EXE/CArena.o o/3d.gnu.MPI.OMP.EXE/OMPArena.o
o/3d.gnu.MPI.OMP.EXE/NFiles.o o/3d.gnu.MPI.OMP.EXE/FabConv.o
o/3d.gnu.MPI.OMP.EXE/FPC.o o/3d.gnu.MPI.OMP.EXE/Box.o
o/3d.gnu.MPI.OMP.EXE/IntVect.o o/3d.gnu.MPI.OMP.EXE/IndexType.o
o/3d.gnu.MPI.OMP.EXE/Orientation.o o/3d.gnu.MPI.OMP.EXE/Periodicity.o
o/3d.gnu.MPI.OMP.EXE/RealBox.o o/3d.gnu.MPI.OMP.EXE/BoxList.o
o/3d.gnu.MPI.OMP.EXE/BoxArray.o o/3d.gnu.MPI.OMP.EXE/BoxDomain.o
o/3d.gnu.MPI.OMP.EXE/FArrayBox.o o/3d.gnu.MPI.OMP.EXE/IArrayBox.o
o/3d.gnu.MPI.OMP.EXE/BaseFab.o o/3d.gnu.MPI.OMP.EXE/MultiFab.o
o/3d.gnu.MPI.OMP.EXE/iMultiFab.o o/3d.gnu.MPI.OMP.EXE/FabArray.o
o/3d.gnu.MPI.OMP.EXE/CoordSys.o o/3d.gnu.MPI.OMP.EXE/Geometry.o
o/3d.gnu.MPI.OMP.EXE/MultiFabUtil.o o/3d.gnu.MPI.OMP.EXE/BCRec.o
o/3d.gnu.MPI.OMP.EXE/PhysBCFunct.o o/3d.gnu.MPI.OMP.EXE/PlotFileUtil.o
o/3d.gnu.MPI.OMP.EXE/BLProfiler.o o/3d.gnu.MPI.OMP.EXE/BLBackTrace.o
o/3d.gnu.MPI.OMP.EXE/MemPool.o o/3d.gnu.MPI.OMP.EXE/MGT_Solver.o
o/3d.gnu.MPI.OMP.EXE/FMultiGrid.o o/3d.gnu.MPI.OMP.EXE/MultiFab_C_F.o
o/3d.gnu.MPI.OMP.EXE/backtrace_c.o o/3d.gnu.MPI.OMP.EXE/fabio_c.o
o/3d.gnu.MPI.OMP.EXE/timer_c.o o/3d.gnu.MPI.OMP.EXE/BLutil_F.o
o/3d.gnu.MPI.OMP.EXE/BLParmParse_F.o o/3d.gnu.MPI.OMP.EXE/BLBoxLib_F.o
o/3d.gnu.MPI.OMP.EXE/BLProfiler_F.o o/3d.gnu.MPI.OMP.EXE/INTERPBNDRYDATA_3D.o
o/3d.gnu.MPI.OMP.EXE/LO_UTIL.o o/3d.gnu.MPI.OMP.EXE/ABec_3D.o
o/3d.gnu.MPI.OMP.EXE/ABec_UTIL.o o/3d.gnu.MPI.OMP.EXE/LO_3D.o
o/3d.gnu.MPI.OMP.EXE/LP_3D.o o/3d.gnu.MPI.OMP.EXE/MG_3D.o
o/3d.gnu.MPI.OMP.EXE/ABec2_3D.o o/3d.gnu.MPI.OMP.EXE/ABec4_3D.o
o/3d.gnu.MPI.OMP.EXE/COORDSYS_3D.o o/3d.gnu.MPI.OMP.EXE/FILCC_3D.o
o/3d.gnu.MPI.OMP.EXE/BaseFab_nd.o o/3d.gnu.MPI.OMP.EXE/threadbox.o
o/3d.gnu.MPI.OMP.EXE/MultiFabUtil_3d.o o/3d.gnu.MPI.OMP.EXE/mempool_f.o
o/3d.gnu.MPI.OMP.EXE/compute_defect.o o/3d.gnu.MPI.OMP.EXE/coarsen_coeffs.o
o/3d.gnu.MPI.OMP.EXE/mg_prolongation.o o/3d.gnu.MPI.OMP.EXE/ml_prolongation.o
o/3d.gnu.MPI.OMP.EXE/cc_mg_cpp.o o/3d.gnu.MPI.OMP.EXE/cc_applyop.o
o/3d.gnu.MPI.OMP.EXE/cc_ml_resid.o o/3d.gnu.MPI.OMP.EXE/cc_smoothers.o
o/3d.gnu.MPI.OMP.EXE/cc_stencil.o o/3d.gnu.MPI.OMP.EXE/cc_stencil_apply.o
o/3d.gnu.MPI.OMP.EXE/cc_stencil_fill.o
o/3d.gnu.MPI.OMP.EXE/cc_interface_stencil.o
o/3d.gnu.MPI.OMP.EXE/cc_mg_tower_smoother.o o/3d.gnu.MPI.OMP.EXE/itsol.o
o/3d.gnu.MPI.OMP.EXE/mg.o o/3d.gnu.MPI.OMP.EXE/mg_tower.o
o/3d.gnu.MPI.OMP.EXE/ml_cc.o o/3d.gnu.MPI.OMP.EXE/ml_nd.o
o/3d.gnu.MPI.OMP.EXE/ml_norm.o o/3d.gnu.MPI.OMP.EXE/tridiag.o
o/3d.gnu.MPI.OMP.EXE/nodal_mg_cpp.o o/3d.gnu.MPI.OMP.EXE/nodal_mask.o
o/3d.gnu.MPI.OMP.EXE/nodal_divu.o
o/3d.gnu.MPI.OMP.EXE/nodal_interface_stencil.o
o/3d.gnu.MPI.OMP.EXE/nodal_newu.o o/3d.gnu.MPI.OMP.EXE/nodal_s

[Bug c++/81896] New: omp target enter data not recognized

2017-08-18 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81896

Bug ID: 81896
   Summary: omp target enter data not recognized
   Product: gcc
   Version: 7.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thorstenkurth at me dot com
  Target Milestone: ---

Created attachment 42005
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42005&action=edit
small test case

Dear Sir/Madam,

I am not sure whether my report got posted the first time, because I cannot
find it any more (I did not receive a notification about it, and it is not
marked invalid anywhere). Therefore, I am posting it again.

It seems that gcc has problems with the omp target enter/exit data constructs.
When I compile the attached code I get:

g++ -O2 -std=c++11 -fopenmp -foffload=nvptx-none -c aclass.cpp -o aclass.o
In file included from aclass.h:2:0,
 from aclass.cpp:1:
masterclass.h: In member function 'void master::allocate(const unsigned int&)':
masterclass.h:10:50: error: 'master::data' is not a variable in 'map' clause
 #pragma omp target enter data map(alloc: data[0:size*sizeof(double)])
  ^~~~
masterclass.h:10:9: error: '#pragma omp target enter data' must contain at
least one 'map' clause
 #pragma omp target enter data map(alloc: data[0:size*sizeof(double)])
 ^~~
masterclass.h: In member function 'void master::deallocate()':
masterclass.h:15:51: error: 'master::data' is not a variable in 'map' clause
 #pragma omp target exit data map(release: data[:0])
   ^~~~
masterclass.h:15:9: error: '#pragma omp target exit data' must contain at least
one 'map' clause
 #pragma omp target exit data map(release: data[:0])
 ^~~
make: *** [aclass.o] Error 1
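For reference, here is a minimal sketch of the rejected pattern, reconstructed
from the diagnostics above (the attached test case may differ in details); with
g++ 7.1.1 this is exactly what triggers the errors shown:

class master {                          // names follow the diagnostics above
protected:
    double* data;                       // assumed member type (the sizing uses sizeof(double))
public:
    void allocate(const unsigned int& size) {
        data = new double[size];
        // The class member is named directly in the map clause; g++ 7.1.1 rejects
        // this with "'master::data' is not a variable in 'map' clause".
        #pragma omp target enter data map(alloc: data[0:size*sizeof(double)])
    }
    void deallocate() {
        #pragma omp target exit data map(release: data[:0])
        delete[] data;
    }
};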

The same code compiles fine when using XLC. 

Best Regards
Thorsten Kurth

[Bug c++/81850] New: OpenMP target enter data compilation issues

2017-08-14 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81850

Bug ID: 81850
   Summary: OpenMP target enter data compilation issues
   Product: gcc
   Version: 7.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thorstenkurth at me dot com
  Target Milestone: ---

Created attachment 41990
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41990&action=edit
Test case

Dear Sir/Madam,

g++ 7.1.1 cannot compile correct OpenMP 4.5 code. I have attached a small
example program that I initially developed to demonstrate a compiler bug in
XLC. GCC throws the following error message on compilation:

g++ -O2 -std=c++11 -fopenmp -foffload=nvptx-none -c aclass.cpp -o aclass.o
In file included from aclass.h:2:0,
 from aclass.cpp:1:
masterclass.h: In member function 'void master::allocate(const unsigned int&)':
masterclass.h:10:50: error: 'master::data' is not a variable in 'map' clause
 #pragma omp target enter data map(alloc: data[0:size*sizeof(double)])
  ^~~~
masterclass.h:10:9: error: '#pragma omp target enter data' must contain at
least one 'map' clause
 #pragma omp target enter data map(alloc: data[0:size*sizeof(double)])
 ^~~
masterclass.h: In member function 'void master::deallocate()':
masterclass.h:15:51: error: 'master::data' is not a variable in 'map' clause
 #pragma omp target exit data map(release: data[:0])
   ^~~~
masterclass.h:15:9: error: '#pragma omp target exit data' must contain at least
one 'map' clause
 #pragma omp target exit data map(always, release: data[:0])


To me it seems that it cannot recognize the "alloc" clause. 

Best Regards
Thorsten Kurth

[Bug libgomp/80859] Performance Problems with OpenMP 4.5 support

2017-08-08 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #27 from Thorsten Kurth  ---
Hello Jakub,

I wanted to follow up on this. Is there any progress on this issue?

Best Regards
Thorsten Kurth

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-26 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #26 from Thorsten Kurth  ---
Hello Jakub,

thanks for the clarification. So a team maps to a CTA, which is essentially a
block in CUDA terminology, correct? It is also good to have a categorical
equivalence between GPU and CPU code (SIMD units <-> warps) instead of mapping
SIMT threads to OpenMP threads; that makes it easier to keep the code portable.

About my mapping "problem": is there an elegant way of doing this, or does only
brute force work, i.e. writing additional member functions that return pointers
etc.? In general, the OpenMP mapping business is very verbose (not your fault,
I know), and it makes the code annoying to read.

Best
Thorsten

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-26 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #24 from Thorsten Kurth  ---
Hello Jakub,

I know that the section you mean is racy and reports the wrong number of
threads, but I put it in to check whether I get the correct numbers on a CPU
(I am not working on a GPU yet; that will be next). Most of the defines for
setting the number of teams and threads in the outer loop are there for
experimenting with what works best; in the end they will be removed. This code
is not finished by any means, it is a moving target and under active
development. Only the OpenMP 3 version is considered done and works well.

You said that SIMD pragmas are missing, and this is for a reason. First of all,
the code is memory-bandwidth bound, so it has a rather low arithmetic intensity
and vectorization does not help a lot. Of course vectorization helps in the
sense that loads and stores are vectorized and the prefetcher works more
efficiently, and we made sure that the (Intel) compiler vectorizes the inner
loops automatically. Putting in explicit SIMD pragmas made the performance
worse, because the (Intel) compiler then generates worse code in some cases
(according to some Intel compiler engineers, this is because when the compiler
sees a SIMD statement it will not try to partially unroll loops etc. and may
generate more masks than necessary). So auto-vectorization works fine here and
we have not revisited this issue. The GNU compiler might behave differently;
I did not look at what its auto-vectorizer did.

The more important questions I have are the following:

1) As you can see, the code has two levels of parallelism. On the CPU it is
most efficient to tile the boxes (this is the loop with the target distribute)
and then let one thread work on a box. I added another level of parallelism
inside the box because on the GPU you have more threads and might want to
exploit more parallelism; at least that is what folks from IBM suggested at an
OpenMP 4.5 hackathon.
So my question is: with target teams distribute, will one team be equal to a
CUDA warp, or will it be something bigger? I would like to have one warp
working on a box rather than having different PTX threads working on individual
boxes. To summarize: on the CPU the OpenMP threading should be such that one
thread gets a box and vectorization works on the inner loop (which is fine,
that works), and in the CUDA case one team/warp should work on a box and then
SIMT-parallelize the work inside the box.

2) Related to this: how does the PTX backend behave when it sees a SIMD
statement in a target region? Is that ignored or somehow interpreted? And how
does OpenMP map a CUDA warp to an OpenMP CPU thread, since that is the closest
equivalence I would say? My guess is that it ignores SIMD pragmas and just acts
on the thread level, where in the CUDA world one thread more or less acts like
a SIMD lane on the CPU.

3) This device mapping business is extremely verbose for C++ classes. For
example, the MFIter instances (amfi, comfy, solnLmfi, and so on) are not
correctly mapped yet and would cause trouble on the GPU (the Intel compiler
complains that they are not bitwise copyable; GNU compiles them anyway). These
are classes containing pointers to other classes, so in order to map them
properly I would technically need to map the dereferenced data members of the
member classes, correct? As an example, take a class with a std::vector pointer
as a data member: you technically need to map the vector's data() buffer to the
device, right? That, however, means you need to be able to access that buffer,
i.e. it must not be a protected class member. So what happens when you have a
class that you cannot change but whose private/protected members you need to
map? The example at hand is the MFIter class, which has this:

protected:

const FabArrayBase& fabArray;

IntVect tile_size;

unsigned char flags;
int   currentIndex;
int   beginIndex;
int   endIndex;
IndexType typ;

const Array<int>* index_map;
const Array<int>* local_index_map;
const Array<Box>* tile_array;

void Initialize ();

It has these array pointers. Technically (to my knowledge; I do not know the
code fully) these hold the indices that determine which global indices the
iterator actually iterates over. This data can be shared among the threads; it
is only read and never written. Nevertheless, the device needs to know the
indices, so index_map etc. need to be mapped. Now, Array is just a class with a
public std::vector member, but in order to map the index_map class member I
would need access to it, so that I can map the underlying std::vector data
buffer. Do you know what I mean? How is this done in the most elegant way in
OpenMP?
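A hedged sketch (not BoxLib's actual classes; all names here are illustrative)
of one way this can be spelled out today: expose the raw buffer through an
accessor, take a local pointer, and map the array section explicitly, so that
the target region only references plain locals.

#include <cstddef>
#include <vector>

struct IntArray {                       // stand-in for an Array-of-int wrapper
    std::vector<int> v;
};

class Iter {
    const IntArray* index_map;          // pointer member (protected in the real class)
public:
    explicit Iter(const IntArray* m) : index_map(m) {}

    // accessors added purely so the underlying buffer can be reached for mapping
    const int* map_data() const  { return index_map->v.data(); }
    std::size_t map_size() const { return index_map->v.size(); }

    long sum_indices_on_device() const {
        const int* idx = map_data();    // local alias of the underlying buffer
        const std::size_t n = map_size();
        long s = 0;
        // Only the locals idx, n and s are referenced inside the region, so no
        // implicit mapping of *this or of the nested classes is needed.
        #pragma omp target teams distribute parallel for \
                    map(to: idx[0:n]) reduction(+: s)
        for (std::size_t i = 0; i < n; ++i)
            s += idx[i];
        return s;
    }
};

For a protected member with no such accessor the buffer cannot be named from
outside at all, which is exactly the difficulty raised above.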

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-25 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #22 from Thorsten Kurth  ---
Hello Jakub,

that is stuff for Intel VTune. I have commented it out and added the NUM_TEAMS
defines to the GNUmakefile. Please pull the latest changes.

Best and thanks
Thorsten

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-25 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #20 from Thorsten Kurth  ---
To compile the code, edit the GNUmakefile to suit your needs (feel free to ask
any questions). To run it, execute the generated binary, which is called
something like

main3d.XXX...

where the XXX part tells you whether you compiled with MPI, OpenMP, etc. There
is an inputs file that you just pass to it:

./main3d.. inputs

That's it. Tell me if you need more info.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #19 from Thorsten Kurth  ---
Thank you very much. I am sorry that I do not have a simpler test case. The
kernel that is executed is in the same directory as ABecLaplacian and is called
MG_3D_cpp.cpp.

We have seen similar problems with the Fortran kernels (they are scattered
across multiple files), but the Fortran kernels and our C++ ports give the same
performance with the original OpenMP parallelization. In any case, I wonder why
the compiler honors the target region even when -march=knl is specified. Please
let me know if you have further questions; I can guide you through the code.
The code base is big, but only 2 or 3 files and not very many lines of code are
actually relevant.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #17 from Thorsten Kurth  ---
The result, though, is correct; I verified that both codes generate the correct
output.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #16 from Thorsten Kurth  ---
FYI, the code is:

https://github.com/zronaghi/BoxLib.git

in branch

cpp_kernels_openmp4dot5

and then in Src/LinearSolvers/C_CellMG

the file ABecLaplacian.cpp. For example, lines 542 and 543 can be commented in
and out; when the test case is run with that code commented in, you get a
significant slowdown. I did not map all the scalar variables, so that might be
part of the problem, but in any case it should not create copies of them at
all, in my opinion.

Please don't look at that code in detail right now, because it is a bit
convoluted; I just wanted to show that the issue appears. With the target
section mentioned above commented in, running:

#!/bin/bash

export OMP_NESTED=false
export OMP_NUM_THREADS=64
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_MAX_ACTIVE_LEVELS=1

execpath="/project/projectdirs/mpccc/tkurth/Portability/BoxLib/Tutorials/MultiGrid_C"
exec=`ls -latr ${execpath}/main3d.*.MPI.OMP.ex | awk '{print $9}'`

#execute
${exec} inputs

gives the following output:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size : 1 (length unit) 
Max_grid_size   : 32 (cells)
Number of grids : 64
Sum of RHS  : -2.68882138776405e-17

Solving with BoxLib C++ solver 
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs= 135.516568492921
MultiGrid: Initial residual   = 135.516568492921
MultiGrid: Iteration   1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration   2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration   3 resid/bnorm = 0.000551321916982188
MultiGrid: Iteration   4 resid/bnorm = 3.55014555643671e-05
MultiGrid: Iteration   5 resid/bnorm = 2.57082340920002e-06
MultiGrid: Iteration   6 resid/bnorm = 1.90970439886018e-07
MultiGrid: Iteration   7 resid/bnorm = 1.44525222814178e-08
MultiGrid: Iteration   8 resid/bnorm = 1.10675190626368e-09
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11
, Solve time: 5.84898591041565, CG time: 0.162226438522339
   Converged res < eps_rel*max(bnorm,res_norm)
   Run time  : 5.98936820030212

Unused ParmParse Variables:
[TOP]::hypre.solver_flag(nvals = 1)  :: [1]
[TOP]::hypre.pfmg_rap_type(nvals = 1)  :: [1]
[TOP]::hypre.pfmg_relax_type(nvals = 1)  :: [2]
[TOP]::hypre.num_pre_relax(nvals = 1)  :: [2]
[TOP]::hypre.num_post_relax(nvals = 1)  :: [2]
[TOP]::hypre.skip_relax(nvals = 1)  :: [1]
[TOP]::hypre.print_level(nvals = 1)  :: [1]
done.

When I comment it out and recompile, I get:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size : 1 (length unit) 
Max_grid_size   : 32 (cells)
Number of grids : 64
Sum of RHS  : -2.68882138776405e-17

Solving with BoxLib C++ solver 
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs= 135.516568492921
MultiGrid: Initial residual   = 135.516568492921
MultiGrid: Iteration   1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration   2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration   3 resid/bnorm = 0.000551321916981978
MultiGrid: Iteration   4 resid/bnorm = 3.5501455563633e-05
MultiGrid: Iteration   5 resid/bnorm = 2.5708234090034e-06
MultiGrid: Iteration   6 resid/bnorm = 1.90970439781153e-07
MultiGrid: Iteration   7 resid/bnorm = 1.44525225042545e-08
MultiGrid: Iteration   8 resid/bnorm = 1.10675108045705e-09
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11
, Solve time: 0.759385108947754, CG time: 0.14183521270752
   Converged res < eps_rel*max(bnorm,res_norm)
   Run time  : 0.879786014556885

Unused ParmParse Variables:
[TOP]::hypre.solver_flag(nvals = 1)  :: [1]
[TOP]::hypre.pfmg_rap_type(nvals = 1)  :: [1]
[TOP]::hypre.pfmg_relax_type(nvals = 1)  :: [2]
[TOP]::hypre.num_pre_relax(nvals = 1)  :: [2]
[TOP]::hypre.num_post_relax(nvals = 1)  :: [2]
[TOP]::hypre.skip_relax(nvals = 1)  :: [1]
[TOP]::hypre.print_level(nvals = 1)  :: [1]
done.

That is roughly a 7.3x slowdown. The smoothing kernel (red-black Gauss-Seidel)
is the most expensive kernel in the multigrid code, so I see the biggest effect
here, but the other kernels (prolongation, restriction, dot products, etc.)
show slowdowns as well, amounting to more than 10x in total for the whole app.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #15 from Thorsten Kurth  ---
The code I care about definitely has optimization enabled. For the Fortran
files the build does (for example):

ftn  -g -O3 -ffree-line-length-none -fno-range-check -fno-second-underscore
-Jo/3d.gnu.MPI.OMP.EXE -I o/3d.gnu.MPI.OMP.EXE -fimplicit-none  -fopenmp -I.
-I../../Src/C_BoundaryLib -I../../Src/LinearSolvers/C_CellMG
-I../../Src/LinearSolvers/C_CellMG4 -I../../Src/C_BaseLib
-I../../Src/C_BoundaryLib -I../../Src/C_BaseLib
-I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4
-I/opt/intel/vtune_amplifier_xe_2017.2.0.499904/include
-I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG
-I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG
-I../../Src/F_BaseLib -I../../Src/F_BaseLib -c
../../Src/LinearSolvers/F_MG/itsol.f90 -o o/3d.gnu.MPI.OMP.EXE/itsol.o
Compiling cc_mg_tower_smoother.f90 ...

and for the C++ stuff it does

CC  -g -O3 -std=c++14  -fopenmp -g -DCG_USE_OLD_CONVERGENCE_CRITERIA
-DBL_OMP_FABS -DDEVID=0 -DNUM_TEAMS=1 -DNUM_THREADS_PER_BOX=1 -march=knl 
-DNDEBUG -DBL_USE_MPI -DBL_USE_OMP -DBL_GCC_VERSION='6.3.0'
-DBL_GCC_MAJOR_VERSION=6 -DBL_GCC_MINOR_VERSION=3 -DBL_SPACEDIM=3
-DBL_FORT_USE_UNDERSCORE -DBL_Linux -DMG_USE_FBOXLIB -DBL_USE_F_BASELIB
-DBL_USE_FORTRAN_MPI -DUSE_F90_SOLVERS -I. -I../../Src/C_BoundaryLib
-I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4
-I../../Src/C_BaseLib -I../../Src/C_BoundaryLib -I../../Src/C_BaseLib
-I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4
-I/opt/intel/vtune_amplifier_xe_2017.2.0.499904/include
-I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG
-I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG
-I../../Src/F_BaseLib -I../../Src/F_BaseLib -c ../../Src/C_BaseLib/FPC.cpp -o
o/3d.gnu.MPI.OMP.EXE/FPC.o
Compiling Box.cpp ...

But the kernels I care about are in C++.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #13 from Thorsten Kurth  ---
Hello Jakub,

the compiler options are just -fopenmp. I am sure this has nothing to do with
vectorization, since I compare the runtime with and without the target
directives and vectorization should therefore be the same in both cases; the
remaining OpenMP sections are identical. In our work we have never seen a 10x
difference from insufficient vectorization; it is usually due to cache
locality, but that is the same for OpenMP 4.5 and OpenMP 3 because the loops
are not touched. I do not specify an ISA, but I will try specifying KNL now and
report what the compiler does.

Best
Thorsten

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #11 from Thorsten Kurth  ---
Hello Jakub,

yes, you are right. I thought that map(tofrom:) was the default mapping, but I
might be wrong. In any case, the number of teams is always 1, so this code is
basically just data streaming and there is no need for a detailed performance
analysis. When I timed the code (without profiling it), the OpenMP 4.5 version
had a tiny bit more overhead, but nothing significant.
However, we might nevertheless learn something from it.

Best
Thorsten

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #9 from Thorsten Kurth  ---
Sorry, in the second run I set the number of threads to 12. I think the code
works as expected.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #8 from Thorsten Kurth  ---
Here is the output of the get_num_threads section:

[tkurth@cori02 omp_3_vs_45_test]$ export OMP_NUM_THREADS=32
[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 32 threads.

and:

[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 12 threads.

I think the code is OK.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #7 from Thorsten Kurth  ---
Hello Jakub,

thanks for your comment, but I think the parallel for is not racy: every thread
works on a block of i-indices, so that is fine. The dotprod kernel is actually
taken from the OpenMP standard documentation, and I am sure it is not racy.
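For reference, such a dot-product target kernel, in the style of the OpenMP
Examples document, looks roughly like this (a sketch, not the exact code from
the attached test case):

#include <cstddef>

double dotprod(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    // Map the two input arrays to the device and combine the partial sums.
    #pragma omp target teams distribute parallel for \
                map(to: a[0:n], b[0:n]) reduction(+: sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}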

I do not see a problem with the example regions you mentioned either: by
default everything is shared, so the variable is updated by all threads/teams
with the same value.

The issue is that num_teams=1 is only guaranteed on the CPU; on the GPU it
depends on the OS, driver, architecture, and so on.

Concerning splitting distribute and parallel: I tried both combinations and
found that they behave the same. In the end I split them so that I could
comment out the distribute part and see whether that makes a performance
difference (and it does).

I believe the overhead instructions are responsible for the bad performance,
because that is the only thing distinguishing the target-annotated code from
the plain OpenMP code. I used VTune to look at thread utilization, and it looks
similar; the L1 and L2 hit rates are very close (100% vs. 99% and 92% vs. 89%)
for the plain OpenMP and the target-annotated code. But the performance of the
target-annotated code can be up to 10x worse, so I suspect there is register
spilling due to copying a large number of variables. If you like, I can point
you to the GitHub repository (BoxLib) that clearly exhibits this issue. This
small test case only shows minor overhead of OpenMP 4.5 versus, say, OpenMP 3,
but it clearly generates some additional overhead.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #5 from Thorsten Kurth  ---
To clarify the problem:
I think the additional movq, pushq, and other instructions generated when using
the target directive can cause a big performance hit. I understand that these
instructions are necessary when offloading is used, but when I compile for the
native architecture they should not be there. So maybe I am just missing a GNU
compiler flag that disables offloading and lets the compiler ignore the target,
teams, and distribute directives at compile time while still honoring all the
other OpenMP constructs.
Is there a way to do that right now, and if not, could such a flag be added?

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #4 from Thorsten Kurth  ---
Created attachment 41415
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41415&action=edit
Testcase

This is the test case. The files ending in .as contain the assembly code with
and without the target region.

[Bug c++/80859] Performance Problems with OpenMP 4.5 support

2017-05-24 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #3 from Thorsten Kurth  ---
Created attachment 41414
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41414&action=edit
OpenMP 4.5 Testcase

This is the source code

[Bug c++/80859] New: Performance Problems with OpenMP 4.5 support

2017-05-22 Thread thorstenkurth at me dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

Bug ID: 80859
   Summary: Performance Problems with OpenMP 4.5 support
   Product: gcc
   Version: 6.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thorstenkurth at me dot com
  Target Milestone: ---

Dear Sir/Madam,

I am working on the Cori HPC system, a Cray XC-40 with Intel Xeon Phi 7250
processors. I have probably found a performance "bug" when using the OpenMP 4.5
target directives. It seems to me that the GNU compiler generates unnecessary
move and push instructions when a

#pragma omp target region is present but no offloading is used.

I have attached a test case to illustrate the problem. Please build
nested_test_omp_4dot5.x in the directory (don't be confused by the name; I am
not using nested OpenMP here). Then go into the corresponding .cpp file,
comment out the target-related directives (target, teams, and distribute),
compile again, and compare the assembly code. The code with the target
directives has more pushes and moves than the one without. I have also placed
the output of that process in the directory already, in the files ending in .as.
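Purely for illustration (these function names are not from the attached test
case), this is the kind of loop pair being compared, once as plain OpenMP and
once with the 4.5 target/teams/distribute directives:

#include <cstddef>

void scale_plain(double* x, std::size_t n, double a) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= a;
}

void scale_target(double* x, std::size_t n, double a) {
    #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= a;
}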

The performance overhead is marginal here, but I am currently working on a
Department of Energy performance portability project in which I am exploring
the usefulness of OpenMP 4.5. The code we are retargeting is a geometric
multigrid solver in the BoxLib/AMReX framework, and there the overhead is
significant: I observed as much as a 10x slowdown accumulated throughout the
app. That code is bigger, so I do not want to demonstrate it here, but I could
send you an invitation to the GitHub repository on request. In my opinion, if
no offloading is used, the compiler should simply ignore the target region
statements and default to plain OpenMP.

Please let me know what you think.

Best Regards
Thorsten Kurth
National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory

[Bug c/60101] New: Long compile times when mixed complex floating point datatypes are used in lengthy expressions

2014-02-06 Thread thorstenkurth at me dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60101

Bug ID: 60101
   Summary: Long compile times when mixed complex floating point
datatypes are used in lengthy expressions
   Product: gcc
   Version: 4.8.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thorstenkurth at me dot com

Created attachment 32071
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32071&action=edit
Archive which includes test case.

In the attached example, the double.c file compiles instantly whereas the
float.c file takes very long. This is a truncated version of an even longer
file (more lines of code in the loop), and the compile time for float.c grows
like N^3, where N is the number of lines. Here is the output for the long
version with 4.8.2:

0x40ae17 do_spec_1
../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5269
0x40ae17 do_spec_1
../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5269
0x40c875 process_brace_body
../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5872
0x40c875 process_brace_body
../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5872
0x40c875 handle_braces
../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5786
0x40c875 handle_braces
../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5786
0x40ae17 do_spec_1
../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5269
0x40c875 process_brace_body
../../gcc-4.8.2-src/gcc-4.8.2/gcc/gcc.c:5872
and more messages like that

The attached files both compile, but float.c takes significantly longer. The
only difference between the files is that the temporary variable sum is double
complex in the fast version and float complex in the slow version. My guess is
that the compiler tries to reorganize the complex multiplications and additions
so that intermediate floating-point results can be reused. Both files compile
almost instantly with icc (>= 11.0) and clang/LLVM, and it also works with
gcc <= 4.4.