Hi Barry,
Here is what I have:
1. The hpddm issues have all been solved (there are no more hpddm
failures here:
https://giref.ulaval.ca/~cmpgiref/petsc-main-debug/2021.03.29.02h00m02s_make_test.log)
2. For Hypre, I think it is indeed not a bug but a feature. As far as I
can see from what has been said on the hypre discussion list, "It still
depends on the number of threads, that can’t be avoided" (
https://github.com/hypre-space/hypre/issues/303#issuecomment-800442755 )
and here
https://www.researchgate.net/publication/220411740_Multigrid_Smoothers_for_Ultraparallel_Computing,
in section 7.3, there is some interesting information, such as:
"Figure 7.6 clearly illustrates that convergence degrades with the
addition of threads for hybrid SGS;
....
The 3D sphere problem is the most extreme example because AMG-CG with
hybrid SGS no longer converges with the addition of threading."
but I might have misunderstood, since I am not an expert on that...
3. For SuperLU_DIST, I have tried to build SuperLU_DIST outside of PETSc
to run the tests from SuperLU itself: sadly, the bug does not show up
there (see https://github.com/xiaoyeli/superlu_dist/issues/69).
I would like to build a standalone SuperLU_DIST reproducer from what is
done in the faulty test, ksp_ksp_tutorials-ex5, which is buggy when
called from PETSc. What puzzles me is that many other PETSc tests run
fine with SuperLU_DIST: maybe something is done uniquely in
ksp_ksp_tutorials-ex5?
So I think it is worth digging into #3: the simple thing I have not yet
done is retrieving the stack when it fails (timeout).
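For the record, my plan (untested; assuming gdb is available in the
container and the pids of the hanging ranks can be found with ps) is to
attach to each rank once the test hangs and dump all the stacks:
gdb -p <pid> -ex 'thread apply all bt' -ex detach -ex quit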
And a question: when you say you upgraded to OpenMPI 4.1, do you mean
for one of your automated (Docker?) builds in the GitLab pipelines?
Thanks for checking in! :)
Eric
On 2021-03-30 1:47 p.m., Barry Smith wrote:
Eric,
How are things going on this OpenMP front? Any bug fixes from
hypre or SuperLU_DIST?
BTW: we have upgraded to OpenMPI 4.1 perhaps this resolves some
issues?
Barry
On Mar 22, 2021, at 2:07 PM, Eric Chamberland
<[email protected]> wrote:
I added some information here:
https://github.com/xiaoyeli/superlu_dist/issues/69#issuecomment-804318719
Maybe someone can say more than I can about what PETSc tries to do in
the 2 mentioned tutorials that are timing out...
Thanks,
Eric
On 2021-03-15 11:31 a.m., Eric Chamberland wrote:
Reported timeout bugs to SuperLU_dist too:
https://github.com/xiaoyeli/superlu_dist/issues/69
Eric
On 2021-03-14 2:18 p.m., Eric Chamberland wrote:
Done:
https://github.com/hypre-space/hypre/issues/303
Maybe I will need some help with PETSc to answer their questions...
Eric
On 2021-03-14 3:44 a.m., Stefano Zampini wrote:
Eric
You should report these HYPRE issues upstream
https://github.com/hypre-space/hypre/issues
On Mar 14, 2021, at 3:44 AM, Eric Chamberland
<[email protected]> wrote:
For us it clearly creates problems in real computations...
I understand the need to have clean tests for PETSc, but for me, it
reveals that hypre isn't usable with more than one thread for now...
Another solution: force a single-threaded configuration for hypre
until this is fixed?
Eric
On 2021-03-13 8:50 a.m., Pierre Jolivet wrote:
-pc_hypre_boomeramg_relax_type_all Jacobi =>
Linear solve did not converge due to DIVERGED_INDEFINITE_PC
iterations 3
-pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi =>
OK, independently of the architecture it seems (Eric's Docker image
with 1 or 2 threads, or my macOS), but the contraction factor is higher
Linear solve converged due to CONVERGED_RTOL iterations 8
Linear solve converged due to CONVERGED_RTOL iterations 24
Linear solve converged due to CONVERGED_RTOL iterations 26
v. currently
Linear solve converged due to CONVERGED_RTOL iterations 7
Linear solve converged due to CONVERGED_RTOL iterations 9
Linear solve converged due to CONVERGED_RTOL iterations 10
Do we change this? Or should we force OMP_NUM_THREADS=1 for make
test?
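(For a quick local check, I believe one can just prefix the harness,
e.g.:
OMP_NUM_THREADS=1 /usr/bin/gmake -f gmakefile test search='snes_tutorials-ex56_hypre'
though that is a one-off workaround, not a fix.)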
Thanks,
Pierre
On 13 Mar 2021, at 2:26 PM, Mark Adams <[email protected]> wrote:
Hypre uses a multiplicative smoother by default. It has a Chebyshev
smoother. That, with a Jacobi PC, should be thread invariant.
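(If I remember the PETSc option names correctly, selecting it would be
something like -pc_hypre_boomeramg_relax_type_all Chebyshev, but
double-check against -help; this is from memory.)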
Mark
On Sat, Mar 13, 2021 at 8:18 AM Pierre Jolivet <[email protected]> wrote:
On 13 Mar 2021, at 9:17 AM, Pierre Jolivet <[email protected]> wrote:
Hello Eric,
I’ve made an “interesting” discovery, so I’ll put the list back in cc.
It appears the following snippet of code which uses
Allreduce() + lambda function + MPI_IN_PLACE is:
- Valgrind-clean with MPICH;
- Valgrind-clean with OpenMPI 4.0.5;
- not Valgrind-clean with OpenMPI 4.1.0.
I’m not sure who is to blame here; I’ll need to look at the MPI
specification for what is required of implementors and users in that
case.
In the meantime, I’ll do the following:
- update config/BuildSystem/config/packages/OpenMPI.py to
use OpenMPI 4.1.0, see if any other error appears;
- provide a hotfix to bypass the segfaults;
I can confirm that splitting the single Allreduce with my own MPI_Op
into two Allreduces with MAX and BAND fixes the segfaults with
OpenMPI (*); see the sketch after this list.
- look at the hypre issue and whether they should be
deferred to the hypre team.
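For reference, here is a minimal self-contained sketch of the two
patterns (my reconstruction, not the exact code from the attached
ompi.cxx: the lambda is written as a plain function, and the max + band
pair of reduced values is only illustrative):

#include <mpi.h>

/* User-defined reduction over pairs of ints: component 0 keeps the max,
   component 1 the bitwise AND. A contiguous "pair" datatype ensures the
   implementation never splits a pair between segments. */
static void MaxBand(void *in, void *inout, int *len, MPI_Datatype *dtype)
{
  (void)dtype;
  const int *a = (const int *)in;
  int       *b = (int *)inout;
  for (int i = 0; i < *len; ++i) {
    if (a[2 * i] > b[2 * i]) b[2 * i] = a[2 * i]; /* MAX  */
    b[2 * i + 1] &= a[2 * i + 1];                 /* BAND */
  }
}

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int v[2] = {rank, rank == 0 ? ~0 : 1};

  /* Variant 1: a single in-place Allreduce with a custom MPI_Op
     (the pattern that is not Valgrind-clean with OpenMPI 4.1.0). */
  MPI_Datatype pair;
  MPI_Type_contiguous(2, MPI_INT, &pair);
  MPI_Type_commit(&pair);
  MPI_Op op;
  MPI_Op_create(MaxBand, 1 /* commutative */, &op);
  MPI_Allreduce(MPI_IN_PLACE, v, 1, pair, op, MPI_COMM_WORLD);
  MPI_Op_free(&op);
  MPI_Type_free(&pair);

  /* Variant 2: the workaround of (*), two Allreduces with built-in ops. */
  int w[2] = {rank, rank == 0 ? ~0 : 1};
  MPI_Allreduce(MPI_IN_PLACE, &w[0], 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
  MPI_Allreduce(MPI_IN_PLACE, &w[1], 1, MPI_INT, MPI_BAND, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}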
I don’t know if there is something wrong in hypre's threading or if
it’s just a side effect of threading, but it seems that the number of
threads has a drastic effect on the quality of the PC.
By default, it looks like there are two threads per process with your
Docker image.
If I force OMP_NUM_THREADS=1, then I get the same
convergence as in the output file.
Thanks,
Pierre
(*) https://gitlab.com/petsc/petsc/-/merge_requests/3712
Thank you for the Docker files, they were really useful.
If you want to avoid oversubscription failures, you can
edit the file
/opt/openmpi-4.1.0/etc/openmpi-default-hostfile and append
the line:
localhost slots=12
If you want to increase the timeout limit of the PETSc test suite for
each test, you can add the extra flag TIMEOUT=180 to your command line
(the default is 60; units are seconds).
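For instance (assuming the install prefix of your image), the two
tweaks amount to:
echo "localhost slots=12" >> /opt/openmpi-4.1.0/etc/openmpi-default-hostfile
/usr/bin/gmake -f gmakefile test TIMEOUT=180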
Thanks, I’ll ping you on GitLab when I’ve got something
ready for you to try,
Pierre
<ompi.cxx>
On 12 Mar 2021, at 8:54 PM, Eric Chamberland <[email protected]> wrote:
Hi Pierre,
I now have a Docker container reproducing the problems here.
Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm, it
fails like this:
not ok snes_tutorials-ex12_quad_singular_hpddm # Error
code: 59
# Initial guess
# L_2 Error: 0.00803099
# Initial Residual
# L_2 Residual: 1.09057
# Au - b = Au + F(0)
# Linear L_2 Residual: 1.09057
# [d470c54ce086:14127] Read -1, expected 4096, errno = 1
# [d470c54ce086:14128] Read -1, expected 4096, errno = 1
# [d470c54ce086:14129] Read -1, expected 4096, errno = 1
# [3]PETSC ERROR:
------------------------------------------------------------------------
# [3]PETSC ERROR: Caught signal number 11 SEGV:
Segmentation Violation, probably memory access out of range
# [3]PETSC ERROR: Try option -start_in_debugger or
-on_error_attach_debugger
# [3]PETSC ERROR: or see
https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
# [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple
Mac OS X to find memory corruption errors
# [3]PETSC ERROR: likely location of problem given in
stack below
# [3]PETSC ERROR: --------------------- Stack Frames
------------------------------------
# [3]PETSC ERROR: Note: The EXACT line numbers in the
stack are not available,
# [3]PETSC ERROR: INSTEAD the line number of the start of
the function
# [3]PETSC ERROR: is given.
# [3]PETSC ERROR: [3] buildTwo line 987
/opt/petsc-main/include/HPDDM_schwarz.hpp
# [3]PETSC ERROR: [3] next line 1130
/opt/petsc-main/include/HPDDM_schwarz.hpp
# [3]PETSC ERROR: --------------------- Error Message
--------------------------------------------------------------
# [3]PETSC ERROR: Signal received
# [3]PETSC ERROR: [0]PETSC ERROR:
------------------------------------------------------------------------
ex12_quad_hpddm_reuse_baij also fails, with many more
"Read -1, expected ..." messages; I don't know where those come
from...?
Hypre (as in diff-snes_tutorials-ex56_hypre) is also having
DIVERGED_INDEFINITE_PC failures...
Please see the 3 attached Docker files:
1) fedora_mkl_and_devtools: the Dockerfile which installs Fedora 33
with the GNU compilers, MKL, and everything needed for development.
2) openmpi: the Dockerfile to build OpenMPI.
3) petsc: the last Dockerfile, which builds, installs, and tests
PETSc.
I build the 3 like this:
docker build -t fedora_mkl_and_devtools -f
fedora_mkl_and_devtools .
docker build -t openmpi -f openmpi .
docker build -t petsc -f petsc .
Disclaimer: I am not a Docker expert, so I may do things that are not
Docker state-of-the-art, but I am open to suggestions... ;)
I have just run it on my laptop (took long), which does not have
enough cores, so many more tests failed (I should force
--oversubscribe but don't know how to). I will relaunch on my
workstation in a few minutes.
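(Maybe exporting OMPI_MCA_rmaps_base_oversubscribe=true before the
run, or overriding the launcher with
/usr/bin/gmake -f gmakefile test MPIEXEC='mpiexec --oversubscribe'
would do it, but I have not verified either.)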
I will now test your branch! (sorry for the delay).
Thanks,
Eric
On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
Hi Pierre,
ok, that's interesting!
I will try to build a docker image until tomorrow and
give you the exact recipe to reproduce the bugs.
Eric
On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
On 11 Mar 2021, at 6:16 AM, Barry Smith <[email protected]> wrote:
Eric,
Sorry for the slow reply. We still have this in our active email, so
you don't need to submit individual issues. We'll try to get to them
as soon as we can.
Indeed, I’m still trying to figure this out.
I realized that some of my configure flags were different from yours,
e.g., no --with-memalign.
I’ve also added SuperLU_DIST to my installation.
Still, I can’t reproduce any issue.
I will continue looking into this; it appears I’m seeing some Valgrind
errors, but I don’t know if this is a side effect of OpenMPI not being
Valgrind-clean (last time I checked, there was no error with MPICH).
Thank you for your patience,
Pierre
/usr/bin/gmake -f gmakefile test test-fail=1
Using MAKEFLAGS: test-fail=1
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
ok snes_tutorials-ex12_quad_hpddm_reuse_baij
ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
ok ksp_ksp_tests-ex33_superlu_dist_2
ok diff-ksp_ksp_tests-ex33_superlu_dist_2
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
ok
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
ok
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
ok
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
ok
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
ok
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
ok
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
ok
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
ok
diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
ok ksp_ksp_tutorials-ex50_tut_2
ok diff-ksp_ksp_tutorials-ex50_tut_2
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
ok ksp_ksp_tests-ex33_superlu_dist
ok diff-ksp_ksp_tests-ex33_superlu_dist
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
ok snes_tutorials-ex56_hypre
ok diff-snes_tutorials-ex56_hypre
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
ok ksp_ksp_tutorials-ex56_2
ok diff-ksp_ksp_tutorials-ex56_2
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
ok snes_tutorials-ex17_3d_q3_trig_elas
ok diff-snes_tutorials-ex17_3d_q3_trig_elas
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
ok
diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
#srun: error: Unable to create step for job 1426755:
More processors requested than permitted
ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command
failed so no diff
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran
required for this test
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
ok snes_tutorials-ex19_tut_3
ok diff-snes_tutorials-ex19_tut_3
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
ok snes_tutorials-ex17_3d_q3_trig_vlap
ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP
Fortran required for this test
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
ok snes_tutorials-ex19_superlu_dist
ok diff-snes_tutorials-ex19_superlu_dist
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
ok
snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
ok
diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
ok ksp_ksp_tutorials-ex49_hypre_nullspace
ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
ok snes_tutorials-ex19_superlu_dist_2
ok diff-snes_tutorials-ex19_superlu_dist_2
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
#srun: error: Unable to create step for job 1426755:
More processors requested than permitted
ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command
failed so no diff
TEST
arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
ok
snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
ok
diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
ok ksp_ksp_tutorials-ex64_1
ok diff-ksp_ksp_tutorials-ex64_1
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
#srun: error: Unable to create step for job 1426755:
More processors requested than permitted
ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command
failed so no diff
TEST
arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP
Fortran required for this test
Barry
On Mar 10, 2021, at 11:03 PM, Eric Chamberland
<[email protected]> wrote:
Barry,
to get some follow-up on the --with-openmp=1 failures, shall I open
GitLab issues for:
a) all hypre failures giving DIVERGED_INDEFINITE_PC
b) all superlu_dist failures giving different results than the initial
ones, and "Exceeded timeout limit of 60 s"
c) hpddm failures with "free(): invalid next size (fast)" and
"Segmentation Violation"
d) all of TAO's "Exceeded timeout limit of 60 s"
I don't see how I could do all this debugging by myself...
Thanks,
Eric
--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
<fedora_mkl_and_devtools.txt><openmpi.txt><petsc.txt>