Eric,

You should report these HYPRE issues upstream: https://github.com/hypre-space/hypre/issues
> On Mar 14, 2021, at 3:44 AM, Eric Chamberland <[email protected]> wrote:
>
> For us it clearly creates problems in real computations...
>
> I understand the need to have clean tests for PETSc, but for me, it reveals that hypre isn't usable with more than one thread for now...
>
> Another solution: force a single-threaded configuration for hypre until this is fixed?
>
> Eric
>
> On 2021-03-13 8:50 a.m., Pierre Jolivet wrote:
>> -pc_hypre_boomeramg_relax_type_all Jacobi =>
>> Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 3
>> -pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi =>
>> OK, independently of the architecture it seems (Eric's Docker image with 1 or 2 threads, or my macOS), but the contraction factor is higher:
>> Linear solve converged due to CONVERGED_RTOL iterations 8
>> Linear solve converged due to CONVERGED_RTOL iterations 24
>> Linear solve converged due to CONVERGED_RTOL iterations 26
>> vs. currently:
>> Linear solve converged due to CONVERGED_RTOL iterations 7
>> Linear solve converged due to CONVERGED_RTOL iterations 9
>> Linear solve converged due to CONVERGED_RTOL iterations 10
>>
>> Do we change this? Or should we force OMP_NUM_THREADS=1 for make test?
>>
>> Thanks,
>> Pierre
>>
>>> On 13 Mar 2021, at 2:26 PM, Mark Adams <[email protected]> wrote:
>>>
>>> Hypre uses a multiplicative smoother by default. It has a Chebyshev smoother; that with a Jacobi PC should be thread invariant.
>>> Mark
>>>
>>> On Sat, Mar 13, 2021 at 8:18 AM Pierre Jolivet <[email protected]> wrote:
>>>
>>>> On 13 Mar 2021, at 9:17 AM, Pierre Jolivet <[email protected]> wrote:
>>>>
>>>> Hello Eric,
>>>> I've made an "interesting" discovery, so I'll put the list back in c/c.
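Since the thread count changes the smoother's behavior, the two mitigations discussed above (pinning hypre to one OpenMP thread, or selecting the l1-scaled Jacobi smoother on all levels) can be combined on the command line. A minimal sketch, assuming a generic PETSc example binary and process count (neither taken from this thread):

```shell
# Option 1: make hypre single-threaded so results do not depend on thread count.
export OMP_NUM_THREADS=1

# Option 2: switch every BoomerAMG level to the thread-invariant
# l1-scaled Jacobi smoother (binary name and -n 2 are placeholders).
mpiexec -n 2 ./ex56 -pc_type hypre -pc_hypre_type boomeramg \
  -pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi \
  -ksp_converged_reason
```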
>>>> It appears the following snippet of code, which uses Allreduce() + a lambda function + MPI_IN_PLACE, is:
>>>> - Valgrind-clean with MPICH;
>>>> - Valgrind-clean with OpenMPI 4.0.5;
>>>> - not Valgrind-clean with OpenMPI 4.1.0.
>>>> I'm not sure who is to blame here; I'll need to look at the MPI specification for what is required of implementors and users in that case.
>>>>
>>>> In the meantime, I'll do the following:
>>>> - update config/BuildSystem/config/packages/OpenMPI.py to use OpenMPI 4.1.0, see if any other error appears;
>>>> - provide a hotfix to bypass the segfaults;
>>>
>>> I can confirm that splitting the single Allreduce with my own MPI_Op into two Allreduce calls with MAX and BAND fixes the segfaults with OpenMPI (*).
>>>
>>>> - look at the hypre issue and whether it should be deferred to the hypre team.
>>>
>>> I don't know if there is something wrong in hypre's threading or if it's just a side effect of threading, but it seems that the number of threads has a drastic effect on the quality of the PC.
>>> By default, it looks like there are two threads per process with your Docker image.
>>> If I force OMP_NUM_THREADS=1, then I get the same convergence as in the output file.
>>>
>>> Thanks,
>>> Pierre
>>>
>>> (*) https://gitlab.com/petsc/petsc/-/merge_requests/3712
>>>
>>>> Thank you for the Docker files, they were really useful.
>>>> If you want to avoid oversubscription failures, you can edit the file /opt/openmpi-4.1.0/etc/openmpi-default-hostfile and append the line:
>>>> localhost slots=12
>>>> If you want to increase the timeout limit of the PETSc test suite for each test, you can add the extra flag TIMEOUT=180 on your command line (default is 60, units are seconds).
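The workaround in the merge request above replaces one Allreduce with a user-defined MPI_Op by two Allreduce calls with the built-in MAX and BAND operations. Why this loses nothing can be sketched without MPI: if the custom op reduces each component of a (value, mask) pair independently, reducing the pairs in one pass and reducing the components in two passes give identical results. The pairing below is illustrative only, not PETSc's actual operator:

```python
from functools import reduce

def custom_op(a, b):
    # Element-wise reduction a user-defined MPI_Op might perform:
    # max on the first component, bitwise AND on the second.
    # (Illustrative pairing only, not PETSc's actual operator.)
    return [(max(x[0], y[0]), x[1] & y[1]) for x, y in zip(a, b)]

# One contribution per "rank": lists of (value, mask) pairs.
contributions = [
    [(3, 0b1011), (7, 0b0110)],
    [(5, 0b1110), (2, 0b0111)],
    [(4, 0b1001), (9, 0b1111)],
]

# Single reduction with the custom op (one Allreduce)...
combined = reduce(custom_op, contributions)

# ...versus two reductions with the built-in MAX and BAND semantics
# (two Allreduce calls), then re-pairing the results.
maxes = [max(c[i][0] for c in contributions) for i in range(2)]
masks = [reduce(lambda x, y: x & y, (c[i][1] for c in contributions)) for i in range(2)]
split = list(zip(maxes, masks))

assert combined == split  # the split reduction loses no information
```

Whether OpenMPI 4.1.0's handling of user-defined ops with MPI_IN_PLACE is actually at fault is left open in the thread; the sketch only shows that the two formulations are equivalent.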
>>>>
>>>> Thanks, I'll ping you on GitLab when I've got something ready for you to try,
>>>> Pierre
>>>>
>>>> <ompi.cxx>
>>>>
>>>>> On 12 Mar 2021, at 8:54 PM, Eric Chamberland <[email protected]> wrote:
>>>>>
>>>>> Hi Pierre,
>>>>>
>>>>> I now have a Docker container reproducing the problems here.
>>>>>
>>>>> Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm, it fails like this:
>>>>>
>>>>> not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
>>>>> # Initial guess
>>>>> # L_2 Error: 0.00803099
>>>>> # Initial Residual
>>>>> # L_2 Residual: 1.09057
>>>>> # Au - b = Au + F(0)
>>>>> # Linear L_2 Residual: 1.09057
>>>>> # [d470c54ce086:14127] Read -1, expected 4096, errno = 1
>>>>> # [d470c54ce086:14128] Read -1, expected 4096, errno = 1
>>>>> # [d470c54ce086:14129] Read -1, expected 4096, errno = 1
>>>>> # [3]PETSC ERROR: ------------------------------------------------------------------------
>>>>> # [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>>>> # [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>> # [3]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>> # [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>>>> # [3]PETSC ERROR: likely location of problem given in stack below
>>>>> # [3]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>>>> # [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>>>> # [3]PETSC ERROR: INSTEAD the line number of the start of the function
>>>>> # [3]PETSC ERROR: is given.
>>>>> # [3]PETSC ERROR: [3] buildTwo line 987 /opt/petsc-main/include/HPDDM_schwarz.hpp
>>>>> # [3]PETSC ERROR: [3] next line 1130 /opt/petsc-main/include/HPDDM_schwarz.hpp
>>>>> # [3]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>>> # [3]PETSC ERROR: Signal received
>>>>> # [3]PETSC ERROR: [0]PETSC ERROR: ------------------------------------------------------------------------
>>>>>
>>>>> Also, ex12_quad_hpddm_reuse_baij fails with a lot more "Read -1, expected ..." messages, which I don't know where they come from...?
>>>>>
>>>>> Hypre (as in diff-snes_tutorials-ex56_hypre) is also having DIVERGED_INDEFINITE_PC failures...
>>>>>
>>>>> Please see the 3 attached Docker files:
>>>>>
>>>>> 1) fedora_mkl_and_devtools: the Dockerfile which installs Fedora 33 with the GNU compilers, MKL, and everything needed for development.
>>>>>
>>>>> 2) openmpi: the Dockerfile to build OpenMPI.
>>>>>
>>>>> 3) petsc: the last Dockerfile, which builds, installs, and tests PETSc.
>>>>>
>>>>> I build the 3 like this:
>>>>>
>>>>> docker build -t fedora_mkl_and_devtools -f fedora_mkl_and_devtools .
>>>>>
>>>>> docker build -t openmpi -f openmpi .
>>>>>
>>>>> docker build -t petsc -f petsc .
>>>>>
>>>>> Disclaimer: I am not a Docker expert, so I may do things that are not Docker state-of-the-art, but I am open to suggestions... ;)
>>>>>
>>>>> I have just run it on my laptop (which took a while), which does not have enough cores, so many more tests failed (I should force --oversubscribe but don't know how to). I will relaunch on my workstation in a few minutes.
>>>>>
>>>>> I will now test your branch! (Sorry for the delay.)
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Eric
>>>>>
>>>>> On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
>>>>>> Hi Pierre,
>>>>>>
>>>>>> OK, that's interesting!
>>>>>>
>>>>>> I will try to build a Docker image by tomorrow and give you the exact recipe to reproduce the bugs.
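Eric mentions above not knowing how to force --oversubscribe. One possibility, assuming the PETSc test harness honors an MPIEXEC override (an assumption, not confirmed in this thread), is to pass a wrapper launcher so OpenMPI allows more ranks than physical cores:

```shell
# Hypothetical invocation: override the MPI launcher used by the test suite
# so OpenMPI permits oversubscription on a machine with few cores, and raise
# the per-test timeout at the same time.
make -f gmakefile test MPIEXEC="mpiexec --oversubscribe" TIMEOUT=180
```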
>>>>>>
>>>>>> Eric
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 11 Mar 2021, at 6:16 AM, Barry Smith <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Eric,
>>>>>>>>
>>>>>>>> Sorry about not being more immediate. We still have this in our active email, so you don't need to submit individual issues. We'll try to get to them as soon as we can.
>>>>>>>
>>>>>>> Indeed, I'm still trying to figure this out.
>>>>>>> I realized that some of my configure flags were different from yours, e.g., no --with-memalign.
>>>>>>> I've also added SuperLU_DIST to my installation.
>>>>>>> Still, I can't reproduce any issue.
>>>>>>> I will continue looking into this; it appears I'm seeing some valgrind errors, but I don't know if this is some side effect of OpenMPI not being valgrind-clean (last time I checked, there was no error with MPICH).
>>>>>>>
>>>>>>> Thank you for your patience,
>>>>>>> Pierre
>>>>>>>
>>>>>>> /usr/bin/gmake -f gmakefile test test-fail=1
>>>>>>> Using MAKEFLAGS: test-fail=1
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
>>>>>>> ok snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>>>> ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
>>>>>>> ok ksp_ksp_tests-ex33_superlu_dist_2
>>>>>>> ok diff-ksp_ksp_tests-ex33_superlu_dist_2
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
>>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
>>>>>>> ok ksp_ksp_tutorials-ex50_tut_2
>>>>>>> ok diff-ksp_ksp_tutorials-ex50_tut_2
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
>>>>>>> ok ksp_ksp_tests-ex33_superlu_dist
>>>>>>> ok diff-ksp_ksp_tests-ex33_superlu_dist
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
>>>>>>> ok snes_tutorials-ex56_hypre
>>>>>>> ok diff-snes_tutorials-ex56_hypre
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
>>>>>>> ok ksp_ksp_tutorials-ex56_2
>>>>>>> ok diff-ksp_ksp_tutorials-ex56_2
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
>>>>>>> ok snes_tutorials-ex17_3d_q3_trig_elas
>>>>>>> ok diff-snes_tutorials-ex17_3d_q3_trig_elas
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
>>>>>>> ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>>>> ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
>>>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
>>>>>>> # srun: error: Unable to create step for job 1426755: More processors requested than permitted
>>>>>>> ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command failed so no diff
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
>>>>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran required for this test
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
>>>>>>> ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>>>> ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
>>>>>>> ok snes_tutorials-ex19_tut_3
>>>>>>> ok diff-snes_tutorials-ex19_tut_3
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
>>>>>>> ok snes_tutorials-ex17_3d_q3_trig_vlap
>>>>>>> ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
>>>>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran required for this test
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
>>>>>>> ok snes_tutorials-ex19_superlu_dist
>>>>>>> ok diff-snes_tutorials-ex19_superlu_dist
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
>>>>>>> ok snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>>>> ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
>>>>>>> ok ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>>>> ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
>>>>>>> ok snes_tutorials-ex19_superlu_dist_2
>>>>>>> ok diff-snes_tutorials-ex19_superlu_dist_2
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
>>>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
>>>>>>> # srun: error: Unable to create step for job 1426755: More processors requested than permitted
>>>>>>> ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command failed so no diff
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
>>>>>>> ok snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>>>> ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
>>>>>>> ok ksp_ksp_tutorials-ex64_1
>>>>>>> ok diff-ksp_ksp_tutorials-ex64_1
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
>>>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
>>>>>>> # srun: error: Unable to create step for job 1426755: More processors requested than permitted
>>>>>>> ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command failed so no diff
>>>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
>>>>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran required for this test
>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Mar 10, 2021, at 11:03 PM, Eric Chamberland <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Barry,
>>>>>>>>>
>>>>>>>>> To get some follow-up on the --with-openmp=1 failures, shall I open GitLab issues for:
>>>>>>>>>
>>>>>>>>> a) all hypre failures giving DIVERGED_INDEFINITE_PC
>>>>>>>>>
>>>>>>>>> b) all superlu_dist failures giving different results with initia and "Exceeded timeout limit of 60 s"
>>>>>>>>>
>>>>>>>>> c) hpddm failures "free(): invalid next size (fast)" and "Segmentation Violation"
>>>>>>>>>
>>>>>>>>> d) all tao's "Exceeded timeout limit of 60 s"
>>>>>>>>>
>>>>>>>>> I don't see how I could do all this debugging by myself...
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Eric
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> Eric Chamberland, ing., M. Ing
>>>>>> Professionnel de recherche
>>>>>> GIREF/Université Laval
>>>>>> (418) 656-2131 poste 41 22 42
>>>>> --
>>>>> Eric Chamberland, ing., M. Ing
>>>>> Professionnel de recherche
>>>>> GIREF/Université Laval
>>>>> (418) 656-2131 poste 41 22 42
>>>>> <fedora_mkl_and_devtools.txt><openmpi.txt><petsc.txt>
>>>>
>>>
>>
> --
> Eric Chamberland, ing., M. Ing
> Professionnel de recherche
> GIREF/Université Laval
> (418) 656-2131 poste 41 22 42
