-pc_hypre_boomeramg_relax_type_all Jacobi =>
  Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 3

-pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi =>
OK, independently of the architecture it seems (Eric's Docker image with 1 or 2 threads, or my macOS), but the contraction factor is higher:
  Linear solve converged due to CONVERGED_RTOL iterations 8
  Linear solve converged due to CONVERGED_RTOL iterations 24
  Linear solve converged due to CONVERGED_RTOL iterations 26
v. currently
  Linear solve converged due to CONVERGED_RTOL iterations 7
  Linear solve converged due to CONVERGED_RTOL iterations 9
  Linear solve converged due to CONVERGED_RTOL iterations 10

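For readers unfamiliar with the two relaxation options compared above, here is a minimal pure-Python sketch contrasting plain Jacobi with a simplified l1-Jacobi, in which each diagonal entry is replaced by the l1 norm of its whole row. This is not hypre's implementation (its l1-scaled smoother may build the scaling differently), and the 3x3 matrix, right-hand side, and the helper name relax are made up for illustration. On this symmetric positive definite but not diagonally dominant matrix, undamped Jacobi diverges while the l1 variant converges, though slowly. It does not reproduce the DIVERGED_INDEFINITE_PC failure from ex56; it only shows why the two options can behave so differently.

```python
def relax(A, b, diag, sweeps):
    """Run `sweeps` iterations of x <- x + D^{-1} (b - A x), starting from x = 0.

    `diag` holds the diagonal of D. Returns the max-norm of the residual
    b - A x measured after each sweep.
    """
    n = len(b)
    x = [0.0] * n
    history = []
    for _ in range(sweeps):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + r[i] / diag[i] for i in range(n)]
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        history.append(max(abs(v) for v in r))
    return history


# Hypothetical test matrix: SPD (eigenvalues 2.8, 0.1, 0.1) but far from
# diagonally dominant, so undamped Jacobi is not guaranteed to converge.
A = [[1.0, 0.9, 0.9],
     [0.9, 1.0, 0.9],
     [0.9, 0.9, 1.0]]
b = [1.0, 2.0, 3.0]

# Plain Jacobi: D = diag(A).
jacobi_diag = [A[i][i] for i in range(3)]
# Simplified l1-Jacobi (assumption): D_ii = sum_j |a_ij|.
l1_diag = [sum(abs(a) for a in row) for row in A]

jacobi_res = relax(A, b, jacobi_diag, 20)
l1_res = relax(A, b, l1_diag, 20)

print("Jacobi    residual after 20 sweeps: %.3e" % jacobi_res[-1])
print("l1-Jacobi residual after 20 sweeps: %.3e" % l1_res[-1])
```

The larger diagonal makes the l1 variant safe but also damps every correction, which is consistent with the higher iteration counts reported above.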
Do we change this? Or should we force OMP_NUM_THREADS=1 for make test?

Thanks,
Pierre

> On 13 Mar 2021, at 2:26 PM, Mark Adams <[email protected]> wrote:
>
> Hypre uses a multiplicative smoother by default. It has a Chebyshev smoother.
> That with a Jacobi PC should be thread invariant.
> Mark
>
> On Sat, Mar 13, 2021 at 8:18 AM Pierre Jolivet <[email protected]> wrote:
>
>> On 13 Mar 2021, at 9:17 AM, Pierre Jolivet <[email protected]> wrote:
>>
>> Hello Eric,
>> I’ve made an “interesting” discovery, so I’ll put back the list in c/c.
>> It appears the following snippet of code which uses Allreduce() + lambda
>> function + MPI_IN_PLACE is:
>> - Valgrind-clean with MPICH;
>> - Valgrind-clean with OpenMPI 4.0.5;
>> - not Valgrind-clean with OpenMPI 4.1.0.
>> I’m not sure who is to blame here, I’ll need to look at the MPI
>> specification for what is required by the implementors and users in that
>> case.
>>
>> In the meantime, I’ll do the following:
>> - update config/BuildSystem/config/packages/OpenMPI.py to use OpenMPI 4.1.0,
>> see if any other error appears;
>> - provide a hotfix to bypass the segfaults;
>
> I can confirm that splitting the single Allreduce with my own MPI_Op into two
> Allreduce with MAX and BAND fixes the segfaults with OpenMPI (*).
>
>> - look at the hypre issue and whether they should be deferred to the hypre
>> team.
>
> I don’t know if there is something wrong in hypre threading or if it’s just a
> side effect of threading, but it seems that the number of threads has a
> drastic effect on the quality of the PC.
> By default, it looks like there are two threads per process with your Docker
> image.
> If I force OMP_NUM_THREADS=1, then I get the same convergence as in the
> output file.
>
> Thanks,
> Pierre
>
> (*) https://gitlab.com/petsc/petsc/-/merge_requests/3712
>> Thank you for the Docker files, they were really useful.
>> If you want to avoid oversubscription failures, you can edit the file
>> /opt/openmpi-4.1.0/etc/openmpi-default-hostfile and append the line:
>> localhost slots=12
>> If you want to increase the timeout limit of PETSc test suite for each test,
>> you can add the extra flag in your command line TIMEOUT=180 (default is 60,
>> units are seconds).
>>
>> Thanks, I’ll ping you on GitLab when I’ve got something ready for you to try,
>> Pierre
>>
>> <ompi.cxx>
>>
>>> On 12 Mar 2021, at 8:54 PM, Eric Chamberland <[email protected]> wrote:
>>>
>>> Hi Pierre,
>>>
>>> I now have a docker container reproducing the problems here.
>>>
>>> Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm it fails
>>> like this:
>>>
>>> not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
>>> # Initial guess
>>> # L_2 Error: 0.00803099
>>> # Initial Residual
>>> # L_2 Residual: 1.09057
>>> # Au - b = Au + F(0)
>>> # Linear L_2 Residual: 1.09057
>>> # [d470c54ce086:14127] Read -1, expected 4096, errno = 1
>>> # [d470c54ce086:14128] Read -1, expected 4096, errno = 1
>>> # [d470c54ce086:14129] Read -1, expected 4096, errno = 1
>>> # [3]PETSC ERROR: ------------------------------------------------------------------------
>>> # [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>> # [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> # [3]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> # [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> # [3]PETSC ERROR: likely location of problem given in stack below
>>> # [3]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>> # [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>> # [3]PETSC ERROR: INSTEAD the line number of the start of the function
>>> # [3]PETSC ERROR: is given.
>>> # [3]PETSC ERROR: [3] buildTwo line 987 /opt/petsc-main/include/HPDDM_schwarz.hpp
>>> # [3]PETSC ERROR: [3] next line 1130 /opt/petsc-main/include/HPDDM_schwarz.hpp
>>> # [3]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>> # [3]PETSC ERROR: Signal received
>>> # [3]PETSC ERROR: [0]PETSC ERROR: ------------------------------------------------------------------------
>>>
>>> Also, ex12_quad_hpddm_reuse_baij fails with a lot more "Read -1, expected
>>> ..." messages, and I don't know where they come from...?
>>>
>>> Hypre (like in diff-snes_tutorials-ex56_hypre) is also having
>>> DIVERGED_INDEFINITE_PC failures...
>>>
>>> Please see the 3 attached docker files:
>>>
>>> 1) fedora_mkl_and_devtools: the Dockerfile which installs Fedora 33 with
>>> GNU compilers, MKL, and everything needed to develop.
>>>
>>> 2) openmpi: the Dockerfile to build OpenMPI
>>>
>>> 3) petsc: the last Dockerfile, which builds, installs, and tests PETSc
>>>
>>> I build the 3 like this:
>>>
>>> docker build -t fedora_mkl_and_devtools -f fedora_mkl_and_devtools .
>>>
>>> docker build -t openmpi -f openmpi .
>>>
>>> docker build -t petsc -f petsc .
>>>
>>> Disclaimer: I am not a docker expert, so I may do things that are not
>>> docker state-of-the-art, but I am open to suggestions... ;)
>>>
>>> I have just run it on my laptop (it took a long time), which does not have
>>> enough cores, so many more tests failed (I should force --oversubscribe but
>>> don't know how to). I will relaunch on my workstation in a few minutes.
>>>
>>> I will now test your branch! (sorry for the delay).
>>>
>>> Thanks,
>>>
>>> Eric
>>>
>>> On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
>>>> Hi Pierre,
>>>>
>>>> ok, that's interesting!
>>>>
>>>> I will try to build a docker image by tomorrow and give you the exact
>>>> recipe to reproduce the bugs.
>>>>
>>>> Eric
>>>>
>>>>
>>>>
>>>> On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
>>>>>
>>>>>
>>>>>> On 11 Mar 2021, at 6:16 AM, Barry Smith <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> Eric,
>>>>>>
>>>>>> Sorry about not being more immediate. We still have this in our
>>>>>> active email so you don't need to submit individual issues. We'll try to
>>>>>> get to them as soon as we can.
>>>>>
>>>>> Indeed, I’m still trying to figure this out.
>>>>> I realized that some of my configure flags were different than yours,
>>>>> e.g., no --with-memalign.
>>>>> I’ve also added SuperLU_DIST to my installation.
>>>>> Still, I can’t reproduce any issue.
>>>>> I will continue looking into this, it appears I’m seeing some valgrind
>>>>> errors, but I don’t know if this is some side effect of OpenMPI not being
>>>>> valgrind-clean (last time I checked, there was no error with MPICH).
>>>>>
>>>>> Thank you for your patience,
>>>>> Pierre
>>>>>
>>>>> /usr/bin/gmake -f gmakefile test test-fail=1
>>>>> Using MAKEFLAGS: test-fail=1
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
>>>>> ok snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>> ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
>>>>> ok ksp_ksp_tests-ex33_superlu_dist_2
>>>>> ok diff-ksp_ksp_tests-ex33_superlu_dist_2
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
>>>>> ok ksp_ksp_tutorials-ex50_tut_2
>>>>> ok diff-ksp_ksp_tutorials-ex50_tut_2
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
>>>>> ok ksp_ksp_tests-ex33_superlu_dist
>>>>> ok diff-ksp_ksp_tests-ex33_superlu_dist
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
>>>>> ok snes_tutorials-ex56_hypre
>>>>> ok diff-snes_tutorials-ex56_hypre
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
>>>>> ok ksp_ksp_tutorials-ex56_2
>>>>> ok diff-ksp_ksp_tutorials-ex56_2
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
>>>>> ok snes_tutorials-ex17_3d_q3_trig_elas
>>>>> ok diff-snes_tutorials-ex17_3d_q3_trig_elas
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
>>>>> ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>> ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
>>>>> # srun: error: Unable to create step for job 1426755: More processors requested than permitted
>>>>> ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command failed so no diff
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
>>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran required for this test
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
>>>>> ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>> ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
>>>>> ok snes_tutorials-ex19_tut_3
>>>>> ok diff-snes_tutorials-ex19_tut_3
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
>>>>> ok snes_tutorials-ex17_3d_q3_trig_vlap
>>>>> ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
>>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran required for this test
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
>>>>> ok snes_tutorials-ex19_superlu_dist
>>>>> ok diff-snes_tutorials-ex19_superlu_dist
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
>>>>> ok snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>> ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
>>>>> ok ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>> ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
>>>>> ok snes_tutorials-ex19_superlu_dist_2
>>>>> ok diff-snes_tutorials-ex19_superlu_dist_2
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
>>>>> # srun: error: Unable to create step for job 1426755: More processors requested than permitted
>>>>> ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command failed so no diff
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
>>>>> ok snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>> ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
>>>>> ok ksp_ksp_tutorials-ex64_1
>>>>> ok diff-ksp_ksp_tutorials-ex64_1
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
>>>>> # srun: error: Unable to create step for job 1426755: More processors requested than permitted
>>>>> ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command failed so no diff
>>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
>>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran required for this test
>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>>> On Mar 10, 2021, at 11:03 PM, Eric Chamberland <[email protected]> wrote:
>>>>>>>
>>>>>>> Barry,
>>>>>>>
>>>>>>> to get some follow-up on --with-openmp=1 failures, shall I open
>>>>>>> gitlab issues for:
>>>>>>>
>>>>>>> a) all hypre failures giving DIVERGED_INDEFINITE_PC
>>>>>>>
>>>>>>> b) all superlu_dist failures giving different results with initia and
>>>>>>> "Exceeded timeout limit of 60 s"
>>>>>>>
>>>>>>> c) hpddm failures "free(): invalid next size (fast)" and "Segmentation
>>>>>>> Violation"
>>>>>>>
>>>>>>> d) all tao's "Exceeded timeout limit of 60 s"
>>>>>>>
>>>>>>> I don't see how I could do all this debugging by myself...
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Eric
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>> --
>>>> Eric Chamberland, ing., M. Ing
>>>> Professionnel de recherche
>>>> GIREF/Université Laval
>>>> (418) 656-2131 poste 41 22 42
>>> --
>>> Eric Chamberland, ing., M. Ing
>>> Professionnel de recherche
>>> GIREF/Université Laval
>>> (418) 656-2131 poste 41 22 42
>>> <fedora_mkl_and_devtools.txt><openmpi.txt><petsc.txt>
>>
>
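
Editorial sketch, not part of the original messages: the hotfix mentioned in the thread replaces one Allreduce using a custom MPI_Op with two Allreduce calls using MAX and BAND. The attached ompi.cxx is not shown, so the following assumes the custom operation combined a maximum on one field with a bitwise AND on another. Under that assumption, this small Python simulation (no MPI involved; fused_op and the per-rank data are hypothetical) shows why the split gives the same result: the two fields never interact, so reducing them separately is equivalent.

```python
from functools import reduce


def fused_op(a, b):
    """Hypothetical stand-in for a user-defined MPI_Op on (value, flags) pairs:
    maximum on the first field, bitwise AND on the second."""
    return (max(a[0], b[0]), a[1] & b[1])


# One (value, flags) pair per simulated rank; the numbers are made up.
contributions = [(3, 0b1110), (7, 0b0111), (5, 0b1111), (2, 0b0110)]

# Single reduction with the fused operation.
fused = reduce(fused_op, contributions)

# Two independent reductions, mirroring Allreduce(MAX) then Allreduce(BAND).
split_max = max(value for value, _ in contributions)
split_band = reduce(lambda x, y: x & y, (flags for _, flags in contributions))

print(fused, (split_max, split_band))
```

The trade-off is one extra collective call in exchange for relying only on predefined MPI operations.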
