> On 13 Mar 2021, at 9:17 AM, Pierre Jolivet <[email protected]> wrote:
>
> Hello Eric,
> I’ve made an “interesting” discovery, so I’ll put back the list in c/c.
> It appears the following snippet of code which uses Allreduce() + lambda function + MPI_IN_PLACE is:
> - Valgrind-clean with MPICH;
> - Valgrind-clean with OpenMPI 4.0.5;
> - not Valgrind-clean with OpenMPI 4.1.0.
> I’m not sure who is to blame here, I’ll need to look at the MPI specification for what is required by the implementors and users in that case.
>
> In the meantime, I’ll do the following:
> - update config/BuildSystem/config/packages/OpenMPI.py to use OpenMPI 4.1.0, see if any other error appears;
> - provide a hotfix to bypass the segfaults;
I can confirm that splitting the single Allreduce with my own MPI_Op into two Allreduces with MAX and BAND fixes the segfaults with OpenMPI (*).
> - look at the hypre issue and whether it should be deferred to the hypre team.
I don’t know if there is something wrong in hypre threading or if it’s just a side effect of threading, but it seems that the number of threads has a drastic effect on the quality of the PC.
By default, it looks like there are two threads per process with your Docker image. If I force OMP_NUM_THREADS=1, then I get the same convergence as in the output file.

Thanks,
Pierre

(*) https://gitlab.com/petsc/petsc/-/merge_requests/3712 <https://gitlab.com/petsc/petsc/-/merge_requests/3712>
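For readers without the ompi.cxx attachment, here is a minimal, self-contained sketch of the pattern under discussion; the two-integer payload and the max/bitwise-AND combination are illustrative assumptions, not the actual PETSc/HPDDM code. Variant 1 is the single Allreduce with a user-defined MPI_Op (built from a capture-less lambda) and MPI_IN_PLACE; variant 2 is the workaround of (*): two Allreduces with the built-in MPI_MAX and MPI_BAND.

#include <mpi.h>
#include <algorithm>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Hypothetical payload: buf[0] is reduced with a max, buf[1] with a bitwise AND.
  int buf[2] = { rank, rank % 2 };

  // Treat a pair of ints as one element, so the user function always sees whole pairs.
  MPI_Datatype pair_t;
  MPI_Type_contiguous(2, MPI_INT, &pair_t);
  MPI_Type_commit(&pair_t);

  // Capture-less lambda; it converts implicitly to the MPI_User_function pointer.
  auto combine = [](void *in, void *inout, int *len, MPI_Datatype *) {
    const int *a = static_cast<const int *>(in);
    int *b = static_cast<int *>(inout);
    for (int i = 0; i < *len; ++i) {
      b[2 * i]     = std::max(a[2 * i], b[2 * i]); // max part
      b[2 * i + 1] = a[2 * i + 1] & b[2 * i + 1];  // bitwise AND part
    }
  };

  // Variant 1: single Allreduce with the user-defined op and MPI_IN_PLACE
  // (the pattern reported as not Valgrind-clean with OpenMPI 4.1.0).
  MPI_Op op;
  MPI_Op_create(combine, 1 /* commutative */, &op);
  MPI_Allreduce(MPI_IN_PLACE, buf, 1, pair_t, op, MPI_COMM_WORLD);
  MPI_Op_free(&op);
  MPI_Type_free(&pair_t);

  // Variant 2: the workaround, two Allreduces with built-in reductions.
  int buf2[2] = { rank, rank % 2 };
  MPI_Allreduce(MPI_IN_PLACE, &buf2[0], 1, MPI_INT, MPI_MAX,  MPI_COMM_WORLD);
  MPI_Allreduce(MPI_IN_PLACE, &buf2[1], 1, MPI_INT, MPI_BAND, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}

The split costs one extra reduction, but it avoids the user-defined-op + MPI_IN_PLACE code path entirely, which is what sidesteps the segfaults reported above.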
> Thank you for the Docker files, they were really useful.
> If you want to avoid oversubscription failures, you can edit the file /opt/openmpi-4.1.0/etc/openmpi-default-hostfile and append the line:
> localhost slots=12
> If you want to increase the timeout limit of the PETSc test suite for each test, you can add the extra flag TIMEOUT=180 to your command line (default is 60, units are seconds).
>
> Thanks, I’ll ping you on GitLab when I’ve got something ready for you to try,
> Pierre
>
> <ompi.cxx>
>
>> On 12 Mar 2021, at 8:54 PM, Eric Chamberland <[email protected] <mailto:[email protected]>> wrote:
>>
>> Hi Pierre,
>>
>> I now have a docker container reproducing the problems here.
>>
>> Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm it fails like this:
>>
>> not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
>> # Initial guess
>> # L_2 Error: 0.00803099
>> # Initial Residual
>> # L_2 Residual: 1.09057
>> # Au - b = Au + F(0)
>> # Linear L_2 Residual: 1.09057
>> # [d470c54ce086:14127] Read -1, expected 4096, errno = 1
>> # [d470c54ce086:14128] Read -1, expected 4096, errno = 1
>> # [d470c54ce086:14129] Read -1, expected 4096, errno = 1
>> # [3]PETSC ERROR: ------------------------------------------------------------------------
>> # [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>> # [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> # [3]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind <https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind>
>> # [3]PETSC ERROR: or try http://valgrind.org <http://valgrind.org/> on GNU/linux and Apple Mac OS X to find memory corruption errors
>> # [3]PETSC ERROR: likely location of problem given in stack below
>> # [3]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>> # [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> # [3]PETSC ERROR: INSTEAD the line number of the start of the function
>> # [3]PETSC ERROR: is given.
>> # [3]PETSC ERROR: [3] buildTwo line 987 /opt/petsc-main/include/HPDDM_schwarz.hpp
>> # [3]PETSC ERROR: [3] next line 1130 /opt/petsc-main/include/HPDDM_schwarz.hpp
>> # [3]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>> # [3]PETSC ERROR: Signal received
>> # [3]PETSC ERROR: [0]PETSC ERROR: ------------------------------------------------------------------------
>>
>> also ex12_quad_hpddm_reuse_baij fails with a lot more "Read -1, expected ..." which I don't know where they come from...?
>>
>> Hypre (like in diff-snes_tutorials-ex56_hypre) is also having DIVERGED_INDEFINITE_PC failures...
>>
>> Please see the 3 attached Dockerfiles:
>>
>> 1) fedora_mkl_and_devtools: the Dockerfile which installs Fedora 33 with GNU compilers, MKL and everything needed to develop.
>>
>> 2) openmpi: the Dockerfile to build OpenMPI.
>>
>> 3) petsc: the last Dockerfile, which builds, installs and tests PETSc.
>>
>> I build the 3 like this:
>>
>> docker build -t fedora_mkl_and_devtools -f fedora_mkl_and_devtools .
>>
>> docker build -t openmpi -f openmpi .
>>
>> docker build -t petsc -f petsc .
>>
>> Disclaimer: I am not a docker expert, so I may do things that are not docker-state-of-the-art, but I am open to suggestions... ;)
>>
>> I have just run it on my laptop (long), which does not have enough cores, so many more tests failed (I should force --oversubscribe but don't know how to). I will relaunch on my workstation in a few minutes.
>>
>> I will now test your branch! (sorry for the delay).
>>
>> Thanks,
>>
>> Eric
>>
>> On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
>>> Hi Pierre,
>>>
>>> ok, that's interesting!
>>>
>>> I will try to build a docker image by tomorrow and give you the exact recipe to reproduce the bugs.
>>>
>>> Eric
>>>
>>>
>>>
>>> On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
>>>>
>>>>
>>>>> On 11 Mar 2021, at 6:16 AM, Barry Smith <[email protected] <mailto:[email protected]>> wrote:
>>>>>
>>>>>
>>>>> Eric,
>>>>>
>>>>> Sorry about not being more immediate. We still have this in our active email so you don't need to submit individual issues. We'll try to get to them as soon as we can.
>>>>
>>>> Indeed, I’m still trying to figure this out.
>>>> I realized that some of my configure flags were different than yours, e.g., no --with-memalign.
>>>> I’ve also added SuperLU_DIST to my installation.
>>>> Still, I can’t reproduce any issue.
>>>> I will continue looking into this, it appears I’m seeing some valgrind errors, but I don’t know if this is some side effect of OpenMPI not being valgrind-clean (last time I checked, there was no error with MPICH).
>>>>
>>>> Thank you for your patience,
>>>> Pierre
>>>>
>>>> /usr/bin/gmake -f gmakefile test test-fail=1
>>>> Using MAKEFLAGS: test-fail=1
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
>>>> ok snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>> ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
>>>> ok ksp_ksp_tests-ex33_superlu_dist_2
>>>> ok diff-ksp_ksp_tests-ex33_superlu_dist_2
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>> ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>> ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
>>>> ok ksp_ksp_tutorials-ex50_tut_2
>>>> ok diff-ksp_ksp_tutorials-ex50_tut_2
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
>>>> ok ksp_ksp_tests-ex33_superlu_dist
>>>> ok diff-ksp_ksp_tests-ex33_superlu_dist
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
>>>> ok snes_tutorials-ex56_hypre
>>>> ok diff-snes_tutorials-ex56_hypre
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
>>>> ok ksp_ksp_tutorials-ex56_2
>>>> ok diff-ksp_ksp_tutorials-ex56_2
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
>>>> ok snes_tutorials-ex17_3d_q3_trig_elas
>>>> ok diff-snes_tutorials-ex17_3d_q3_trig_elas
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
>>>> ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>> ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
>>>> # srun: error: Unable to create step for job 1426755: More processors requested than permitted
>>>> ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command failed so no diff
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran required for this test
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
>>>> ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>> ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
>>>> ok snes_tutorials-ex19_tut_3
>>>> ok diff-snes_tutorials-ex19_tut_3
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
>>>> ok snes_tutorials-ex17_3d_q3_trig_vlap
>>>> ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran required for this test
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
>>>> ok snes_tutorials-ex19_superlu_dist
>>>> ok diff-snes_tutorials-ex19_superlu_dist
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
>>>> ok snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>> ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
>>>> ok ksp_ksp_tutorials-ex49_hypre_nullspace
>>>> ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
>>>> ok snes_tutorials-ex19_superlu_dist_2
>>>> ok diff-snes_tutorials-ex19_superlu_dist_2
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
>>>> # srun: error: Unable to create step for job 1426755: More processors requested than permitted
>>>> ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command failed so no diff
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
>>>> ok snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>> ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
>>>> ok ksp_ksp_tutorials-ex64_1
>>>> ok diff-ksp_ksp_tutorials-ex64_1
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
>>>> # srun: error: Unable to create step for job 1426755: More processors requested than permitted
>>>> ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command failed so no diff
>>>> TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
>>>> ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran required for this test
>>>>
>>>>> Barry
>>>>>
>>>>>
>>>>>> On Mar 10, 2021, at 11:03 PM, Eric Chamberland <[email protected] <mailto:[email protected]>> wrote:
>>>>>>
>>>>>> Barry,
>>>>>>
>>>>>> to get some follow-up on --with-openmp=1 failures, shall I open gitlab issues for:
>>>>>>
>>>>>> a) all hypre failures giving DIVERGED_INDEFINITE_PC
>>>>>>
>>>>>> b) all superlu_dist failures giving different results with initia and "Exceeded timeout limit of 60 s"
>>>>>>
>>>>>> c) hpddm failures "free(): invalid next size (fast)" and "Segmentation Violation"
>>>>>>
>>>>>> d) all tao's "Exceeded timeout limit of 60 s"
>>>>>>
>>>>>> I don't see how I could do all this debugging by myself...
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Eric
>>>>>>
>>>>>>
>>>>>
>>>>
>>> --
>>> Eric Chamberland, ing., M. Ing
>>> Professionnel de recherche
>>> GIREF/Université Laval
>>> (418) 656-2131 poste 41 22 42
>> --
>> Eric Chamberland, ing., M. Ing
>> Professionnel de recherche
>> GIREF/Université Laval
>> (418) 656-2131 poste 41 22 42
>> <fedora_mkl_and_devtools.txt><openmpi.txt><petsc.txt>
>
