-pc_hypre_boomeramg_relax_type_all Jacobi => 
  Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 3
-pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi => 
OK, seemingly independent of the architecture (Eric's Docker image with 1 or 2 
threads, or my macOS), but the contraction factor is higher
  Linear solve converged due to CONVERGED_RTOL iterations 8
  Linear solve converged due to CONVERGED_RTOL iterations 24
  Linear solve converged due to CONVERGED_RTOL iterations 26
vs. currently
  Linear solve converged due to CONVERGED_RTOL iterations 7
  Linear solve converged due to CONVERGED_RTOL iterations 9
  Linear solve converged due to CONVERGED_RTOL iterations 10
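
To put a rough number on "contraction factor is higher", here is a small sketch that turns the iteration counts above into an average residual reduction per iteration. The relative tolerance of 1e-5 is an assumption (it is not stated in this thread), and a Krylov method does not contract at a fixed rate, so the values are only indicative.

```python
# Average residual reduction per iteration implied by an iteration count.
# RTOL is an assumption (not given in this thread); adjust it to the
# tolerance actually used by the test.
RTOL = 1e-5


def avg_reduction(iterations, rtol=RTOL):
    """Return rho such that rho**iterations == rtol (geometric mean rate)."""
    return rtol ** (1.0 / iterations)


l1scaled_jacobi = [8, 24, 26]  # iteration counts with l1scaled-Jacobi
current_default = [7, 9, 10]   # iteration counts with the current smoother

for old, new in zip(current_default, l1scaled_jacobi):
    print(f"current: {old:2d} its, rho ~ {avg_reduction(old):.2f} | "
          f"l1scaled-Jacobi: {new:2d} its, rho ~ {avg_reduction(new):.2f}")
```

A value of rho closer to 1 means less reduction per iteration, which is what the larger iteration counts reflect.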

Do we change this? Or should we force OMP_NUM_THREADS=1 for make test?

Thanks,
Pierre

> On 13 Mar 2021, at 2:26 PM, Mark Adams wrote:
> 
> Hypre uses a multiplicative smoother by default. It also has a Chebyshev 
> smoother; that, with a Jacobi PC, should be thread invariant.
> Mark
> 
> On Sat, Mar 13, 2021 at 8:18 AM Pierre Jolivet wrote:
> 
>> On 13 Mar 2021, at 9:17 AM, Pierre Jolivet wrote:
>> 
>> Hello Eric,
>> I’ve made an “interesting” discovery, so I’ll put the list back in CC.
>> It appears that the following snippet of code, which uses Allreduce() + 
>> lambda function + MPI_IN_PLACE, is:
>> - Valgrind-clean with MPICH;
>> - Valgrind-clean with OpenMPI 4.0.5;
>> - not Valgrind-clean with OpenMPI 4.1.0.
>> I’m not sure who is to blame here; I’ll need to check what the MPI 
>> specification requires of implementors and users in that case.
>> 
>> In the meantime, I’ll do the following:
>> - update config/BuildSystem/config/packages/OpenMPI.py to use OpenMPI 4.1.0, 
>> see if any other error appears;
>> - provide a hotfix to bypass the segfaults;
> 
> I can confirm that splitting the single Allreduce with my own MPI_Op into two 
> Allreduce with MAX and BAND fixes the segfaults with OpenMPI (*).
> 
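A minimal sketch of the split described above, simulated in plain Python without MPI. The fused operator here is a hypothetical stand-in for the custom MPI_Op (assumed to take the maximum of the first slot and the bitwise AND of the second); the actual operator in HPDDM may differ. Each tuple plays the role of one rank's contribution.

```python
from functools import reduce


def fused_op(a, b):
    """Hypothetical custom reduction: max on slot 0, bitwise AND on slot 1."""
    return (max(a[0], b[0]), a[1] & b[1])


def allreduce_fused(per_rank):
    """One reduction over (value, mask) pairs with the fused operator."""
    return reduce(fused_op, per_rank)


def allreduce_split(per_rank):
    """Two independent reductions: MAX on the values, BAND on the masks."""
    maximum = reduce(max, (value for value, _ in per_rank))
    band = reduce(lambda x, y: x & y, (mask for _, mask in per_rank))
    return (maximum, band)


# Illustrative per-rank contributions (made-up values).
ranks = [(3, 0b1110), (7, 0b0111), (5, 0b0110), (1, 0b1111)]

# Both formulations give the same result on every rank's data.
assert allreduce_fused(ranks) == allreduce_split(ranks)
print(allreduce_split(ranks))
```

With real MPI, the split version would use the predefined MPI_MAX and MPI_BAND operations in two calls, which costs one extra collective but avoids the user-defined operator altogether.
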
>> - look at the hypre issues and whether they should be deferred to the hypre 
>> team.
> 
> I don’t know if there is something wrong in hypre threading or if it’s just a 
> side effect of threading, but it seems that the number of threads has a 
> drastic effect on the quality of the PC.
> By default, it looks like there are two threads per process with your Docker 
> image.
> If I force OMP_NUM_THREADS=1, then I get the same convergence as in the 
> output file.
> 
> Thanks,
> Pierre
> 
> (*) https://gitlab.com/petsc/petsc/-/merge_requests/3712
>> Thank you for the Docker files, they were really useful.
>> If you want to avoid oversubscription failures, you can edit the file 
>> /opt/openmpi-4.1.0/etc/openmpi-default-hostfile and append the line:
>> localhost slots=12
>> If you want to increase the timeout limit of the PETSc test suite for each 
>> test, you can add the extra flag TIMEOUT=180 to your command line (default 
>> is 60, units are seconds).
>> 
>> Thanks, I’ll ping you on GitLab when I’ve got something ready for you to try,
>> Pierre
>> 
>> <ompi.cxx>
>> 
>>> On 12 Mar 2021, at 8:54 PM, Eric Chamberland wrote:
>>> 
>>> Hi Pierre,
>>> 
>>> I now have a docker container reproducing the problems here.
>>> 
>>> Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm  it fails 
>>> like this:
>>> 
>>> not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
>>> #       Initial guess
>>> #       L_2 Error: 0.00803099
>>> #       Initial Residual
>>> #       L_2 Residual: 1.09057
>>> #       Au - b = Au + F(0)
>>> #       Linear L_2 Residual: 1.09057
>>> #       [d470c54ce086:14127] Read -1, expected 4096, errno = 1
>>> #       [d470c54ce086:14128] Read -1, expected 4096, errno = 1
>>> #       [d470c54ce086:14129] Read -1, expected 4096, errno = 1
>>> #       [3]PETSC ERROR: 
>>> ------------------------------------------------------------------------
>>> #       [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation 
>>> Violation, probably memory access out of range
>>> #       [3]PETSC ERROR: Try option -start_in_debugger or 
>>> -on_error_attach_debugger
>>> #       [3]PETSC ERROR: or see 
>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> #       [3]PETSC ERROR: or try http://valgrind.org 
>>> on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> #       [3]PETSC ERROR: likely location of problem given in stack below
>>> #       [3]PETSC ERROR: ---------------------  Stack Frames 
>>> ------------------------------------
>>> #       [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not 
>>> available,
>>> #       [3]PETSC ERROR:       INSTEAD the line number of the start of the 
>>> function
>>> #       [3]PETSC ERROR:       is given.
>>> #       [3]PETSC ERROR: [3] buildTwo line 987 
>>> /opt/petsc-main/include/HPDDM_schwarz.hpp
>>> #       [3]PETSC ERROR: [3] next line 1130 
>>> /opt/petsc-main/include/HPDDM_schwarz.hpp
>>> #       [3]PETSC ERROR: --------------------- Error Message 
>>> --------------------------------------------------------------
>>> #       [3]PETSC ERROR: Signal received
>>> #       [3]PETSC ERROR: [0]PETSC ERROR: 
>>> ------------------------------------------------------------------------
>>> 
>>> Also, ex12_quad_hpddm_reuse_baij fails with many more "Read -1, expected 
>>> ..." messages, and I don't know where they come from.
>>> 
>>> Hypre (like in diff-snes_tutorials-ex56_hypre)  is also having 
>>> DIVERGED_INDEFINITE_PC failures...
>>> 
>>> Please see the 3 attached docker files:
>>> 
>>> 1) fedora_mkl_and_devtools: the Dockerfile which installs Fedora 33 with 
>>> GNU compilers, MKL, and everything needed to develop.
>>> 
>>> 2) openmpi: the Dockerfile to build OpenMPI
>>> 
>>> 3) petsc: the last Dockerfile, which builds, installs, and tests PETSc
>>> 
>>> I build the 3 like this:
>>> 
>>> docker build -t fedora_mkl_and_devtools -f fedora_mkl_and_devtools .
>>> 
>>> docker build -t openmpi -f openmpi .
>>> 
>>> docker build -t petsc -f petsc .
>>> 
>>> Disclaimer: I am not a Docker expert, so I may do things that are not 
>>> Docker state-of-the-art, but I am open to suggestions... ;)
>>> 
>>> I have just run it on my laptop (it took a long time), which does not have 
>>> enough cores, so many more tests failed (I should force --oversubscribe but 
>>> don't know how to).  I will relaunch on my workstation in a few minutes.
>>> 
>>> I will now test your branch! (sorry for the delay).
>>> 
>>> Thanks,
>>> 
>>> Eric
>>> 
>>> On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
>>>> Hi Pierre,
>>>> 
>>>> ok, that's interesting!
>>>> 
>>>> I will try to build a Docker image by tomorrow and give you the exact 
>>>> recipe to reproduce the bugs.
>>>> 
>>>> Eric
>>>> 
>>>> 
>>>> 
>>>> On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
>>>>> 
>>>>> 
>>>>>> On 11 Mar 2021, at 6:16 AM, Barry Smith wrote:
>>>>>> 
>>>>>> 
>>>>>>   Eric,
>>>>>> 
>>>>>>    Sorry about not being more immediate. We still have this in our 
>>>>>> active email so you don't need to submit individual issues. We'll try to 
>>>>>> get to them as soon as we can.
>>>>> 
>>>>> Indeed, I’m still trying to figure this out.
>>>>> I realized that some of my configure flags were different than yours, 
>>>>> e.g., no --with-memalign.
>>>>> I’ve also added SuperLU_DIST to my installation.
>>>>> Still, I can’t reproduce any issue.
>>>>> I will continue looking into this. It appears I’m seeing some valgrind 
>>>>> errors, but I don’t know if this is a side effect of OpenMPI not being 
>>>>> valgrind-clean (last time I checked, there was no error with MPICH).
>>>>> 
>>>>> Thank you for your patience,
>>>>> Pierre
>>>>> 
>>>>> /usr/bin/gmake -f gmakefile test test-fail=1
>>>>> Using MAKEFLAGS: test-fail=1
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
>>>>>  ok snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>>  ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
>>>>>  ok ksp_ksp_tests-ex33_superlu_dist_2
>>>>>  ok diff-ksp_ksp_tests-ex33_superlu_dist_2
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
>>>>>  ok ksp_ksp_tutorials-ex50_tut_2
>>>>>  ok diff-ksp_ksp_tutorials-ex50_tut_2
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
>>>>>  ok ksp_ksp_tests-ex33_superlu_dist
>>>>>  ok diff-ksp_ksp_tests-ex33_superlu_dist
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
>>>>>  ok snes_tutorials-ex56_hypre
>>>>>  ok diff-snes_tutorials-ex56_hypre
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
>>>>>  ok ksp_ksp_tutorials-ex56_2
>>>>>  ok diff-ksp_ksp_tutorials-ex56_2
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
>>>>>  ok snes_tutorials-ex17_3d_q3_trig_elas
>>>>>  ok diff-snes_tutorials-ex17_3d_q3_trig_elas
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
>>>>>  ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>>  ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
>>>>> # srun: error: Unable to create step for job 1426755: More processors 
>>>>> requested than permitted
>>>>>  ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command failed so no diff
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
>>>>>  ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran required for this 
>>>>> test
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
>>>>>  ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>>  ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
>>>>>  ok snes_tutorials-ex19_tut_3
>>>>>  ok diff-snes_tutorials-ex19_tut_3
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
>>>>>  ok snes_tutorials-ex17_3d_q3_trig_vlap
>>>>>  ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
>>>>>  ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran required for 
>>>>> this test
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
>>>>>  ok snes_tutorials-ex19_superlu_dist
>>>>>  ok diff-snes_tutorials-ex19_superlu_dist
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
>>>>>  ok snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>>  ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
>>>>>  ok ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>>  ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
>>>>>  ok snes_tutorials-ex19_superlu_dist_2
>>>>>  ok diff-snes_tutorials-ex19_superlu_dist_2
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
>>>>> # srun: error: Unable to create step for job 1426755: More processors 
>>>>> requested than permitted
>>>>>  ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command failed so no diff
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
>>>>>  ok snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>>  ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
>>>>>  ok ksp_ksp_tutorials-ex64_1
>>>>>  ok diff-ksp_ksp_tutorials-ex64_1
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
>>>>> # srun: error: Unable to create step for job 1426755: More processors 
>>>>> requested than permitted
>>>>>  ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command failed so no diff
>>>>>         TEST 
>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
>>>>>  ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran required for 
>>>>> this test
>>>>> 
>>>>>>    Barry
>>>>>> 
>>>>>> 
>>>>>>> On Mar 10, 2021, at 11:03 PM, Eric Chamberland wrote:
>>>>>>> 
>>>>>>> Barry,
>>>>>>> 
>>>>>>> to get some follow-up on the --with-openmp=1 failures, shall I open 
>>>>>>> gitlab issues for:
>>>>>>> 
>>>>>>> a) all hypre failures giving DIVERGED_INDEFINITE_PC
>>>>>>> 
>>>>>>> b) all superlu_dist failures giving different results with initia and 
>>>>>>> "Exceeded timeout limit of 60 s"
>>>>>>> 
>>>>>>> c) hpddm failures "free(): invalid next size (fast)" and "Segmentation 
>>>>>>> Violation"
>>>>>>> 
>>>>>>> d) all tao's "Exceeded timeout limit of 60 s"
>>>>>>> 
>>>>>>> I don't see how I could do all this debugging by myself...
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Eric
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> -- 
>>>> Eric Chamberland, ing., M. Ing
>>>> Professionnel de recherche
>>>> GIREF/Université Laval
>>>> (418) 656-2131 poste 41 22 42
>>> -- 
>>> Eric Chamberland, ing., M. Ing
>>> Professionnel de recherche
>>> GIREF/Université Laval
>>> (418) 656-2131 poste 41 22 42
>>> <fedora_mkl_and_devtools.txt><openmpi.txt><petsc.txt>
>> 
> 
