Eric

You should report these HYPRE issues upstream at
https://github.com/hypre-space/hypre/issues


> On Mar 14, 2021, at 3:44 AM, Eric Chamberland 
> <[email protected]> wrote:
> 
> For us it clearly creates problems in real computations...
> 
> I understand the need to have clean tests for PETSc, but for me, it reveals 
> that hypre isn't usable with more than one thread for now...
> 
> Another solution:  force single-threaded configuration for hypre until this 
> is fixed?
> 
> Eric
> 
> On 2021-03-13 8:50 a.m., Pierre Jolivet wrote:
>> -pc_hypre_boomeramg_relax_type_all Jacobi => 
>>   Linear solve did not converge due to DIVERGED_INDEFINITE_PC iterations 3
>> -pc_hypre_boomeramg_relax_type_all l1scaled-Jacobi => 
>> OK, independently of the architecture it seems (Eric's Docker image with 1 or 
>> 2 threads or my macOS), but the contraction factor is higher
>>   Linear solve converged due to CONVERGED_RTOL iterations 8
>>   Linear solve converged due to CONVERGED_RTOL iterations 24
>>   Linear solve converged due to CONVERGED_RTOL iterations 26
>> vs. currently
>>   Linear solve converged due to CONVERGED_RTOL iterations 7
>>   Linear solve converged due to CONVERGED_RTOL iterations 9
>>   Linear solve converged due to CONVERGED_RTOL iterations 10
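One rough way to read the iteration counts above is to convert them into an average contraction factor per iteration. The sketch below assumes the relative tolerance is PETSc's default of 1e-5 (the thread does not state the tolerance this test uses), and `average_contraction` is a hypothetical helper name. It gives an estimate only, since the residual at convergence may be well below the tolerance.

```python
def average_contraction(iterations, rtol=1e-5):
    """Average per-iteration residual reduction needed to reach rtol.

    Assumption: rtol=1e-5 is PETSc's default relative tolerance; the
    test discussed in the thread may use a different value.
    """
    return rtol ** (1.0 / iterations)


# Iteration counts quoted in the message above.
l1_scaled_jacobi = [8, 24, 26]
current_default = [7, 9, 10]

for name, counts in (("l1scaled-Jacobi", l1_scaled_jacobi),
                     ("current", current_default)):
    factors = [round(average_contraction(k), 3) for k in counts]
    print(name, factors)
```

A factor closer to 1 means each iteration removes less of the residual, which is what "contraction factor is higher" refers to.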
>> 
>> Do we change this? Or should we force OMP_NUM_THREADS=1 for make test?
>> 
>> Thanks,
>> Pierre
>> 
>>> On 13 Mar 2021, at 2:26 PM, Mark Adams <[email protected]> wrote:
>>> 
>>> Hypre uses a multiplicative smoother by default. It has a Chebyshev 
>>> smoother. That with a Jacobi PC should be thread invariant.
>>> Mark
>>> 
>>> On Sat, Mar 13, 2021 at 8:18 AM Pierre Jolivet <[email protected]> wrote:
>>> 
>>>> On 13 Mar 2021, at 9:17 AM, Pierre Jolivet <[email protected]> wrote:
>>>> 
>>>> Hello Eric,
>>>> I’ve made an “interesting” discovery, so I’ll put the list back in Cc.
>>>> It appears the following snippet of code which uses Allreduce() + lambda 
>>>> function + MPI_IN_PLACE is:
>>>> - Valgrind-clean with MPICH;
>>>> - Valgrind-clean with OpenMPI 4.0.5;
>>>> - not Valgrind-clean with OpenMPI 4.1.0.
>>>> I’m not sure who is to blame here; I’ll need to look at the MPI 
>>>> specification for what is required of implementors and users in that 
>>>> case.
>>>> 
>>>> In the meantime, I’ll do the following:
>>>> - update config/BuildSystem/config/packages/OpenMPI.py to use OpenMPI 
>>>> 4.1.0, see if any other error appears;
>>>> - provide a hotfix to bypass the segfaults;
>>> 
>>> I can confirm that splitting the single Allreduce with my own MPI_Op into 
>>> two Allreduce calls with MAX and BAND fixes the segfaults with OpenMPI (*).
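The reason such a split can give the same result is that a reduction acting independently on separate slots of a buffer can be replaced by one reduction per slot. The sketch below illustrates this in plain Python without calling MPI. The buffer layout (one slot reduced with max, one with bitwise AND) and all function names are assumptions made for illustration, not the actual HPDDM code.

```python
from functools import reduce


def combined_op(a, b):
    """Hypothetical user-defined reduction: max on slot 0, bitwise AND on slot 1."""
    return (max(a[0], b[0]), a[1] & b[1])


def single_reduction(contributions):
    """Stand-in for one Allreduce with a custom operator."""
    return reduce(combined_op, contributions)


def split_reduction(contributions):
    """Stand-in for two Allreduce calls, one with MAX and one with BAND."""
    maximum = reduce(max, (c[0] for c in contributions))
    band = reduce(lambda x, y: x & y, (c[1] for c in contributions))
    return (maximum, band)


# Made-up per-rank contributions: (value to maximize, bit mask to AND).
ranks = [(3, 0b1110), (7, 0b0111), (5, 0b0110)]

assert single_reduction(ranks) == split_reduction(ranks)
print(split_reduction(ranks))
```

The split costs one extra collective call, but it relies only on predefined reduction operators.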
>>> 
>>>> - look at the hypre issues and whether they should be deferred to the hypre 
>>>> team.
>>> 
>>> I don’t know if there is something wrong in hypre threading or if it’s just 
>>> a side effect of threading, but it seems that the number of threads has a 
>>> drastic effect on the quality of the PC.
>>> By default, it looks like there are two threads per process with your 
>>> Docker image.
>>> If I force OMP_NUM_THREADS=1, then I get the same convergence as in the 
>>> output file.
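A minimal sketch of forcing a single OpenMP thread for a child process, assuming the command is launched from a Python script. `run_single_threaded` is a hypothetical helper, and the child command here only echoes the variable back so the effect can be checked.

```python
import os
import subprocess
import sys


def run_single_threaded(command):
    """Run `command` with OMP_NUM_THREADS=1 added to a copy of the environment."""
    env = dict(os.environ, OMP_NUM_THREADS="1")
    return subprocess.run(command, env=env, capture_output=True,
                          text=True, check=True)


# Stand-in command that prints the value the child process sees.
result = run_single_threaded(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
)
print(result.stdout.strip())
```

For the test suite, the stand-in command would be replaced by the actual invocation, for example `["gmake", "-f", "gmakefile", "test"]` as used elsewhere in this thread.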
>>> 
>>> Thanks,
>>> Pierre
>>> 
>>> (*) https://gitlab.com/petsc/petsc/-/merge_requests/3712
>>>> Thank you for the Docker files, they were really useful.
>>>> If you want to avoid oversubscription failures, you can edit the file 
>>>> /opt/openmpi-4.1.0/etc/openmpi-default-hostfile and append the line:
>>>> localhost slots=12
>>>> If you want to increase the timeout limit of PETSc test suite for each 
>>>> test, you can add the extra flag in your command line TIMEOUT=180 (default 
>>>> is 60, units are seconds).
>>>> 
>>>> Thanks, I’ll ping you on GitLab when I’ve got something ready for you to 
>>>> try,
>>>> Pierre
>>>> 
>>>> <ompi.cxx>
>>>> 
>>>>> On 12 Mar 2021, at 8:54 PM, Eric Chamberland 
>>>>> <[email protected]> wrote:
>>>>> 
>>>>> Hi Pierre,
>>>>> 
>>>>> I now have a docker container reproducing the problems here.
>>>>> 
>>>>> Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm  it fails 
>>>>> like this:
>>>>> 
>>>>> not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
>>>>> #       Initial guess
>>>>> #       L_2 Error: 0.00803099
>>>>> #       Initial Residual
>>>>> #       L_2 Residual: 1.09057
>>>>> #       Au - b = Au + F(0)
>>>>> #       Linear L_2 Residual: 1.09057
>>>>> #       [d470c54ce086:14127] Read -1, expected 4096, errno = 1
>>>>> #       [d470c54ce086:14128] Read -1, expected 4096, errno = 1
>>>>> #       [d470c54ce086:14129] Read -1, expected 4096, errno = 1
>>>>> #       [3]PETSC ERROR: 
>>>>> ------------------------------------------------------------------------
>>>>> #       [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation 
>>>>> Violation, probably memory access out of range
>>>>> #       [3]PETSC ERROR: Try option -start_in_debugger or 
>>>>> -on_error_attach_debugger
>>>>> #       [3]PETSC ERROR: or see 
>>>>> https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>> #       [3]PETSC ERROR: or try http://valgrind.org 
>>>>> on GNU/linux and Apple Mac OS X to find memory corruption errors
>>>>> #       [3]PETSC ERROR: likely location of problem given in stack below
>>>>> #       [3]PETSC ERROR: ---------------------  Stack Frames 
>>>>> ------------------------------------
>>>>> #       [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not 
>>>>> available,
>>>>> #       [3]PETSC ERROR:       INSTEAD the line number of the start of the 
>>>>> function
>>>>> #       [3]PETSC ERROR:       is given.
>>>>> #       [3]PETSC ERROR: [3] buildTwo line 987 
>>>>> /opt/petsc-main/include/HPDDM_schwarz.hpp
>>>>> #       [3]PETSC ERROR: [3] next line 1130 
>>>>> /opt/petsc-main/include/HPDDM_schwarz.hpp
>>>>> #       [3]PETSC ERROR: --------------------- Error Message 
>>>>> --------------------------------------------------------------
>>>>> #       [3]PETSC ERROR: Signal received
>>>>> #       [3]PETSC ERROR: [0]PETSC ERROR: 
>>>>> ------------------------------------------------------------------------
>>>>> 
>>>>> Also, ex12_quad_hpddm_reuse_baij fails with many more "Read -1, expected 
>>>>> ..." messages, and I don't know where they come from.
>>>>> 
>>>>> Hypre (like in diff-snes_tutorials-ex56_hypre)  is also having 
>>>>> DIVERGED_INDEFINITE_PC failures...
>>>>> 
>>>>> Please see the 3 attached docker files:
>>>>> 
>>>>> 1) fedora_mkl_and_devtools: the Dockerfile which installs Fedora 33 with 
>>>>> GNU compilers, MKL, and everything needed to develop.
>>>>> 
>>>>> 2) openmpi: the Dockerfile to build OpenMPI
>>>>> 
>>>>> 3) petsc: the last Dockerfile, which builds, installs, and tests PETSc
>>>>> 
>>>>> I build the 3 like this:
>>>>> 
>>>>> docker build -t fedora_mkl_and_devtools -f fedora_mkl_and_devtools .
>>>>> 
>>>>> docker build -t openmpi -f openmpi .
>>>>> 
>>>>> docker build -t petsc -f petsc .
>>>>> 
>>>>> Disclaimer: I am not a Docker expert, so I may do things that are not 
>>>>> Docker state of the art, but I am open to suggestions... ;)
>>>>> 
>>>>> I have just run it on my laptop (it took a long time), which does not have 
>>>>> enough cores, so many more tests failed (I should force --oversubscribe but 
>>>>> don't know how to).  I will relaunch on my workstation in a few minutes.
>>>>> 
>>>>> I will now test your branch! (sorry for the delay).
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Eric
>>>>> 
>>>>> On 2021-03-11 9:03 a.m., Eric Chamberland wrote:
>>>>>> Hi Pierre,
>>>>>> 
>>>>>> ok, that's interesting!
>>>>>> 
>>>>>> I will try to build a Docker image by tomorrow and give you the exact 
>>>>>> recipe to reproduce the bugs.
>>>>>> 
>>>>>> Eric
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On 11 Mar 2021, at 6:16 AM, Barry Smith <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>   Eric,
>>>>>>>> 
>>>>>>>>    Sorry about not being more immediate. We still have this in our 
>>>>>>>> active email so you don't need to submit individual issues. We'll try 
>>>>>>>> to get to them as soon as we can.
>>>>>>> 
>>>>>>> Indeed, I’m still trying to figure this out.
>>>>>>> I realized that some of my configure flags were different than yours, 
>>>>>>> e.g., no --with-memalign.
>>>>>>> I’ve also added SuperLU_DIST to my installation.
>>>>>>> Still, I can’t reproduce any issue.
>>>>>>> I will continue looking into this. It appears I’m seeing some valgrind 
>>>>>>> errors, but I don’t know if this is some side effect of OpenMPI not 
>>>>>>> being valgrind-clean (last time I checked, there was no error with 
>>>>>>> MPICH).
>>>>>>> 
>>>>>>> Thank you for your patience,
>>>>>>> Pierre
>>>>>>> 
>>>>>>> /usr/bin/gmake -f gmakefile test test-fail=1
>>>>>>> Using MAKEFLAGS: test-fail=1
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
>>>>>>>  ok snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>>>>  ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
>>>>>>>  ok ksp_ksp_tests-ex33_superlu_dist_2
>>>>>>>  ok diff-ksp_ksp_tests-ex33_superlu_dist_2
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
>>>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
>>>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
>>>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
>>>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
>>>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
>>>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
>>>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
>>>>>>>  ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>>>>  ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
>>>>>>>  ok ksp_ksp_tutorials-ex50_tut_2
>>>>>>>  ok diff-ksp_ksp_tutorials-ex50_tut_2
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
>>>>>>>  ok ksp_ksp_tests-ex33_superlu_dist
>>>>>>>  ok diff-ksp_ksp_tests-ex33_superlu_dist
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
>>>>>>>  ok snes_tutorials-ex56_hypre
>>>>>>>  ok diff-snes_tutorials-ex56_hypre
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
>>>>>>>  ok ksp_ksp_tutorials-ex56_2
>>>>>>>  ok diff-ksp_ksp_tutorials-ex56_2
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
>>>>>>>  ok snes_tutorials-ex17_3d_q3_trig_elas
>>>>>>>  ok diff-snes_tutorials-ex17_3d_q3_trig_elas
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
>>>>>>>  ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>>>>  ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
>>>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
>>>>>>> #       srun: error: Unable to create step for job 1426755: More 
>>>>>>> processors requested than permitted
>>>>>>>  ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command failed so no 
>>>>>>> diff
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
>>>>>>>  ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran required for 
>>>>>>> this test
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
>>>>>>>  ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>>>>  ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
>>>>>>>  ok snes_tutorials-ex19_tut_3
>>>>>>>  ok diff-snes_tutorials-ex19_tut_3
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
>>>>>>>  ok snes_tutorials-ex17_3d_q3_trig_vlap
>>>>>>>  ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
>>>>>>>  ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran required for 
>>>>>>> this test
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
>>>>>>>  ok snes_tutorials-ex19_superlu_dist
>>>>>>>  ok diff-snes_tutorials-ex19_superlu_dist
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
>>>>>>>  ok snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>>>>  ok 
>>>>>>> diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
>>>>>>>  ok ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>>>>  ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
>>>>>>>  ok snes_tutorials-ex19_superlu_dist_2
>>>>>>>  ok diff-snes_tutorials-ex19_superlu_dist_2
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
>>>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
>>>>>>> #       srun: error: Unable to create step for job 1426755: More 
>>>>>>> processors requested than permitted
>>>>>>>  ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command failed so no 
>>>>>>> diff
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
>>>>>>>  ok snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>>>>  ok 
>>>>>>> diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
>>>>>>>  ok ksp_ksp_tutorials-ex64_1
>>>>>>>  ok diff-ksp_ksp_tutorials-ex64_1
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
>>>>>>> not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
>>>>>>> #       srun: error: Unable to create step for job 1426755: More 
>>>>>>> processors requested than permitted
>>>>>>>  ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command failed so no diff
>>>>>>>         TEST 
>>>>>>> arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
>>>>>>>  ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran required for 
>>>>>>> this test
>>>>>>> 
>>>>>>>>    Barry
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Mar 10, 2021, at 11:03 PM, Eric Chamberland 
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Barry,
>>>>>>>>> 
>>>>>>>>> To get some follow-up on the --with-openmp=1 failures, shall I open 
>>>>>>>>> GitLab issues for:
>>>>>>>>> 
>>>>>>>>> a) all hypre failures giving DIVERGED_INDEFINITE_PC
>>>>>>>>> 
>>>>>>>>> b) all superlu_dist failures giving different results with initia and 
>>>>>>>>> "Exceeded timeout limit of 60 s"
>>>>>>>>> 
>>>>>>>>> c) hpddm failures "free(): invalid next size (fast)" and 
>>>>>>>>> "Segmentation Violation"
>>>>>>>>> 
>>>>>>>>> d) all tao's "Exceeded timeout limit of 60 s"
>>>>>>>>> 
>>>>>>>>> I don't see how I could do all this debugging by myself...
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Eric
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> -- 
>>>>>> Eric Chamberland, ing., M. Ing
>>>>>> Professionnel de recherche
>>>>>> GIREF/Université Laval
>>>>>> (418) 656-2131 poste 41 22 42
>>>>> <fedora_mkl_and_devtools.txt><openmpi.txt><petsc.txt>
>>>> 
>>> 
>> 
