Hi Pierre,

I now have a docker container reproducing the problems here.

Actually, if I look at snes_tutorials-ex12_quad_singular_hpddm it fails like this:

not ok snes_tutorials-ex12_quad_singular_hpddm # Error code: 59
#       Initial guess
#       L_2 Error: 0.00803099
#       Initial Residual
#       L_2 Residual: 1.09057
#       Au - b = Au + F(0)
#       Linear L_2 Residual: 1.09057
#       [d470c54ce086:14127] Read -1, expected 4096, errno = 1
#       [d470c54ce086:14128] Read -1, expected 4096, errno = 1
#       [d470c54ce086:14129] Read -1, expected 4096, errno = 1
#       [3]PETSC ERROR: ------------------------------------------------------------------------
#       [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
#       [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
#       [3]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
#       [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
#       [3]PETSC ERROR: likely location of problem given in stack below
#       [3]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
#       [3]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
#       [3]PETSC ERROR:       INSTEAD the line number of the start of the function
#       [3]PETSC ERROR:       is given.
#       [3]PETSC ERROR: [3] buildTwo line 987 /opt/petsc-main/include/HPDDM_schwarz.hpp
#       [3]PETSC ERROR: [3] next line 1130 /opt/petsc-main/include/HPDDM_schwarz.hpp
#       [3]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
#       [3]PETSC ERROR: Signal received
#       [3]PETSC ERROR: [0]PETSC ERROR: ------------------------------------------------------------------------

Also, ex12_quad_hpddm_reuse_baij fails with many more "Read -1, expected ..." messages, and I don't know where they come from.
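
One possible explanation, offered as an assumption rather than a diagnosis: errno = 1 is EPERM, and Open MPI's shared-memory transport can print "Read -1, expected ..." when cross-memory attach (CMA, enabled here through --with-cma) is denied inside a container. A minimal sketch of a workaround, assuming the Open MPI 4.x vader BTL, is to disable the single-copy mechanism through an MCA environment variable before running the tests:

```shell
# Assumption: the "Read -1, expected ..." lines come from Open MPI's vader BTL
# failing to use CMA inside the container (errno 1 = EPERM).
# With "none", vader falls back to copying through shared memory.
export OMPI_MCA_btl_vader_single_copy_mechanism=none
echo "single-copy mechanism: ${OMPI_MCA_btl_vader_single_copy_mechanism}"
```

Since the tests run during the image build, the same variable could also be set with an ENV line in the petsc Dockerfile. Whether this also removes the segmentation violation is not something this sketch can establish.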

Hypre (as in diff-snes_tutorials-ex56_hypre) also shows DIVERGED_INDEFINITE_PC failures...

Please see the 3 attached Dockerfiles:

1) fedora_mkl_and_devtools: the Dockerfile that installs Fedora 33 with the GNU compilers, MKL, and everything needed for development.

2) openmpi: the Dockerfile that builds OpenMPI.

3) petsc: the last Dockerfile, which builds, installs, and tests PETSc.

I build the 3 like this:

docker build -t fedora_mkl_and_devtools -f fedora_mkl_and_devtools .

docker build -t openmpi -f openmpi .

docker build -t petsc -f petsc .
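
The three builds depend on each other in that order (each image is the base of the next), so they can be wrapped in a small helper that stops at the first failure. This is only a sketch; the function name build_images and the DOCKER override variable are made up for illustration:

```shell
# Build the three images in dependency order, stopping at the first failure.
# DOCKER is a hypothetical override so the commands can be previewed
# (for example DOCKER=echo) without actually building anything.
build_images() {
  runner="${DOCKER:-docker}"
  for name in fedora_mkl_and_devtools openmpi petsc; do
    "$runner" build -t "$name" -f "$name" . || return 1
  done
}

# Usage: build_images            (runs the real builds)
#        DOCKER=echo build_images  (prints the three commands instead)
```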

Disclaimer: I am not a Docker expert, so I may do things that are not state of the art in Docker, but I am open to suggestions... ;)

I have just run it on my laptop (it takes a long time), which does not have enough cores, so many more tests failed (I should force --oversubscribe but don't know how to). I will relaunch on my workstation in a few minutes.
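
On forcing oversubscription, a minimal sketch, assuming Open MPI 4.x as built by the openmpi Dockerfile: MCA parameters can be set through environment variables prefixed with OMPI_MCA_, so the effect of passing --oversubscribe to mpiexec can be obtained without touching the test harness.

```shell
# Assumption: Open MPI 4.x, where rmaps_base_oversubscribe is the MCA
# parameter set by mpiexec's --oversubscribe option.
export OMPI_MCA_rmaps_base_oversubscribe=1
echo "online cores: $(getconf _NPROCESSORS_ONLN), oversubscribe: ${OMPI_MCA_rmaps_base_oversubscribe}"
```

Inside the petsc Dockerfile the same setting could be added to the existing ENV instruction next to OMPI_ALLOW_RUN_AS_ROOT.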

I will now test your branch! (sorry for the delay).

Thanks,

Eric

On 2021-03-11 9:03 a.m., Eric Chamberland wrote:

Hi Pierre,

ok, that's interesting!

I will try to build a docker image by tomorrow and give you the exact recipe to reproduce the bugs.

Eric


On 2021-03-11 2:46 a.m., Pierre Jolivet wrote:


On 11 Mar 2021, at 6:16 AM, Barry Smith <[email protected]> wrote:


  Eric,

   Sorry about not being more immediate. We still have this in our active email so you don't need to submit individual issues. We'll try to get to them as soon as we can.

Indeed, I’m still trying to figure this out.
I realized that some of my configure flags were different than yours, e.g., no --with-memalign.
I’ve also added SuperLU_DIST to my installation.
Still, I can’t reproduce any issue.
I will continue looking into this. It appears I’m seeing some valgrind errors, but I don’t know whether this is a side effect of OpenMPI not being valgrind-clean (last time I checked, there was no error with MPICH).

Thank you for your patience,
Pierre

/usr/bin/gmake -f gmakefile test test-fail=1
Using MAKEFLAGS: test-fail=1
        TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_baij.counts
 ok snes_tutorials-ex12_quad_hpddm_reuse_baij
 ok diff-snes_tutorials-ex12_quad_hpddm_reuse_baij
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist_2.counts
 ok ksp_ksp_tests-ex33_superlu_dist_2
 ok diff-ksp_ksp_tests-ex33_superlu_dist_2
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex49_superlu_dist.counts
 ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
 ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-0
 ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
 ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-0_conv-1
 ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
 ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-0
 ok ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
 ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-1herm-1_conv-1
 ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
 ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-0
 ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
 ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-0_conv-1
 ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
 ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-0
 ok ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
 ok diff-ksp_ksp_tests-ex49_superlu_dist+nsize-4herm-1_conv-1
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex50_tut_2.counts
 ok ksp_ksp_tutorials-ex50_tut_2
 ok diff-ksp_ksp_tutorials-ex50_tut_2
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tests-ex33_superlu_dist.counts
 ok ksp_ksp_tests-ex33_superlu_dist
 ok diff-ksp_ksp_tests-ex33_superlu_dist
        TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_hypre.counts
 ok snes_tutorials-ex56_hypre
 ok diff-snes_tutorials-ex56_hypre
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex56_2.counts
 ok ksp_ksp_tutorials-ex56_2
 ok diff-ksp_ksp_tutorials-ex56_2
        TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_elas.counts
 ok snes_tutorials-ex17_3d_q3_trig_elas
 ok diff-snes_tutorials-ex17_3d_q3_trig_elas
        TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij.counts
 ok snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
 ok diff-snes_tutorials-ex12_quad_hpddm_reuse_threshold_baij
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_3.counts
not ok ksp_ksp_tutorials-ex5_superlu_dist_3 # Error code: 1
#srun: error: Unable to create step for job 1426755: More processors requested than permitted
 ok ksp_ksp_tutorials-ex5_superlu_dist_3 # SKIP Command failed so no diff
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist.counts
 ok ksp_ksp_tutorials-ex5f_superlu_dist # SKIP Fortran required for this test
        TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex12_tri_parmetis_hpddm_baij.counts
 ok snes_tutorials-ex12_tri_parmetis_hpddm_baij
 ok diff-snes_tutorials-ex12_tri_parmetis_hpddm_baij
        TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_tut_3.counts
 ok snes_tutorials-ex19_tut_3
 ok diff-snes_tutorials-ex19_tut_3
        TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex17_3d_q3_trig_vlap.counts
 ok snes_tutorials-ex17_3d_q3_trig_vlap
 ok diff-snes_tutorials-ex17_3d_q3_trig_vlap
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_3.counts
 ok ksp_ksp_tutorials-ex5f_superlu_dist_3 # SKIP Fortran required for this test
        TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist.counts
 ok snes_tutorials-ex19_superlu_dist
 ok diff-snes_tutorials-ex19_superlu_dist
        TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre.counts
 ok snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
 ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-1_bddc_approx_hypre
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex49_hypre_nullspace.counts
 ok ksp_ksp_tutorials-ex49_hypre_nullspace
 ok diff-ksp_ksp_tutorials-ex49_hypre_nullspace
        TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex19_superlu_dist_2.counts
 ok snes_tutorials-ex19_superlu_dist_2
 ok diff-snes_tutorials-ex19_superlu_dist_2
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist_2.counts
not ok ksp_ksp_tutorials-ex5_superlu_dist_2 # Error code: 1
#srun: error: Unable to create step for job 1426755: More processors requested than permitted
 ok ksp_ksp_tutorials-ex5_superlu_dist_2 # SKIP Command failed so no diff
        TEST arch-linux2-c-opt-ompi/tests/counts/snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre.counts
 ok snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
 ok diff-snes_tutorials-ex56_attach_mat_nearnullspace-0_bddc_approx_hypre
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex64_1.counts
 ok ksp_ksp_tutorials-ex64_1
 ok diff-ksp_ksp_tutorials-ex64_1
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5_superlu_dist.counts
not ok ksp_ksp_tutorials-ex5_superlu_dist # Error code: 1
#srun: error: Unable to create step for job 1426755: More processors requested than permitted
 ok ksp_ksp_tutorials-ex5_superlu_dist # SKIP Command failed so no diff
        TEST arch-linux2-c-opt-ompi/tests/counts/ksp_ksp_tutorials-ex5f_superlu_dist_2.counts
 ok ksp_ksp_tutorials-ex5f_superlu_dist_2 # SKIP Fortran required for this test
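
In the output above, the only failures are the three ex5_superlu_dist runs rejected by srun. When comparing long logs like this one between two machines, it can help to reduce each log to the names of the failing tests. A minimal sketch, assuming the output of "make test" was saved to a file (the function name failed_tests is made up for illustration):

```shell
# Print the unique names of tests reported as "not ok" in a saved test log.
# $1: path to the saved output of "make test"
failed_tests() {
  # Keep only lines starting with "not ok" and extract the test name after it.
  sed -n 's/^ *not ok \([^ ]*\).*/\1/p' "$1" | sort -u
}

# Usage: failed_tests make_test.log
```

The two resulting lists can then be compared with diff to see which failures are specific to one configuration.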

   Barry


On Mar 10, 2021, at 11:03 PM, Eric Chamberland <[email protected]> wrote:

Barry,

To get some follow-up on the --with-openmp=1 failures, shall I open GitLab issues for:

a) all hypre failures giving DIVERGED_INDEFINITE_PC

b) all superlu_dist failures giving different results with initia and "Exceeded timeout limit of 60 s"

c) hpddm failures "free(): invalid next size (fast)" and "Segmentation Violation"

d) all tao's "Exceeded timeout limit of 60 s"

I don't see how I could do all this debugging by myself...
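
To size each of the four categories above before opening issues, the messages quoted in the list can be counted in a saved test log. This is only a sketch: the function name tally_failures is made up, and it counts matching lines, not distinct tests.

```shell
# Count the lines of a test log that match each failure message listed above.
# $1: path to a test log such as make_test.log
tally_failures() {
  log="$1"
  for pattern in \
    "DIVERGED_INDEFINITE_PC" \
    "Exceeded timeout limit of 60 s" \
    "free(): invalid next size (fast)" \
    "Segmentation Violation"
  do
    # -F: fixed string, -c: count matching lines.
    # grep exits with status 1 when the count is 0, hence "|| true".
    count=$(grep -cF -- "$pattern" "$log" || true)
    echo "${pattern}: ${count}"
  done
}

# Usage: tally_failures make_test.log
```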

Thanks,

Eric




--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42


# Base image.
FROM fedora:33

SHELL ["/bin/bash", "-c"]

WORKDIR /

# Intel oneAPI repo configuration and other packages for compiling OpenMPI and PETSc:
# (see https://software.intel.com/content/www/us/en/develop/articles/installing-intel-oneapi-toolkits-via-yum.html)

## Set the time zone in the container:
ENV TZ=America/New_York

RUN \
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime \
  && \
echo $TZ > /etc/timezone \
  && \
echo "LC_ALL=en_US.UTF-8" >> /etc/environment \
  &&  \
echo "en_US.UTF-8 UTF-8"  >> /etc/locale.gen \
  &&   \
echo "LANG=en_US.UTF-8"   >  /etc/locale.conf \
  &&  \
source /etc/locale.conf \
  && \
echo -e "\
[oneAPI]\n\
name=Intel(R) oneAPI repository\n\
baseurl=https://yum.repos.intel.com/oneapi\n\
enabled=1\n\
gpgcheck=1\n\
repo_gpgcheck=1\n\
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB\n\
" > /etc/yum.repos.d/oneAPI.repo \
  &&\
dnf install -y \
   authconfig \
   autoconf \
   binutils \
   bison \
   blas-devel \
   ccache \
   clang \
   cmake \
   flex \
   gcc-c++ \
   gcc-gfortran \
   gdb \
   git \
   glibc-langpack-en \
   gnuplot \
   intel-oneapi-mkl-devel \
   libtool \
   libtirpc-devel \
   libXext-devel \
   libX11-devel \
   make \
   nfs-utils \
   numactl-libs \
   numactl-devel \
   nscd \
   perl \
   "perl(Data::Dumper)" \
   pkg-config \
   procps-ng \
   python2 \
   python2-six \
   python \
   screen \
   tar \
   time \
   valgrind \
   vim \
   wget \
   xorg-x11-apps \
  && \
dnf clean all

#   intel-hpckit \

# Run a command when the container starts.
#CMD ["/bin/bash"]

# Base image.
FROM fedora_mkl_and_devtools:latest

SHELL ["/bin/bash", "-c"]

WORKDIR /

ARG ompi_ver=openmpi-4.1.0
ARG ompi_tar=${ompi_ver}.tar.gz
ARG ompi_rep_dest=/opt/${ompi_ver}

ENV MPIdir=${ompi_rep_dest}

RUN \
wget https://www.open-mpi.org/software/ompi/v4.1/downloads/${ompi_tar} \
&& \
tar -xvf ${ompi_tar} \
&& \
cd ${ompi_ver} \
&& \
./configure \
   --prefix=${ompi_rep_dest} \
   CXXFLAGS=-std=c++14 \
   --with-wrapper-cxxflags='-std=c++14' \
   --with-cma \
   --enable-mpi1-compatibility \
   && \
make -j8 \
&& \
make install \
&& \
echo -e "\
export MPIdir=${ompi_rep_dest}\n\
export LD_LIBRARY_PATH=\${MPIdir}/lib:\${LD_LIBRARY_PATH}\n\
export PATH=\${MPIdir}/bin:\${PATH}" > ${ompi_rep_dest}/mpilibs.sh


# Run a command when the container starts.
#CMD ["/bin/bash"]

# Base image.
FROM openmpi:latest

SHELL ["/bin/bash", "-c"]

WORKDIR /

ARG petsc_branch=main
ARG petsc_ver=petsc-${petsc_branch}
ARG petsc_rep_dest=/opt/${petsc_ver}

RUN \
git clone https://gitlab.com/petsc/petsc.git -b ${petsc_branch}

RUN \
source /opt/intel/oneapi/mkl/latest/env/vars.sh intel64 && \
source ${MPIdir}/mpilibs.sh && \
cd petsc && \
  ./configure \
   --prefix=${petsc_rep_dest} \
   --with-mpi-compilers=1 --with-mpi-dir=${MPIdir} \
   --download-ml=yes \
   --download-mumps=yes \
   --download-superlu=yes \
   --with-cxx-dialect=C++14 \
   --with-make-np=12 \
   --with-shared-libraries=1 \
   --with-debugging=1 \
   --with-memalign=64 \
   --with-visibility=0 \
   --with-openmp=1 \
   --with-64-bit-indices=0 \
   --download-hpddm=yes \
   --download-slepc=yes \
   --download-superlu_dist=yes \
   --download-parmetis=yes \
   --download-ptscotch=yes \
   --download-metis=yes \
   --download-strumpack=yes \
   --download-suitesparse=yes \
   --download-hypre=yes \
   --with-blaslapack-dir="$MKLROOT/lib/intel64" \
   --with-mkl_pardiso-dir="$MKLROOT" \
   --with-mkl_cpardiso-dir="$MKLROOT" \
   --with-scalapack=1 \
   --with-scalapack-include="$MKLROOT/include" \
   --with-scalapack-lib="-L$MKLROOT/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64" \
   && \
   export PETSC_ARCH_VAR=$(tail -20 configure.log |grep "PETSC_ARCH:"|awk '{print $2}') && \
   export PETSC_DIR_VAR=$(tail -20 configure.log |grep "PETSC_DIR:"|awk '{print $2}') && \
   make PETSC_DIR="$PETSC_DIR_VAR" PETSC_ARCH="$PETSC_ARCH_VAR" all && \
   make PETSC_DIR="$PETSC_DIR_VAR" PETSC_ARCH="$PETSC_ARCH_VAR" install && \
   touch "${petsc_rep_dest}/hpclibs.sh" && \
   echo -e "source $MKLROOT/env/vars.sh  intel64\n\
   source ${MPIdir}/mpilibs.sh\n\
   export PETSC_DIR=${petsc_rep_dest}\n\
   export PETSC_ARCH=\"\"\n\
   export LD_LIBRARY_PATH=\${PETSC_DIR}/lib:\${LD_LIBRARY_PATH}\n\
   export PATH=\${PETSC_DIR}/bin:\${MPIdir}/bin:\${PATH}\n" >> "${petsc_rep_dest}/hpclibs.sh"

ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
    OMPI_ALLOW_RUN_AS_ROOT=1

RUN source ${petsc_rep_dest}/hpclibs.sh \
&& \
cd /petsc \
&& \
export PETSC_ARCH_VAR=$(tail -20 configure.log |grep "PETSC_ARCH:"|awk '{print $2}') \
&& \
export PETSC_DIR_VAR=$(tail -20 configure.log |grep "PETSC_DIR:"|awk '{print $2}') \
&& \
make PETSC_DIR="$PETSC_DIR_VAR" PETSC_ARCH="$PETSC_ARCH_VAR" test |& tee make_test.log

# Run a command when the container starts.
#CMD ["cd /petsc; echo "You can source ${petsc_rep_dest}/hpclibs.sh to use PETSc"]
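
The petsc Dockerfile recovers PETSC_ARCH and PETSC_DIR by scanning the last 20 lines of configure.log for "KEY: value" lines. The same pipeline can be factored into a small helper, shown here as a sketch. The function name petsc_setting is made up, and the approach assumes, as the Dockerfile does, that the configure summary ends with lines of the form "PETSC_ARCH: <value>".

```shell
# Extract a value such as PETSC_ARCH or PETSC_DIR from the end of configure.log.
# $1: key name (for example PETSC_ARCH); $2: path to configure.log
petsc_setting() {
  # Same pipeline as in the Dockerfile: look at the last 20 lines only,
  # keep the line containing "KEY:", and print its second field.
  tail -20 "$2" | grep "$1:" | awk '{print $2}'
}

# Usage: PETSC_ARCH_VAR=$(petsc_setting PETSC_ARCH configure.log)
```

If the summary ever moves further than 20 lines from the end of the log, the helper returns an empty string, so the result is worth checking before it is passed to make.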
