True, the bug reports come to us and we get the blame.
> On Aug 21, 2020, at 2:50 PM, Matthew Knepley <knep...@gmail.com> wrote:
>
> On Fri, Aug 21, 2020 at 3:32 PM Barry Smith <bsm...@petsc.dev> wrote:
>
>    Yes, absolutely, a test suite will not solve all problems. In the PETSc
> model, which is not uncommon, each bug/problem found is supposed to result in
> another test that detects that problem, so the test suite can catch repeats of
> the problem without redoing all the hard work from scratch.
>
>    So this OpenMPI suite, if it gets off the ground, will be valuable ONLY if
> they accept community additions efficiently and happily. For example, would
> the test suite detect the problem reported by the PETSc user? It should be
> trivial to have the user run the suite on their system (which is why it needs
> to be very easy to run) and find out. If it does not detect the problem, then,
> working with the appropriate "test suite" community, we could submit an MR to
> the test suite that looks for the problem and finds it. Now the test suite is
> better, and we have one less hassle that comes up multiple times for us. In
> addition, the OpenMPI and MPICH developers, etc., should do the same thing: each
> time they fix a bug that was not detected by testing, they should donate to
> the universal test suite the code that reproduces the bug.
>
>    The question is whether our effort in helping the MPI test suite community
> would be more than our "wasted" effort dealing with buggy MPIs.
>
>    Barry
>
>    It is a bit curious that after 25 years no friendly, extensible, universal
> MPI test suite community has emerged. Perhaps it is because each MPI
> implementation has its own test processes and suites and so cannot form the
> wider community needed to maintain a single friendly, extensible, universal
> MPI test suite. Looking back, one could say this was a mistake of the MPI
> Forum; they should have set that in motion in 1995. It would have saved a lot
> of duplicated effort and would be very, very good now.
>
> I think they do not do it because people do not hold implementors
> accountable, only the packages using MPI.
>
>    Matt
>
>> On Aug 21, 2020, at 2:17 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>
>> Barry,
>>   I mentioned a test suite from MPICH at
>> https://lists.mcs.anl.gov/pipermail/petsc-users/2020-July/041738.html.
>> Since it is not easy to use, I did not put it in the PETSc FAQ.
>>   I also asked on the OpenMPI mailing list. An OpenMPI developer said he
>> could make their tests public and is in the process of checking with all
>> the authors about a license :). If that works out, the tests will be at
>> https://github.com/open-mpi/ompi-tests-public
>>
>>   A test suite will be helpful, but I doubt it will solve the problem. A
>> user's particular case (number of ranks, message size, communication
>> pattern, etc.) might not be covered by a test suite.
>> --Junchao Zhang
>>
>>
>> On Fri, Aug 21, 2020 at 12:33 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>    There really needs to be a usable, extensive MPI test suite that can find
>> these performance issues; we spend time helping users with these problems
>> when it is really the MPI community's job.
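A minimal sketch of the kind of self-contained regression test Barry describes contributing back to such a suite. The exchange pattern, message size, and PASS/FAIL convention below are illustrative assumptions, not taken from any existing suite; the point is only that each fixed bug becomes a small program a harness can rerun indefinitely.

/* Sketch: nonblocking all-pairs exchange, loosely mimicking the sparse
 * assembly traffic discussed in this thread. Exits nonzero on data
 * corruption so a test harness can flag regressions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int rank, size, i, j, errs = 0, toterrs;
  const int n = 1024; /* ints per peer; vary to probe message-size effects */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int *sendbuf = malloc((size_t)n * size * sizeof(int));
  int *recvbuf = malloc((size_t)n * size * sizeof(int));
  MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(MPI_Request));

  for (i = 0; i < n * size; i++) sendbuf[i] = rank;

  /* Post all receives, then all sends, and wait on everything. */
  for (i = 0; i < size; i++) {
    MPI_Irecv(recvbuf + i * n, n, MPI_INT, i, 0, MPI_COMM_WORLD, &reqs[i]);
    MPI_Isend(sendbuf + i * n, n, MPI_INT, i, 0, MPI_COMM_WORLD, &reqs[size + i]);
  }
  MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);

  /* The block received from rank i must contain i everywhere. */
  for (i = 0; i < size; i++)
    for (j = 0; j < n; j++)
      if (recvbuf[i * n + j] != i) errs++;

  MPI_Allreduce(&errs, &toterrs, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  if (rank == 0) printf("%s\n", toterrs ? "FAIL" : "PASS");

  free(sendbuf); free(recvbuf); free(reqs);
  MPI_Finalize();
  return toterrs ? 1 : 0;
}

A hang rather than corruption would surface as a harness timeout; the requirement Barry states -- easy to build, easy to run, unambiguous result -- is what would make such a suite usable for bug triage.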
>>
>>
>>> On Aug 21, 2020, at 11:55 AM, Manav Bhatia <bhatiama...@gmail.com> wrote:
>>>
>>> I built petsc with mpich-3.3.2 on my MacBook Pro with Apple clang 11.0.3,
>>> and the test is finishing at my end.
>>>
>>> So, it appears that there is some issue with openmpi-4.0.1 on this machine.
>>>
>>> I will now build all of my dependency toolchain with mpich, and hopefully
>>> things will work for my application code.
>>>
>>> Thank you again for your help.
>>>
>>> Regards,
>>> Manav
>>>
>>>
>>>> On Aug 20, 2020, at 10:45 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>
>>>> Manav,
>>>>   I downloaded your petsc_mat.tgz but could not reproduce the problem, on
>>>> both Linux and Mac. I used the petsc commit id df0e4300 you mentioned.
>>>>   On Linux, I have openmpi-4.0.2 + gcc-8.3.0, and petsc is configured with
>>>> --with-debugging --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpifort
>>>> --COPTFLAGS="-g -O0" --FOPTFLAGS="-g -O0" --CXXOPTFLAGS="-g -O0"
>>>> --PETSC_ARCH=linux-host-dbg
>>>>   On Mac, I have mpich-3.3.1 + clang-11.0.0-apple, and petsc is configured with
>>>> --with-debugging=1 --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpifort
>>>> --with-ctable=0 COPTFLAGS="-O0 -g" CXXOPTFLAGS="-O0 -g"
>>>> PETSC_ARCH=mac-clang-dbg
>>>>
>>>> mpirun -n 8 ./test
>>>> rank: 1 : stdout.processor.1
>>>> rank: 4 : stdout.processor.4
>>>> rank: 0 : stdout.processor.0
>>>> rank: 5 : stdout.processor.5
>>>> rank: 6 : stdout.processor.6
>>>> rank: 7 : stdout.processor.7
>>>> rank: 3 : stdout.processor.3
>>>> rank: 2 : stdout.processor.2
>>>> rank: 1 : Beginning reading nnz...
>>>> rank: 4 : Beginning reading nnz...
>>>> rank: 0 : Beginning reading nnz...
>>>> rank: 5 : Beginning reading nnz...
>>>> rank: 7 : Beginning reading nnz...
>>>> rank: 2 : Beginning reading nnz...
>>>> rank: 3 : Beginning reading nnz...
>>>> rank: 6 : Beginning reading nnz...
>>>> rank: 5 : Finished reading nnz
>>>> rank: 5 : Beginning mat preallocation...
>>>> rank: 3 : Finished reading nnz
>>>> rank: 3 : Beginning mat preallocation...
>>>> rank: 4 : Finished reading nnz
>>>> rank: 4 : Beginning mat preallocation...
>>>> rank: 7 : Finished reading nnz
>>>> rank: 7 : Beginning mat preallocation...
>>>> rank: 1 : Finished reading nnz
>>>> rank: 1 : Beginning mat preallocation...
>>>> rank: 0 : Finished reading nnz
>>>> rank: 0 : Beginning mat preallocation...
>>>> rank: 2 : Finished reading nnz
>>>> rank: 2 : Beginning mat preallocation...
>>>> rank: 6 : Finished reading nnz
>>>> rank: 6 : Beginning mat preallocation...
>>>> rank: 5 : Finished preallocation
>>>> rank: 5 : Beginning reading and setting matrix values...
>>>> rank: 1 : Finished preallocation
>>>> rank: 1 : Beginning reading and setting matrix values...
>>>> rank: 7 : Finished preallocation
>>>> rank: 7 : Beginning reading and setting matrix values...
>>>> rank: 2 : Finished preallocation
>>>> rank: 2 : Beginning reading and setting matrix values...
>>>> rank: 4 : Finished preallocation
>>>> rank: 4 : Beginning reading and setting matrix values...
>>>> rank: 0 : Finished preallocation
>>>> rank: 0 : Beginning reading and setting matrix values...
>>>> rank: 3 : Finished preallocation
>>>> rank: 3 : Beginning reading and setting matrix values...
>>>> rank: 6 : Finished preallocation
>>>> rank: 6 : Beginning reading and setting matrix values...
>>>> rank: 1 : Finished reading and setting matrix values
>>>> rank: 1 : Beginning mat assembly...
>>>> rank: 5 : Finished reading and setting matrix values
>>>> rank: 5 : Beginning mat assembly...
>>>> rank: 4 : Finished reading and setting matrix values
>>>> rank: 4 : Beginning mat assembly...
>>>> rank: 2 : Finished reading and setting matrix values
>>>> rank: 2 : Beginning mat assembly...
>>>> rank: 3 : Finished reading and setting matrix values
>>>> rank: 3 : Beginning mat assembly...
>>>> rank: 7 : Finished reading and setting matrix values
>>>> rank: 7 : Beginning mat assembly...
>>>> rank: 6 : Finished reading and setting matrix values
>>>> rank: 6 : Beginning mat assembly...
>>>> rank: 0 : Finished reading and setting matrix values
>>>> rank: 0 : Beginning mat assembly...
>>>> rank: 1 : Finished mat assembly
>>>> rank: 3 : Finished mat assembly
>>>> rank: 7 : Finished mat assembly
>>>> rank: 0 : Finished mat assembly
>>>> rank: 5 : Finished mat assembly
>>>> rank: 2 : Finished mat assembly
>>>> rank: 4 : Finished mat assembly
>>>> rank: 6 : Finished mat assembly
>>>>
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Thu, Aug 20, 2020 at 5:29 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>   I will have a look and report back to you. Thanks.
>>>> --Junchao Zhang
>>>>
>>>>
>>>> On Thu, Aug 20, 2020 at 5:23 PM Manav Bhatia <bhatiama...@gmail.com> wrote:
>>>>   I have created a standalone test that demonstrates the problem at my end.
>>>> I have stored the indices, etc. from my problem in a text file for each
>>>> rank, which I use to initialize the matrix.
>>>>   Please note that the test is specifically for 8 ranks.
>>>>
>>>>   The .tgz file is on my Google Drive:
>>>> https://drive.google.com/file/d/1R-WjS36av3maXX3pUyiR3ndGAxteTVj-/view?usp=sharing
>>>>
>>>>   This contains a README file with instructions on running. Please note
>>>> that the work directory needs the index files.
>>>>
>>>>   Please let me know if I can provide any further information.
>>>>
>>>>   Thank you all for your help.
>>>>
>>>>   Regards,
>>>>   Manav
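The stage markers in the log above outline the structure of the test. For orientation, here is a hedged sketch of that preallocate/set/assemble flow in PETSc; the sizes, nnz counts, and values below are placeholders, whereas Manav's actual test reads them from the per-rank index files shipped in the tarball.

/* Sketch of the assembly pattern the log stages suggest; not Manav's
 * actual test. All counts and values are placeholders. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  PetscInt       nlocal = 100, rstart, rend, N, i;
  PetscInt      *dnnz, *onnz;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* "Beginning reading nnz...": per-row nonzero counts for the diagonal
     and off-diagonal blocks (placeholders here, read from files there). */
  ierr = PetscMalloc2(nlocal, &dnnz, nlocal, &onnz); CHKERRQ(ierr);
  for (i = 0; i < nlocal; i++) { dnnz[i] = 5; onnz[i] = 2; }

  /* "Beginning mat preallocation..." */
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetSizes(A, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE); CHKERRQ(ierr);
  ierr = MatSetType(A, MATMPIAIJ); CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(A, 0, dnnz, 0, onnz); CHKERRQ(ierr);

  /* "Beginning reading and setting matrix values...": entries destined for
     rows owned by other ranks are stashed and exchanged during assembly. */
  ierr = MatGetOwnershipRange(A, &rstart, &rend); CHKERRQ(ierr);
  ierr = MatGetSize(A, &N, NULL); CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    ierr = MatSetValue(A, i, i, 1.0, ADD_VALUES); CHKERRQ(ierr);
  }
  /* One deliberately off-rank entry so the stash communication at issue
     in this thread is actually exercised. */
  ierr = MatSetValue(A, rend % N, rstart, 1.0, ADD_VALUES); CHKERRQ(ierr);

  /* "Beginning mat assembly...": the communication phase where the
     reported hang occurred. */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

  ierr = PetscFree2(dnnz, onnz); CHKERRQ(ierr);
  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}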
>>>>> On Aug 20, 2020, at 12:54 PM, Jed Brown <j...@jedbrown.org> wrote:
>>>>>
>>>>> Matthew Knepley <knep...@gmail.com> writes:
>>>>>
>>>>>> On Thu, Aug 20, 2020 at 11:09 AM Manav Bhatia <bhatiama...@gmail.com> wrote:
>>>>>>
>>>>>>> On Aug 20, 2020, at 8:31 AM, Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>
>>>>>>> Can you add an MPI_Barrier before
>>>>>>>
>>>>>>> ierr = MatAssemblyBegin(aij->A,mode);CHKERRQ(ierr);
>>>>>>>
>>>>>>>
>>>>>>> With an MPI_Barrier before this function call:
>>>>>>> - three of the processes have already hit this barrier,
>>>>>>> - the other five are inside MatStashScatterGetMesg_Private ->
>>>>>>> MatStashScatterGetMesg_BTS -> MPI_Waitsome (2 processes) /
>>>>>>> MPI_Waitall (3 processes)
>>>>>
>>>>> This is not itself evidence of inconsistent state. You can use
>>>>>
>>>>>   -build_twosided allreduce
>>>>>
>>>>> to avoid the nonblocking sparse algorithm.
>>>>>
>>>>>> Okay, you should run this with -matstash_legacy just to make sure it is
>>>>>> not a bug in your MPI implementation. But it looks like
>>>>>> there is an inconsistency in the parallel state. This can happen because
>>>>>> we have a bug, or it could be that you called a collective
>>>>>> operation on a subset of the processes. Is there any way you could cut
>>>>>> down the example (say, put all 1s in the matrix, etc.) so
>>>>>> that you could give it to us to run?
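A hedged sketch of a user-level variant of this barrier experiment. Stefano's suggestion targets the MatAssemblyBegin(aij->A,mode) call inside PETSc's MPIAIJ assembly code, which requires editing the library; bracketing the application's own assembly calls, as below, is a coarser but related check.

/* If a rank hangs at the first barrier, some rank never reached assembly
 * (e.g., a collective was called on a subset of processes); if all ranks
 * pass the first barrier but hang inside MatAssemblyBegin/End, the stash
 * exchange itself is stuck, and the options discussed above
 * (-build_twosided allreduce, -matstash_legacy) select alternative
 * exchange algorithms to help isolate an MPI-implementation bug. */
#include <petscmat.h>

static PetscErrorCode AssembleWithBarriers(Mat A)
{
  MPI_Comm       comm;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscObjectGetComm((PetscObject)A, &comm); CHKERRQ(ierr);
  ierr = MPI_Barrier(comm); CHKERRQ(ierr);  /* did every rank get here? */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MPI_Barrier(comm); CHKERRQ(ierr);  /* did every rank finish? */
  PetscFunctionReturn(0);
}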
> --
> What most experimenters take for granted before they begin their experiments
> is infinitely more interesting than any results to which their experiments
> lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/