Hi Barry,

I contacted the supercomputer center, and they asked me for a test case so that they can forward it to IBM. Would it be possible to write a test case that uses only MPI? It would not be a good idea to send them the whole PETSc source plus my code.
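For what it is worth, a minimal MPI-only reproducer along the following lines might be a starting point. This is only a sketch based on my reading of the failure quoted below, not the actual PETSc communication code: it imitates the matstash pattern of posting MPI_Irecv with MPI_ANY_SOURCE, completing the receives one at a time with MPI_Waitany, and checking that the returned status carries a valid MPI_SOURCE (the field that comes back as -32766 in the errors). The all-to-all exchange and the message length are illustrative assumptions and would need to be scaled up toward the failing core counts and problem sizes.

/* mpi_waitany_reproducer.c -- hypothetical MPI-only sketch, NOT the actual
 * PETSc code.  It mimics the matstash pattern: post nonblocking receives
 * with MPI_ANY_SOURCE, complete them one at a time with MPI_Waitany, and
 * verify that the returned status carries a sane MPI_SOURCE.  The message
 * length and the all-to-all exchange are illustrative assumptions; increase
 * nitems and the process count to stress the library. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int rank, size, i, ns = 0, idx;
  const int nitems = 65536;                 /* doubles per message (assumption) */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int          nmsgs   = size - 1;          /* one message from/to every other rank */
  double      *sendbuf = malloc(nitems * sizeof(double));
  double      *recvbuf = malloc((size_t)nmsgs * nitems * sizeof(double));
  MPI_Request *rreq    = malloc(nmsgs * sizeof(MPI_Request));
  MPI_Request *sreq    = malloc(nmsgs * sizeof(MPI_Request));
  MPI_Status   status;

  for (i = 0; i < nitems; i++) sendbuf[i] = (double)rank;

  for (i = 0; i < size; i++) {              /* post receives from any source */
    if (i == rank) continue;
    MPI_Irecv(recvbuf + (size_t)ns * nitems, nitems, MPI_DOUBLE, MPI_ANY_SOURCE,
              0, MPI_COMM_WORLD, &rreq[ns]);
    ns++;
  }
  for (i = 0, ns = 0; i < size; i++) {      /* send to every other rank */
    if (i == rank) continue;
    MPI_Isend(sendbuf, nitems, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &sreq[ns++]);
  }

  for (i = 0; i < nmsgs; i++) {             /* harvest one receive at a time */
    MPI_Waitany(nmsgs, rreq, &idx, &status);
    if (status.MPI_SOURCE < 0 || status.MPI_SOURCE >= size) {
      fprintf(stderr, "[%d] bad status after MPI_Waitany: index=%d MPI_SOURCE=%d MPI_TAG=%d\n",
              rank, idx, status.MPI_SOURCE, status.MPI_TAG);
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
  }
  MPI_Waitall(nmsgs, sreq, MPI_STATUSES_IGNORE);

  if (rank == 0) printf("all %d receives reported a valid MPI_SOURCE\n", nmsgs);
  free(sendbuf); free(recvbuf); free(rreq); free(sreq);
  MPI_Finalize();
  return 0;
}

Compile with mpicc and run at increasing scale; if the check fires under IBM PE but not under Intel MPI, that would be a self-contained case IBM can act on.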
On Sun, Oct 20, 2013 at 2:15 PM, Barry Smith <[email protected]> wrote:
>
> It is unfortunate that IBM has perpetuated this error in their libraries and
> made it worse.
>
> You can, of course, work around it by making your application code far
> more complicated and managing the matrix assembly yourself, but that is not a
> good use of your time or anyone's time. Plus, how do you know that their
> buggy MPI won't bite you somewhere else, like in the new code you would
> need to write?
>
> You need to report this bug to IBM and they need to take it seriously;
> unfortunately, if you are not the purchaser of the IBM machine you are running
> on, they may not care (companies only care about paying customers who complain).
>
> Can you just use the Intel MPI compilers/libraries? Or switch to some other
> system that is not from IBM? Better to just not use this machine until IBM
> straightens it out.
>
>    Barry
>
>
> On Oct 20, 2013, at 3:04 PM, Fande Kong <[email protected]> wrote:
>
> > This behaviour is really, really strange.
> >
> > The yellowstone supercomputer updated the IBM PE to version 1.3.0.4 about
> > two months ago
> > (https://dailyb.cisl.ucar.edu/bulletins/yellowstone-outage-august-27-update-ibm-pe-and-lsf).
> > I recompiled PETSc and my code as they suggested. Unfortunately, this
> > problem reoccurs even with a small number of processors (512 cores).
> >
> > The problem previously happened only with a large number of processors and
> > a large problem size, but now it occurs even with a small number of
> > processors and a small problem, whether I use IBM MPI or Intel MPI.
> >
> > The exact same code runs fine on another supercomputer. I think the code in
> > matstash.c is really sensitive to the IBM PE, and it is hard for me to fix.
> > Can we disable the stash and send the off-process data ourselves? Or can we
> > attach a scatter to the Mat to exchange off-process values?
> >
> > Error messages:
> >
> > [76]PETSC ERROR: --------------------- Error Message ------------------------------------
> > [76]PETSC ERROR: Petsc has generated inconsistent data!
> > [76]PETSC ERROR: Negative MPI source: stash->nrecvs=5 i=7 MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=371173!
> > [76]PETSC ERROR: ------------------------------------------------------------------------
> > [76]PETSC ERROR: Petsc Release Version 3.4.1, unknown
> > [76]PETSC ERROR: See docs/changes/index.html for recent updates.
> > [76]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> > [76]PETSC ERROR: See docs/index.html for manual pages.
> > [76]PETSC ERROR: ------------------------------------------------------------------------
> > [76]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-opt named ys0623 by fandek Sat Oct 19 00:26:16 2013
> > [76]PETSC ERROR: Libraries linked from /glade/p/work/fandek/petsc/arch-linux2-cxx-opt/lib
> > [76]PETSC ERROR: Configure run at Fri Oct 18 23:57:35 2013
> > [76]PETSC ERROR: Configure options --with-valgrind=1 --with-clanguage=cxx --with-shared-libraries=1 --with-dynamic-loading=1 --download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1 --download-metis=1 --with-64-bit-indices=1 --download-netcdf=1 --download-exodusii=1 --download-ptscotch=1 --download-hdf5=1 --with-debugging=no
> > [76]PETSC ERROR: ------------------------------------------------------------------------
> > [76]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /glade/p/work/fandek/petsc/src/mat/utils/matstash.c
> > [76]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /glade/p/work/fandek/petsc/src/mat/impls/aij/mpi/mpiaij.c
> > [76]PETSC ERROR: MatAssemblyEnd() line 4939 in /glade/p/work/fandek/petsc/src/mat/interface/matrix.c
> > [76]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in meshreorder.cpp
> > [76]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in meshreorder.cpp
> > [76]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp
> > [76]PETSC ERROR: DMmeshInitialize() line 95 in mgInitialize.cpp
> > [76]PETSC ERROR: main() line 69 in linearElasticity3d.cpp
> > Abort(77) on node 76 (rank 76 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 77) - process 76
> > ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in task 76
> >
> > Thanks
> >
> >
> > On Wed, Jun 26, 2013 at 8:56 AM, Jeff Hammond <[email protected]> wrote:
> > This concerns IBM PE-MPI on iDataPlex, which is likely based upon the
> > cluster implementation of PAMI, which is a completely different code base
> > from the PAMI Blue Gene implementation. If you can reproduce it on Blue
> > Gene/Q, I will care.
> >
> > As an IBM customer, NCAR is endowed with the ability to file bug reports
> > directly with IBM related to the products they possess. There is a link to
> > their support system on http://www2.cisl.ucar.edu/resources/yellowstone,
> > which is the appropriate channel for users of Yellowstone who have issues
> > with the system software installed there.
> >
> > Jeff
> >
> > ----- Original Message -----
> > From: "Jed Brown" <[email protected]>
> > To: "Fande Kong" <[email protected]>, "petsc-users" <[email protected]>
> > Cc: "Jeff Hammond" <[email protected]>
> > Sent: Wednesday, June 26, 2013 9:21:48 AM
> > Subject: Re: [petsc-users] How to understand these error messages
> >
> > Fande Kong <[email protected]> writes:
> >
> > > Hi Barry,
> > >
> > > If I use the Intel MPI, my code runs correctly and produces correct
> > > results. Yes, you are right. The IBM MPI has some bugs.
> >
> > Fande, please report this issue to IBM.
> >
> > Jeff, Fande has a reproducible case where, when running on 10k cores with
> > problem sizes over 100M, this
> >
> >   MPI_Waitany(2*stash->nrecvs,stash->recv_waits,&i,&recv_status);
> >
> > returns
> >
> >   [6724]PETSC ERROR: Negative MPI source: stash->nrecvs=8 i=11
> >   MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=20613892!
> >
> > It runs correctly for smaller problem sizes, smaller core counts, or for
> > all sizes when using Intel MPI. This is on Yellowstone (iDataPlex, 4500
> > dx360 nodes).
> > Do you know someone at IBM that should be notified?
> >
> > --
> > Jeff Hammond
> > Argonne Leadership Computing Facility
> > University of Chicago Computation Institute
> > [email protected] / (630) 252-5381
> > http://www.linkedin.com/in/jeffhammond
> > https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> > ALCF docs: http://www.alcf.anl.gov/user-guides
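Regarding the question above about disabling the stash: one possible direction, sketched below, is to restructure assembly so that every rank generates only the entries of rows it owns and to tell PETSc that no off-process entries will arrive, so MatAssemblyBegin/End can skip the stash communication entirely. This is only a sketch under two assumptions: that the application can be reorganized this way, and that MAT_NO_OFF_PROC_ENTRIES is available in the PETSc version in use. The matrix size and the diagonal assembly loop are purely illustrative, and error checking (CHKERRQ) is omitted for brevity.

/* assemble_local_only.c -- a sketch (not a drop-in fix) of assembling an
 * MPIAIJ matrix without using the MatStash: each rank inserts values only
 * into rows it owns, and MAT_NO_OFF_PROC_ENTRIES promises PETSc that no
 * off-process entries will arrive, so MatAssemblyBegin/End need no stash
 * communication.  Sizes and values below are illustrative assumptions. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  PetscInt    i, rstart, rend, N = 1000;
  PetscScalar v = 2.0;

  PetscInitialize(&argc, &argv, NULL, NULL);

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);
  MatSetFromOptions(A);
  MatSetUp(A);

  /* Promise that this rank only inserts into locally owned rows */
  MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);

  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = rstart; i < rend; i++) {
    MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);  /* owned rows only */
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);   /* no stash traffic expected */
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatDestroy(&A);
  PetscFinalize();
  return 0;
}

Note Barry's caveat above still applies: if the application genuinely produces contributions to off-process rows (as finite-element assembly usually does), it would have to exchange those contributions itself, and that hand-written communication could hit the same MPI bug.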
