Hi Barry,

I contacted the supercomputer center, and they asked me for a test case so that they can forward it to IBM. Would it be possible to write a test case that uses only MPI? It would not be a good idea to send them the whole PETSc source plus my code.
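For what it is worth, a minimal MPI-only reproducer along the following lines might be a starting point. This is only a sketch based on my reading of the failure quoted below, not the actual PETSc communication code: it imitates the matstash pattern of posting MPI_Irecv with MPI_ANY_SOURCE, completing the receives one at a time with MPI_Waitany, and checking that the returned status carries a valid MPI_SOURCE (the field that comes back as -32766 in the errors). The all-to-all exchange and the message length are illustrative assumptions and would need to be scaled up toward the failing core counts and problem sizes.

/* mpi_waitany_reproducer.c -- hypothetical MPI-only sketch, NOT the actual
 * PETSc code.  It mimics the matstash pattern: post nonblocking receives
 * with MPI_ANY_SOURCE, complete them one at a time with MPI_Waitany, and
 * verify that the returned status carries a sane MPI_SOURCE.  The message
 * length and the all-to-all exchange are illustrative assumptions; increase
 * nitems and the process count to stress the library. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int rank, size, i, ns = 0, idx;
  const int nitems = 65536;                 /* doubles per message (assumption) */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int          nmsgs   = size - 1;          /* one message from/to every other rank */
  double      *sendbuf = malloc(nitems * sizeof(double));
  double      *recvbuf = malloc((size_t)nmsgs * nitems * sizeof(double));
  MPI_Request *rreq    = malloc(nmsgs * sizeof(MPI_Request));
  MPI_Request *sreq    = malloc(nmsgs * sizeof(MPI_Request));
  MPI_Status   status;

  for (i = 0; i < nitems; i++) sendbuf[i] = (double)rank;

  for (i = 0; i < size; i++) {              /* post receives from any source */
    if (i == rank) continue;
    MPI_Irecv(recvbuf + (size_t)ns * nitems, nitems, MPI_DOUBLE, MPI_ANY_SOURCE,
              0, MPI_COMM_WORLD, &rreq[ns]);
    ns++;
  }
  for (i = 0, ns = 0; i < size; i++) {      /* send to every other rank */
    if (i == rank) continue;
    MPI_Isend(sendbuf, nitems, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &sreq[ns++]);
  }

  for (i = 0; i < nmsgs; i++) {             /* harvest one receive at a time */
    MPI_Waitany(nmsgs, rreq, &idx, &status);
    if (status.MPI_SOURCE < 0 || status.MPI_SOURCE >= size) {
      fprintf(stderr, "[%d] bad status after MPI_Waitany: index=%d MPI_SOURCE=%d MPI_TAG=%d\n",
              rank, idx, status.MPI_SOURCE, status.MPI_TAG);
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
  }
  MPI_Waitall(nmsgs, sreq, MPI_STATUSES_IGNORE);

  if (rank == 0) printf("all %d receives reported a valid MPI_SOURCE\n", nmsgs);
  free(sendbuf); free(recvbuf); free(rreq); free(sreq);
  MPI_Finalize();
  return 0;
}

Compile with mpicc and run at increasing scale; if the check fires under IBM PE but not under Intel MPI, that would be a self-contained case IBM can act on.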
On Sun, Oct 20, 2013 at 2:15 PM, Barry Smith <[email protected]> wrote:
>
> It is unfortunate that IBM has perpetuated this error in their libraries and
> made it worse.
>
> You can, of course, work around it by making your application code far
> more complicated and managing the matrix assembly yourself, but that is not a
> good use of your time or anyone's time. Plus, how do you know that their
> buggy MPI won't bite you somewhere else, like in the new code you would
> need to write?
>
> You need to report this bug to IBM and they need to take it seriously;
> unfortunately, if you are not the purchaser of the IBM machine you are running
> on, they may not care (companies only care about paying customers who complain).
>
> Can you just use the Intel MPI compilers/libraries? Or switch to some other
> system that is not from IBM? Better to just not use this machine until IBM
> straightens it out.
>
>    Barry
>
>
> On Oct 20, 2013, at 3:04 PM, Fande Kong <[email protected]> wrote:
>
> > This behaviour is really, really strange.
> >
> > The yellowstone supercomputer updated the IBM PE to version 1.3.0.4 about
> > two months ago
> > (https://dailyb.cisl.ucar.edu/bulletins/yellowstone-outage-august-27-update-ibm-pe-and-lsf).
> > I recompiled PETSc and my code as they suggested. Unfortunately, this
> > problem reoccurs even with a small number of processors (512 cores).
> >
> > The problem previously happened only with a large number of processors and
> > a large problem size, but now it occurs even with a small number of
> > processors and a small problem, whether I use IBM MPI or Intel MPI.
> >
> > The exact same code runs fine on another supercomputer. I think the code in
> > matstash.c is really sensitive to the IBM PE, and it is hard for me to fix.
> > Can we disable the stash and send the off-process data ourselves? Or can we
> > attach a scatter to the Mat to exchange off-process values?
> >
> > Error messages:
> >
> > [76]PETSC ERROR: --------------------- Error Message ------------------------------------
> > [76]PETSC ERROR: Petsc has generated inconsistent data!
> > [76]PETSC ERROR: Negative MPI source: stash->nrecvs=5 i=7 MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=371173!
> > [76]PETSC ERROR: ------------------------------------------------------------------------
> > [76]PETSC ERROR: Petsc Release Version 3.4.1, unknown
> > [76]PETSC ERROR: See docs/changes/index.html for recent updates.
> > [76]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> > [76]PETSC ERROR: See docs/index.html for manual pages.
> > [76]PETSC ERROR: ------------------------------------------------------------------------
> > [76]PETSC ERROR: ./linearElasticity on a arch-linux2-cxx-opt named ys0623 by fandek Sat Oct 19 00:26:16 2013
> > [76]PETSC ERROR: Libraries linked from /glade/p/work/fandek/petsc/arch-linux2-cxx-opt/lib
> > [76]PETSC ERROR: Configure run at Fri Oct 18 23:57:35 2013
> > [76]PETSC ERROR: Configure options --with-valgrind=1 --with-clanguage=cxx --with-shared-libraries=1 --with-dynamic-loading=1 --download-f-blas-lapack=1 --with-mpi=1 --download-parmetis=1 --download-metis=1 --with-64-bit-indices=1 --download-netcdf=1 --download-exodusii=1 --download-ptscotch=1 --download-hdf5=1 --with-debugging=no
> > [76]PETSC ERROR: ------------------------------------------------------------------------
> > [76]PETSC ERROR: MatStashScatterGetMesg_Private() line 633 in /glade/p/work/fandek/petsc/src/mat/utils/matstash.c
> > [76]PETSC ERROR: MatAssemblyEnd_MPIAIJ() line 676 in /glade/p/work/fandek/petsc/src/mat/impls/aij/mpi/mpiaij.c
> > [76]PETSC ERROR: MatAssemblyEnd() line 4939 in /glade/p/work/fandek/petsc/src/mat/interface/matrix.c
> > [76]PETSC ERROR: SpmcsDMMeshCreatVertexMatrix() line 65 in meshreorder.cpp
> > [76]PETSC ERROR: SpmcsDMMeshReOrderingMeshPoints() line 125 in meshreorder.cpp
> > [76]PETSC ERROR: CreateProblem() line 59 in preProcessSetUp.cpp
> > [76]PETSC ERROR: DMmeshInitialize() line 95 in mgInitialize.cpp
> > [76]PETSC ERROR: main() line 69 in linearElasticity3d.cpp
> > Abort(77) on node 76 (rank 76 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 77) - process 76
> > ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in task 76
> >
> > Thanks
> >
> >
> > On Wed, Jun 26, 2013 at 8:56 AM, Jeff Hammond <[email protected]> wrote:
> > This concerns IBM PE-MPI on iDataPlex, which is likely based upon the
> > cluster implementation of PAMI, which is a completely different code base
> > from the PAMI Blue Gene implementation. If you can reproduce it on Blue
> > Gene/Q, I will care.
> >
> > As an IBM customer, NCAR is endowed with the ability to file bug reports
> > directly with IBM related to the products they possess. There is a link to
> > their support system on http://www2.cisl.ucar.edu/resources/yellowstone,
> > which is the appropriate channel for users of Yellowstone who have issues
> > with the system software installed there.
> >
> > Jeff
> >
> > ----- Original Message -----
> > From: "Jed Brown" <[email protected]>
> > To: "Fande Kong" <[email protected]>, "petsc-users" <[email protected]>
> > Cc: "Jeff Hammond" <[email protected]>
> > Sent: Wednesday, June 26, 2013 9:21:48 AM
> > Subject: Re: [petsc-users] How to understand these error messages
> >
> > Fande Kong <[email protected]> writes:
> >
> > > Hi Barry,
> > >
> > > If I use the Intel MPI, my code runs correctly and produces correct
> > > results. Yes, you are right. The IBM MPI has some bugs.
> >
> > Fande, please report this issue to IBM.
> >
> > Jeff, Fande has a reproducible case where, when running on 10k cores with
> > problem sizes over 100M, this
> >
> >   MPI_Waitany(2*stash->nrecvs,stash->recv_waits,&i,&recv_status);
> >
> > returns
> >
> >   [6724]PETSC ERROR: Negative MPI source: stash->nrecvs=8 i=11
> >   MPI_SOURCE=-32766 MPI_TAG=-32766 MPI_ERROR=20613892!
> >
> > It runs correctly for smaller problem sizes, smaller core counts, or for
> > all sizes when using Intel MPI. This is on Yellowstone (iDataPlex, 4500
> > dx360 nodes).
> > Do you know someone at IBM that should be notified?
> >
> > --
> > Jeff Hammond
> > Argonne Leadership Computing Facility
> > University of Chicago Computation Institute
> > [email protected] / (630) 252-5381
> > http://www.linkedin.com/in/jeffhammond
> > https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> > ALCF docs: http://www.alcf.anl.gov/user-guides
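Regarding the question above about disabling the stash: one possible direction, sketched below, is to restructure assembly so that every rank generates only the entries of rows it owns and to tell PETSc that no off-process entries will arrive, so MatAssemblyBegin/End can skip the stash communication entirely. This is only a sketch under two assumptions: that the application can be reorganized this way, and that MAT_NO_OFF_PROC_ENTRIES is available in the PETSc version in use. The matrix size and the diagonal assembly loop are purely illustrative, and error checking (CHKERRQ) is omitted for brevity.

/* assemble_local_only.c -- a sketch (not a drop-in fix) of assembling an
 * MPIAIJ matrix without using the MatStash: each rank inserts values only
 * into rows it owns, and MAT_NO_OFF_PROC_ENTRIES promises PETSc that no
 * off-process entries will arrive, so MatAssemblyBegin/End need no stash
 * communication.  Sizes and values below are illustrative assumptions. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  PetscInt    i, rstart, rend, N = 1000;
  PetscScalar v = 2.0;

  PetscInitialize(&argc, &argv, NULL, NULL);

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);
  MatSetFromOptions(A);
  MatSetUp(A);

  /* Promise that this rank only inserts into locally owned rows */
  MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);

  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = rstart; i < rend; i++) {
    MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);  /* owned rows only */
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);   /* no stash traffic expected */
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatDestroy(&A);
  PetscFinalize();
  return 0;
}

Note Barry's caveat above still applies: if the application genuinely produces contributions to off-process rows (as finite-element assembly usually does), it would have to exchange those contributions itself, and that hand-written communication could hit the same MPI bug.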
