On Oct 17, 2012, at 3:23 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> No. The problem is that Open MPI does not fix their critical bugs, they just
> downgrade them from "blocker" so they can make a release. It's not easy for
> them to fix because they need to refactor some lower level protocols.

   Ahh yes. It is nice to have a sophisticated bug tracking system; it makes
it easy to relabel bugs with a single click to avoid doing work. We should add
this to PETSc :-)

   BTW: Shouldn't you have configure detect this issue and turn off the
building of SF or print appropriate error messages so we don't get these
confusing petsc-maint that only you can understand?

   Barry

>
> On Wed, Oct 17, 2012 at 3:17 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>   Could this problem be related to the fact that ALL MPI implementations
> fail to properly handle large counts: counts that fit into an int when
> measured in 64-bit chunks (MPI_DOUBLE) but, when multiplied by 8 to convert
> to bytes, no longer fit into an int and thus roll over and screw up the MPI
> implementation? That is the reason I needed to implement MPIULong_Send() and
> use it in a few places.
>
>   I know it is a different circumstance but it has the same symptom of
> failing for big matrices but not small.
>
>
>    Barry
>
>
>
> Begin forwarded message:
>
>> From: Thomas Witkowski <thomas.witkowski at tu-dresden.de>
>> Subject: Re: [petsc-users] MatTransposeMatMult ends up with an MPI error
>> Date: October 17, 2012 2:57:05 PM CDT
>> To: petsc-users at mcs.anl.gov
>> Reply-To: PETSc users list <petsc-users at mcs.anl.gov>
>>
>> On 17.10.2012 17:50, Hong Zhang wrote:
>>> Thomas:
>>>
>>> Does this occur only for large matrices?
>>> Can you dump your matrices into petsc binary files
>>> (e.g., A.dat, B.dat) and send to us for debugging?
>>>
>>> Lately, we added a new implementation of MatTransposeMatMult() in petsc-dev
>>> which has been shown to be much faster than the released
>>> MatTransposeMatMult(). You might give it a try by
>>> 1. install petsc-dev (see
>>> http://www.mcs.anl.gov/petsc/developers/index.html)
>>> 2. run your code with option '-mattransposematmult_viamatmatmult 1'
>>> Let us know what you get.
>>>
>> I checked the problem with petsc-dev. Here, the code just hangs somewhere
>> inside MatTransposeMatMult. I checked what MatTranspose does on the
>> corresponding matrix and the behavior is the same. I extracted the matrix
>> from my simulations; it is of size 123,432 x 1,533,726 and very sparse (2 to
>> 8 nnzs per row). I'm sorry, but this is the smallest matrix where I found
>> the problem (I will send the matrix file to petsc-maint). I wrote a small
>> piece of code that just reads the matrix and runs MatTranspose. With 1 MPI
>> task, it works fine. With a small number of MPI tasks (around 8), I get the
>> following error message:
>>
>> [1]PETSC ERROR: ------------------------------------------------------------------------
>> [1]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the batch system) has told this process to end
>> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> [1]PETSC ERROR: likely location of problem given in stack below
>> [1]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>> [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>> [1]PETSC ERROR: INSTEAD the line number of the start of the function
>> [1]PETSC ERROR: is given.
>> [1]PETSC ERROR: [1] PetscSFReduceEnd line 1259 src/sys/sf/sf.c
>> [1]PETSC ERROR: [1] MatTranspose_MPIAIJ line 2045 src/mat/impls/aij/mpi/mpiaij.c
>> [1]PETSC ERROR: [1] MatTranspose line 4341 src/mat/interface/matrix.c
>>
>>
>> With 32 MPI tasks, which I also use in my simulation, the code hangs in
>> MatTranspose.
>>
>> If there is something more I can do to help you find the problem, please
>> let me know!
>>
>> Thomas
>>
>>> Hong
>>>
>>> My code makes use of the function MatTransposeMatMult, and usually it works
>>> fine! For some larger input data, it now stops with a lot of MPI errors:
>>>
>>> fatal error in PMPI_Barrier: Other MPI error, error stack:
>>> PMPI_Barrier(476)..: MPI_Barrier(comm=0x84000001) failed
>>> MPIR_Barrier(82)...:
>>> MPI_Waitall(261): MPI_Waitall(count=9, req_array=0xa787ba0, status_array=0xa789240) failed
>>> MPI_Waitall(113): The supplied request in array element 8 was invalid (kind=0)
>>> Fatal error in PMPI_Barrier: Other MPI error, error stack:
>>> PMPI_Barrier(476)..: MPI_Barrier(comm=0x84000001) failed
>>> MPIR_Barrier(82)...:
>>> mpid_irecv_done(98): read from socket failed - request state:recv(pde)done
>>>
>>>
>>> Here is the stack print from the debugger:
>>>
>>> 6, MatTransposeMatMult (matrix.c:8907)
>>> 6, MatTransposeMatMult_MPIAIJ_MPIAIJ (mpimatmatmult.c:809)
>>> 6, MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ (mpimatmatmult.c:1136)
>>> 6, PetscGatherMessageLengths2 (mpimesg.c:213)
>>> 6, PMPI_Waitall
>>> 6, MPIR_Err_return_comm
>>> 6, MPID_Abort
>>>
>>>
>>> I use PETSc 3.3-p3. Any idea whether this is or could be related to some
>>> bug in PETSc or whether I am using the function incorrectly?
>>>
>>> Thomas
>>>
>>>
>>
>>
>
>
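
For reference, the large-count scenario Barry asks about (a count that fits in an int as MPI_DOUBLE elements but not once it is converted to bytes) can be sketched in plain C without MPI. This is only an illustration of the arithmetic, under the assumption that the byte length is held in an int somewhere inside the implementation. The helper names below are hypothetical and are not PETSc or MPI functions, and this is not how MPIULong_Send() is implemented. Note also that Jed's answer in this thread is "No": the overflow is not the cause of the failure reported here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper (not PETSc/MPI): returns 1 if `count` elements of
   `elem_size` bytes give a byte length that no longer fits in an int,
   even though `count` itself does; returns 0 otherwise. */
static int byte_length_overflows_int(int count, size_t elem_size)
{
  assert(count >= 0 && elem_size > 0);
  /* Multiply in unsigned long long so the product itself cannot wrap
     for the sizes considered here. */
  unsigned long long bytes = (unsigned long long)count * (unsigned long long)elem_size;
  return bytes > (unsigned long long)INT_MAX;
}

/* Hypothetical helper: largest element count per message whose byte
   length still fits in an int. */
static int max_safe_count(size_t elem_size)
{
  assert(elem_size > 0 && elem_size <= (size_t)INT_MAX);
  return (int)((size_t)INT_MAX / elem_size);
}

/* Hypothetical helper: number of messages needed to move `total`
   elements when each message carries at most max_safe_count() elements. */
static unsigned long long chunks_needed(unsigned long long total, size_t elem_size)
{
  unsigned long long per = (unsigned long long)max_safe_count(elem_size);
  return (total + per - 1) / per; /* ceiling division */
}
```

With 8-byte elements, `max_safe_count(8)` is `INT_MAX / 8`; one element more than that already makes `byte_length_overflows_int()` report an overflow, and `chunks_needed()` then says the data would have to go out in two pieces. Splitting into pieces is one possible workaround, named here as an assumption rather than as what PETSc does.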
