Can you run with valgrind to determine if there is memory corruption? http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
Also check with Intel for any MPI updates.

You can also try calling MatAssemblyBegin/End(mat,MAT_FLUSH_ASSEMBLY) several times while generating the matrix entries (this will make the messages smaller). Warning: all processes have to call MatAssemblyBegin/End(mat,MAT_FLUSH_ASSEMBLY) the same number of times. If this "solves" the problem then we know it is an issue with the MPI buffers.

  Barry

> On Jan 22, 2015, at 9:17 AM, Antoine De Blois <[email protected]> wrote:
>
> Hi Everyone,
>
> I get a strange error during a call to MatAssemblyBegin. The error message is
> triggered by Intel MPI, as shown below. The error does not always occur,
> which is even more strange.
>
> [333:node1179] unexpected disconnect completion event from [163:node1254]
> Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
>
> All ranks output the same error message with their own node number. I did a
> bit of research and some say that MPICH2 solves this issue. Since our group
> is keen on using Intel MPI, I would like to solve this issue at the root.
>
> A few important points:
> · At the moment, we are assembling the matrix with a single
>   MatAssemblyBegin/End and MAT_FINAL_ASSEMBLY after doing MatSetValuesBlocked.
>   Can it be due to memory overflow in the buffers?
> · We are using -genv I_MPI_FABRICS shm:dapl in the submission script
> · I tried using -malloc_log and -log_summary, but the crash prevents
>   writing the log output
>
> Has anyone of you already faced this issue?
> Any suggestion is welcome,
> Best regards,
> Antoine DeBlois
>
> Antoine DeBlois
> Specialiste ingenierie, MDO lead / Engineering Specialist, MDO lead
> Aéronautique / Aerospace
> 514-855-5001, x 50862
> [email protected]
>
> 2351 Blvd Alfred-Nobel
> Montreal, Qc
> H4S 1A9
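
For reference, a minimal sketch of the MAT_FLUSH_ASSEMBLY suggestion from the reply, assuming each rank fills its part of the matrix by looping over a list of local blocks. The difficulty is that ranks may own different numbers of blocks, yet every rank has to call MatAssemblyBegin/End the same number of times. One way to guarantee that is to fix the number of rounds in advance and split the local work into that many chunks, some of which may be empty. The names NUM_FLUSH_ROUNDS, ChunkRange, chunk_range and assemble_in_rounds are hypothetical and not part of PETSc; the PETSc calls appear only as comments so the sketch compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical round count. It must be identical on every rank, so it is
   a compile-time constant here rather than something derived from the
   local amount of work. */
enum { NUM_FLUSH_ROUNDS = 4 };

/* Half-open range [begin, end) of local blocks handled in one round. */
typedef struct {
    size_t begin;
    size_t end;
} ChunkRange;

/* Split n_local items into n_rounds contiguous chunks whose sizes differ
   by at most one. A rank with fewer items than rounds gets some empty
   chunks, but it still takes part in every round. */
ChunkRange chunk_range(size_t n_local, size_t n_rounds, size_t round)
{
    assert(n_rounds > 0 && round < n_rounds);
    size_t base  = n_local / n_rounds;
    size_t extra = n_local % n_rounds;   /* first `extra` chunks get +1 */
    ChunkRange r;
    r.begin = round * base + (round < extra ? round : extra);
    r.end   = r.begin + base + (round < extra ? 1 : 0);
    return r;
}

/* Visit every local block exactly once, in NUM_FLUSH_ROUNDS rounds.
   Returns the number of blocks visited. */
size_t assemble_in_rounds(size_t n_local)
{
    size_t visited = 0;
    for (size_t round = 0; round < NUM_FLUSH_ROUNDS; ++round) {
        ChunkRange r = chunk_range(n_local, NUM_FLUSH_ROUNDS, round);
        for (size_t i = r.begin; i < r.end; ++i) {
            /* In the real code: MatSetValuesBlocked(mat, ...) for block i. */
            ++visited;
        }
        /* In the real code, every rank reaches this point once per round,
           even if its chunk was empty:
             type = (round + 1 < NUM_FLUSH_ROUNDS) ? MAT_FLUSH_ASSEMBLY
                                                   : MAT_FINAL_ASSEMBLY;
             MatAssemblyBegin(mat, type);
             MatAssemblyEnd(mat, type);                                   */
    }
    return visited;
}
```

Calling assemble_in_rounds(n) with a different n on each rank still produces exactly NUM_FLUSH_ROUNDS assembly calls per rank, which is the condition stated in the warning. Raising the round count makes each message smaller at the cost of more synchronization points.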
