On Sun, 19 Sep 1999, Camm Maguire wrote:

> Greetings! I've found a quite reproducible bug in the above software
> combination. The command
>
>     mpirun -np 16 -O N xdinv
>
> consistently fails with N=2048,nb=16,nr=nc=4 somewhere in the routine
> pdgetri, specifically in the loop from lines 285 to 306. Running with
> the -lamd option to mpirun clears the problem, seeming to implicate lam
> in the failure. The MPI routines report the following error:
>
> MPI_Recv: process in remote group is dead (rank 0, comm 3)
When running the xdlutime test program under lam 6.2b, I had a problem with large matrix sizes. It seemed to be caused by the shared memory segment that the lam processes communicate over being too small. I don't have the xdinv program, but maybe it is the same thing?

Setting these two environment variables fixed xdlutime for me:

    export LAM_MPI_SHMPOOLSIZE=32505856
    export LAM_MPI_SHMMAXALLOC=2097152

I think they need to be set when lamd starts up on all the nodes, which in effect means you will need to put them into your .bash_profile file. You can check by running "ipcs" and seeing whether the size of the shm segment is 16MB or 32MB.
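
For what it's worth, here is a rough sketch of that check as a script. It sets the two variables, prints what they come to in MiB, and lists the shared memory segments if ipcs is installed. The -m flag just restricts ipcs to shared memory; the MiB arithmetic and the fallback message are my own additions for illustration. Running this does not by itself prove lamd picked up the values, since that still depends on the variables being in the environment when lamd starts.

```shell
#!/bin/sh
# Values taken from the message above.
export LAM_MPI_SHMPOOLSIZE=32505856
export LAM_MPI_SHMMAXALLOC=2097152

# Convert bytes to MiB with integer shell arithmetic (1 MiB = 1048576 bytes).
pool_mib=$((LAM_MPI_SHMPOOLSIZE / 1048576))
alloc_mib=$((LAM_MPI_SHMMAXALLOC / 1048576))
echo "pool size: ${pool_mib} MiB, max alloc: ${alloc_mib} MiB"

# List shared memory segments so the segment size can be compared by eye.
# ipcs may be missing or restricted on some systems, so failure is tolerated.
if command -v ipcs >/dev/null 2>&1; then
    ipcs -m || true
else
    echo "ipcs not found on this system"
fi
```

The pool size works out to 31 MiB, so a segment close to 32MB in the ipcs listing would suggest the larger setting is in effect.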

