Hi,
I am getting some errors from a code that uses PETSc and SLEPc to diagonalise
matrices in parallel. The code has been working fine on many machines but is
giving problems on a Cray XT4 machine. The PETSc sparse matrix type MPIAIJ is
used to store the matrix and then the SLEPc Krylov-Schur solver is used to
iteratively diagonalise. For each run the dimension of the matrices
diagonalised can vary wildly from tens or hundreds of rows to hundreds of
millions of rows. Even though the smaller matrices can be computed easily on a
single core I wanted to be able to perform all calculations from a single run.
When running on thousands of processors SLEPc does not like it when you have
more cores than rows in the matrix. To overcome this I create a new
communicator with a sensible amount of cores before each diagonalisation and
free it afterwards. When running four processors on four nodes of the cray XT4
machine for the case of a single diagonalisation of a matrix of dimension 4096
everything works however for a case with a single diagonalisation of a matrix
with dimension 16.7 million the diagonalisation works correctly but the
following errors are produced afterwards.
Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567)
failed
MPI_Attr_delete(86).: Invalid communicator
aborting job:
Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567)
failed
MPI_Attr_delete(86).: Invalid communicator
aborting job:
Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567)
failed
MPI_Attr_delete(86).: Invalid communicator
aborting job:
Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567)
failed
MPI_Attr_delete(86).: Invalid communicator
In both cases the new communicator created will be made up of all four
processors. The only function that calls MPI_Attr_delete seems to be the
MatDestroy function which is called after the diagonalisation.
The structure of the code and order of relevant calls is as follows:
SlepcInitialize(&argc,&argv,(char*)0,help); //SLEPc initialisation which in
turn calls PETSc initialisation routine.
loop over diagonalisations to be performed {
Mat A; //matrix data structure
//code to create new communicator
MPI_Comm comm_world = PETSC_COMM_WORLD;
MPI_Comm new_comm;
MPI_Group processes_being_used;
MPI_Group global_group;
int number_relevant_ranks = 0;
int *relevant_ranks;
//code to determine number and allocate and populate relevant_ranks array.
MPI_Comm_group(comm_world,&global_group);
MPI_Group_incl(global_group,number_relevant_ranks,relevant_ranks,&processes_being_used);
MPI_Comm_create(comm_world,processes_being_used,&new_comm);
MPI_Group_free(&processes_being_used);
MPI_Group_free(&global_group);
//code to create and populate matrix
ierr = MatCreate(new_comm,&A);CHKERRQ(ierr);
ierr = MatSetSizes(A, local_rows, PETSC_DECIDE, global_rows
,global_columns);CHKERRQ(ierr);
ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
MatMPIAIJSetPreallocation(A,1,d_nnz ,0, o_nnz );CHKERRQ(ierr);
ierr = MatSetValues(A, 1, &row_idx, val_count, cols, values, ADD_VALUES);
CHKERRQ(ierr);
ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
//code to create and run eigensolver
EPS eps;
ierr = EPSCreate(new_comm,&eps);CHKERRQ(ierr);
ierr = EPSSetOperators(eps,A,PETSC_NULL);CHKERRQ(ierr); //tell solver that A is
the operator.
ierr = EPSSetProblemType(eps, EPS_HEP);CHKERRQ(ierr); //specify that this is a
hermitian eigenproblem.
ierr = EPSSolve(eps);CHKERRQ(ierr); // run the eigen problem solver.
ierr = EPSGetConverged(eps,&eigenvalues_converged);CHKERRQ(ierr);
// retrieve eigenvalues and vectors.
//cleanup state
ierr = EPSDestroy(eps);CHKERRQ(ierr);
ierr = MatDestroy(A);CHKERRQ(ierr);
MPI_Comm_free(&new_comm); //free the communicator
}
I have tried inserting a barrier between the MatDestroy and MPI_Comm_free to no
avail and also added check to ensure the communicator is not null before
calling MatDestroy.
if ( new_comm != MPI_COMM_NULL ) ...
At this stage I am confused as to how best to proceed. I have been considering
adding a MACRO that will revert back to using PETSC_COMM_WORLD for everything.
However the fact that the smaller size is working and not the larger one is
confusing me. I have also considered memory errors. I do not have direct access
to this machine and not sure how many debugging or memory checking tools can be
used. Any suggestions or ideas are appreciated.
Regards,
Niall.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20100729/d97c881f/attachment.htm>