Hi,

I am getting errors from a code that uses PETSc and SLEPc to diagonalise 
matrices in parallel. The code has been working fine on many machines but is 
giving problems on a Cray XT4. The matrix is stored in the PETSc sparse matrix 
type MPIAIJ and then diagonalised iteratively with the SLEPc Krylov-Schur 
solver. Within a single run the dimension of the matrices can vary wildly, from 
tens or hundreds of rows to hundreds of millions of rows. Even though the 
smaller matrices could easily be computed on a single core, I want to be able 
to perform all calculations in a single run. 

When running on thousands of processors, SLEPc does not like it when there are 
more cores than rows in the matrix. To overcome this I create a new 
communicator with a sensible number of cores before each diagonalisation and 
free it afterwards. 

When running four processors on four nodes of the Cray XT4, a single 
diagonalisation of a matrix of dimension 4096 works without any problem. A 
single diagonalisation of a matrix of dimension 16.7 million also completes 
correctly, but the following errors are produced afterwards: 

Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
MPI_Attr_delete(86).: Invalid communicator
aborting job:
Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
MPI_Attr_delete(86).: Invalid communicator
aborting job:
Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
MPI_Attr_delete(86).: Invalid communicator
aborting job:
Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
MPI_Attr_delete(86).: Invalid communicator

In both cases the new communicator is made up of all four processors. The only 
function that appears to call MPI_Attr_delete is MatDestroy, which is called 
after the diagonalisation. The structure of the code and the order of the 
relevant calls are as follows: 

SlepcInitialize(&argc,&argv,(char*)0,help); //SLEPc initialisation, which in turn calls the PETSc initialisation routine.
loop over diagonalisations to be performed { 
Mat A; //matrix data structure

//code to create new communicator
MPI_Comm  comm_world = PETSC_COMM_WORLD;
MPI_Comm new_comm;
MPI_Group processes_being_used;
MPI_Group global_group;
int number_relevant_ranks = 0;
int *relevant_ranks;
//code to determine number and allocate and populate relevant_ranks array.
MPI_Comm_group(comm_world,&global_group);
MPI_Group_incl(global_group,number_relevant_ranks,relevant_ranks,&processes_being_used);
MPI_Comm_create(comm_world,processes_being_used,&new_comm);
MPI_Group_free(&processes_being_used);
MPI_Group_free(&global_group);

//code to create and populate matrix
ierr = MatCreate(new_comm,&A);CHKERRQ(ierr);
ierr = MatSetSizes(A, local_rows, PETSC_DECIDE, global_rows, global_columns);CHKERRQ(ierr);
ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
ierr = MatMPIAIJSetPreallocation(A,1,d_nnz,0,o_nnz);CHKERRQ(ierr);
ierr = MatSetValues(A, 1, &row_idx, val_count, cols, values, ADD_VALUES);CHKERRQ(ierr);
ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

//code to create and run eigensolver
EPS eps;
ierr = EPSCreate(new_comm,&eps);CHKERRQ(ierr);
ierr = EPSSetOperators(eps,A,PETSC_NULL);CHKERRQ(ierr); //tell the solver that A is the operator.
ierr = EPSSetProblemType(eps, EPS_HEP);CHKERRQ(ierr); //specify that this is a Hermitian eigenproblem.
ierr = EPSSolve(eps);CHKERRQ(ierr); //run the eigenproblem solver.
ierr = EPSGetConverged(eps,&eigenvalues_converged);CHKERRQ(ierr);
// retrieve eigenvalues and vectors.

//cleanup state
ierr = EPSDestroy(eps);CHKERRQ(ierr);
ierr = MatDestroy(A);CHKERRQ(ierr);
MPI_Comm_free(&new_comm);  //free the communicator
}
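
The step elided in the listing as "determine number and allocate and populate 
relevant_ranks array" could look roughly like the sketch below. This is only an 
illustration under stated assumptions: the helper names choose_rank_count and 
build_relevant_ranks, and the idea of requiring a minimum number of rows per 
rank, are hypothetical and are not taken from the actual code. It uses plain C 
with no MPI calls, so it only shows the selection logic. 

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: decide how many ranks take part in one
 * diagonalisation. Each participating rank should own at least
 * min_rows_per_rank rows; the result is clamped to [1, world_size]. */
static int choose_rank_count(long long global_rows, int world_size,
                             long long min_rows_per_rank)
{
    assert(global_rows > 0 && world_size > 0 && min_rows_per_rank > 0);

    long long wanted = global_rows / min_rows_per_rank;
    if (wanted < 1)
        wanted = 1;                 /* always keep at least one rank */
    if (wanted > world_size)
        wanted = world_size;        /* never more ranks than exist */
    return (int)wanted;
}

/* Hypothetical helper: allocate and fill the rank list 0..count-1, in the
 * form expected by MPI_Group_incl. The caller frees the array.
 * Returns NULL if the allocation fails. */
static int *build_relevant_ranks(int count)
{
    assert(count > 0);

    int *ranks = malloc((size_t)count * sizeof *ranks);
    if (ranks == NULL)
        return NULL;
    for (int i = 0; i < count; ++i)
        ranks[i] = i;               /* use the lowest-numbered ranks */
    return ranks;
}
```

With an assumed threshold of 64 rows per rank, both cases described above 
(dimension 4096 and dimension 16.7 million on four processors) would select all 
four ranks, which matches the statement that the new communicator contains all 
four processors in both runs. 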

I have tried inserting a barrier between MatDestroy and MPI_Comm_free, to no 
avail. I have also added a check to ensure the communicator is not null before 
calling MatDestroy: 
if ( new_comm != MPI_COMM_NULL ) ... 

At this stage I am unsure how best to proceed. I have been considering adding a 
macro that reverts to using PETSC_COMM_WORLD for everything. However, the fact 
that the smaller size works and the larger one does not is confusing me. I have 
also considered memory errors. I do not have direct access to this machine and 
am not sure which debugging or memory-checking tools can be used on it. Any 
suggestions or ideas are appreciated. 

Regards,

Niall. 

