On 29/07/2010, Niall Moran wrote:

> Hi,
> 
> I am getting some errors from a code that uses PETSc and SLEPc to diagonalise 
> matrices in parallel. The code has been working fine on many machines but is 
> giving problems on a Cray XT4 machine. The PETSc sparse matrix type MPIAIJ is 
> used to store the matrix and then the SLEPc Krylov-Schur solver is used to 
> iteratively diagonalise. For each run the dimension of the matrices 
> diagonalised can vary wildly from tens or hundreds of rows to hundreds of 
> millions of rows. Even though the smaller matrices can be computed easily on 
> a single core, I wanted to be able to perform all calculations from a single 
> run. When running on thousands of processors, SLEPc runs into problems when 
> there are more cores than rows in the matrix.

In slepc-dev I have made a fix for the case when the number of rows assigned to 
one of the processes is zero. In slepc-3.0.0 I don't see this problem.
Jose
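
To illustrate how a process can end up with zero rows, here is a minimal
sketch in plain C. It assumes the rows are split as evenly as possible, with
the remainder going to the lowest ranks; that layout is an assumption for
illustration, since the quoted program passes its own local_rows. The helper
names local_row_count and ranks_without_rows are hypothetical and are not
part of PETSc or SLEPc.

```c
#include <assert.h>

/* Rows owned by `rank` when `global_rows` rows are spread as evenly as
   possible over `size` ranks (assumed layout: the first
   `global_rows % size` ranks each receive one extra row). */
static long local_row_count(long global_rows, int size, int rank)
{
    assert(size > 0 && rank >= 0 && rank < size && global_rows >= 0);
    return global_rows / size + ((global_rows % size) > rank ? 1 : 0);
}

/* Number of ranks that would own no rows at all under that layout. */
static int ranks_without_rows(long global_rows, int size)
{
    int empty = 0;
    for (int rank = 0; rank < size; ++rank) {
        if (local_row_count(global_rows, size, rank) == 0) {
            ++empty;
        }
    }
    return empty;
}
```

Under this assumed layout, a matrix with 100 rows spread over 1024 ranks
leaves 924 ranks without any rows, while 4096 rows over 4 ranks leaves none.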

> To overcome this I create a new communicator with a sensible number of cores 
> before each diagonalisation and free it afterwards. When running on four 
> processors on four nodes of the Cray XT4 machine, a single diagonalisation 
> of a matrix of dimension 4096 works without problems. However, for a single 
> diagonalisation of a matrix of dimension 16.7 million, the diagonalisation 
> works correctly but the following errors are produced afterwards:
> 
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) 
> failed
> MPI_Attr_delete(86).: Invalid communicator
> aborting job:
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) 
> failed
> MPI_Attr_delete(86).: Invalid communicator
> aborting job:
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) 
> failed
> MPI_Attr_delete(86).: Invalid communicator
> aborting job:
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) 
> failed
> MPI_Attr_delete(86).: Invalid communicator 
> 
> In both cases the new communicator created will be made up of all four 
> processors. The only function that calls MPI_Attr_delete seems to be the 
> MatDestroy function which is called after the diagonalisation. 
> The structure of the code and order of relevant calls is as follows: 
> 
> //SLEPc initialisation, which in turn calls the PETSc initialisation routine.
> SlepcInitialize(&argc,&argv,(char*)0,help);
> loop over diagonalisations to be performed { 
> Mat A; //matrix data structure
> 
> //code to create new communicator
> MPI_Comm  comm_world = PETSC_COMM_WORLD;
> MPI_Comm new_comm;
> MPI_Group processes_being_used;
> MPI_Group global_group;
> int number_relevant_ranks = 0;
> int *relevant_ranks;
> //code to determine number and allocate and populate relevant_ranks array.
> MPI_Comm_group(comm_world,&global_group);
> MPI_Group_incl(global_group,number_relevant_ranks,relevant_ranks,&processes_being_used);
> MPI_Comm_create(comm_world,processes_being_used,&new_comm);
> MPI_Group_free(&processes_being_used);
> MPI_Group_free(&global_group);
> 
> //code to create and populate matrix
> ierr = MatCreate(new_comm,&A);CHKERRQ(ierr);
> ierr = MatSetSizes(A, local_rows, PETSC_DECIDE, global_rows, global_columns);CHKERRQ(ierr);
> ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
> ierr = MatMPIAIJSetPreallocation(A,1,d_nnz,0,o_nnz);CHKERRQ(ierr);
> ierr = MatSetValues(A, 1, &row_idx, val_count, cols, values, ADD_VALUES); 
> CHKERRQ(ierr);
> ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
> ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
> 
> //code to create and run eigensolver
> EPS eps;
> ierr = EPSCreate(new_comm,&eps);CHKERRQ(ierr);
> //tell the solver that A is the operator.
> ierr = EPSSetOperators(eps,A,PETSC_NULL);CHKERRQ(ierr);
> //specify that this is a Hermitian eigenproblem.
> ierr = EPSSetProblemType(eps, EPS_HEP);CHKERRQ(ierr);
> ierr = EPSSolve(eps);CHKERRQ(ierr); // run the eigen problem solver.
> ierr = EPSGetConverged(eps,&eigenvalues_converged);CHKERRQ(ierr);
> // retrieve eigenvalues and vectors.
> 
> //cleanup state
> ierr = EPSDestroy(eps);CHKERRQ(ierr);
> ierr = MatDestroy(A);CHKERRQ(ierr);
> MPI_Comm_free(&new_comm);  //free the communicator
> }
> 
> I have tried inserting a barrier between the MatDestroy and MPI_Comm_free 
> calls, to no avail, and also added a check to ensure the communicator is not 
> null before calling MatDestroy:
> if ( new_comm != MPI_COMM_NULL ) ... 
> 
> At this stage I am confused as to how best to proceed. I have been 
> considering adding a macro that will revert to using PETSC_COMM_WORLD for 
> everything. However, the fact that the smaller size works and the larger one 
> does not is confusing me. I have also considered memory errors. I do not 
> have direct access to this machine and am not sure how many debugging or 
> memory checking tools can be used. Any suggestions or ideas are appreciated. 
> 
> Regards,
> 
> Niall. 
> 
> 
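
The quoted program leaves out the step marked "code to determine number and
allocate and populate relevant_ranks array". One possible way to fill it in
is sketched below in plain C. The policy (at most one rank per
min_rows_per_rank rows, taking the lowest-numbered ranks) and the helper
names choose_rank_count and make_relevant_ranks are assumptions made for
illustration, not the original author's code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical policy: use one rank per `min_rows_per_rank` rows,
   but never fewer than one rank and never more than `world_size`. */
static int choose_rank_count(long global_rows, long min_rows_per_rank,
                             int world_size)
{
    assert(global_rows > 0 && min_rows_per_rank > 0 && world_size > 0);
    long wanted = global_rows / min_rows_per_rank;
    if (wanted < 1) wanted = 1;
    if (wanted > world_size) wanted = world_size;
    return (int)wanted;
}

/* Allocate and fill an array with ranks 0 .. count-1, suitable for passing
   to MPI_Group_incl. The caller frees it. Returns NULL if malloc fails. */
static int *make_relevant_ranks(int count)
{
    assert(count > 0);
    int *ranks = malloc((size_t)count * sizeof *ranks);
    if (ranks == NULL) return NULL;
    for (int i = 0; i < count; ++i) {
        ranks[i] = i;
    }
    return ranks;
}
```

With such a policy, any rank left out of the group receives MPI_COMM_NULL
from MPI_Comm_create. Those ranks have to skip MatCreate, EPSCreate, the
destroy calls and MPI_Comm_free, since passing MPI_COMM_NULL to
MPI_Comm_free is an error.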
