On 29/07/2010, Niall Moran wrote:
> Hi,
>
> I am getting some errors from a code that uses PETSc and SLEPc to diagonalise
> matrices in parallel. The code has been working fine on many machines but is
> giving problems on a Cray XT4 machine. The PETSc sparse matrix type MPIAIJ is
> used to store the matrix and then the SLEPc Krylov-Schur solver is used to
> iteratively diagonalise. For each run the dimension of the matrices
> diagonalised can vary wildly from tens or hundreds of rows to hundreds of
> millions of rows. Even though the smaller matrices can be computed easily on
> a single core I wanted to be able to perform all calculations from a single
> run. When running on thousands of processors SLEPc does not like it when you
> have more cores than rows in the matrix.
In slepc-dev I have made a fix for the case when the number of rows assigned to
one of the processes is zero. In slepc-3.0.0 I don't see this problem.
Jose
> To overcome this I create a new communicator with a sensible amount of cores
> before each diagonalisation and free it afterwards. When running four
> processors on four nodes of the cray XT4 machine for the case of a single
> diagonalisation of a matrix of dimension 4096 everything works however for a
> case with a single diagonalisation of a matrix with dimension 16.7 million
> the diagonalisation works correctly but the following errors are produced
> afterwards.
>
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567)
> failed
> MPI_Attr_delete(86).: Invalid communicator
> aborting job:
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567)
> failed
> MPI_Attr_delete(86).: Invalid communicator
> aborting job:
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567)
> failed
> MPI_Attr_delete(86).: Invalid communicator
> aborting job:
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567)
> failed
> MPI_Attr_delete(86).: Invalid communicator
>
> In both cases the new communicator created will be made up of all four
> processors. The only function that calls MPI_Attr_delete seems to be the
> MatDestroy function which is called after the diagonalisation.
> The structure of the code and order of relevant calls is as follows:
>
> SlepcInitialize(&argc,&argv,(char*)0,help); //SLEPc initialisation which in
> turn calls PETSc initialisation routine.
> loop over diagonalisations to be performed {
> Mat A; //matrix data structure
>
> //code to create new communicator
> MPI_Comm comm_world = PETSC_COMM_WORLD;
> MPI_Comm new_comm;
> MPI_Group processes_being_used;
> MPI_Group global_group;
> int number_relevant_ranks = 0;
> int *relevant_ranks;
> //code to determine number and allocate and populate relevant_ranks array.
> MPI_Comm_group(comm_world,&global_group);
> MPI_Group_incl(global_group,number_relevant_ranks,relevant_ranks,&processes_being_used);
> MPI_Comm_create(comm_world,processes_being_used,&new_comm);
> MPI_Group_free(&processes_being_used);
> MPI_Group_free(&global_group);
>
> //code to create and populate matrix
> ierr = MatCreate(new_comm,&A);CHKERRQ(ierr);
> ierr = MatSetSizes(A, local_rows, PETSC_DECIDE, global_rows
> ,global_columns);CHKERRQ(ierr);
> ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
> MatMPIAIJSetPreallocation(A,1,d_nnz ,0, o_nnz );CHKERRQ(ierr);
> ierr = MatSetValues(A, 1, &row_idx, val_count, cols, values, ADD_VALUES);
> CHKERRQ(ierr);
> ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
> ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
>
> //code to create and run eigensolver
> EPS eps;
> ierr = EPSCreate(new_comm,&eps);CHKERRQ(ierr);
> ierr = EPSSetOperators(eps,A,PETSC_NULL);CHKERRQ(ierr); //tell solver that A
> is the operator.
> ierr = EPSSetProblemType(eps, EPS_HEP);CHKERRQ(ierr); //specify that this is
> a hermitian eigenproblem.
> ierr = EPSSolve(eps);CHKERRQ(ierr); // run the eigen problem solver.
> ierr = EPSGetConverged(eps,&eigenvalues_converged);CHKERRQ(ierr);
> // retrieve eigenvalues and vectors.
>
> //cleanup state
> ierr = EPSDestroy(eps);CHKERRQ(ierr);
> ierr = MatDestroy(A);CHKERRQ(ierr);
> MPI_Comm_free(&new_comm); //free the communicator
> }
>
> I have tried inserting a barrier between the MatDestroy and MPI_Comm_free to
> no avail and also added check to ensure the communicator is not null before
> calling MatDestroy.
> if ( new_comm != MPI_COMM_NULL ) ...
>
> At this stage I am confused as to how best to proceed. I have been
> considering adding a MACRO that will revert back to using PETSC_COMM_WORLD
> for everything. However the fact that the smaller size is working and not the
> larger one is confusing me. I have also considered memory errors. I do not
> have direct access to this machine and not sure how many debugging or memory
> checking tools can be used. Any suggestions or ideas are appreciated.
>
> Regards,
>
> Niall.
>
>