Thanks, and I apologize for any naive assumptions or mistakes. One thing that became clear to me is that I should get the machinery working first and improve it later. I respond to each of your questions inline below.
On Wed, Jul 8, 2015 at 4:27 PM Barry Smith <[email protected]> wrote:

>    Please provide a bit more detail about the operator. Is it a pressure
> solver for CFD? Cell centered? Does the matrix have a null space of the
> constant functions?

Yes, the operator is a pressure solver for a PISO/SIMPLE type algorithm; the underlying code is OpenFOAM. Regarding the null space of constant functions, let me make sure I understand: you mean the case where A applied to a constant vector gives zero, so that if Ax = b then A(x + c) = b for any constant offset c, right? The basic equation is a Poisson equation and contains only second derivatives, so offsetting the entire solution by a constant would still satisfy it. However, the problem does have Dirichlet boundary conditions on one boundary, so I'm thinking there is no null space.

>    Is it the same linear system for each "time-step" in the CFD solver or
> different?

Different. The structure of the matrix doesn't change every time step (for now), but the values do.

>    How many iterations is hypre BoomerAMG typically taking?

Around 5 after I relaxed the tolerances (it used to be around 10-13 with the tighter tolerances). I solve the system of equations 3 times per "time-step", like this:

Mat Object: 256 MPI processes
  type: mpiaij
  rows=12254823, cols=12254823
  total: nonzeros=7.65502e+07, allocated nonzeros=1.83822e+08
  total number of mallocs used during MatSetValues calls =0
    not using I-node (on process 0) routines
simple.corrNonOrtho() = 1 simple.nNonOrthCorr() = 2
1 KSP Residual norm 1.000000000000e+00
End residual = 0.000816181009378
Number of iterations = 4
simple.corrNonOrtho() = 2 simple.nNonOrthCorr() = 2
1 KSP Residual norm 1.000000000000e+00
End residual = 0.00118562245298
Number of iterations = 3
simple.corrNonOrtho() = 3 simple.nNonOrthCorr() = 2
1 KSP Residual norm 1.000000000000e+00
End residual = 1.15194716042e-06
Number of iterations = 5

>    This is an extremely tight tolerance, why do you set it so small?
>
>    KSPSetTolerances(ksp, 1e-13, 1e-13, PETSC_DEFAULT, PETSC_DEFAULT);

Well, that was just a first attempt; I wanted to make sure it worked. But relaxing the required tolerances to 1e-5, 1e-5 doesn't change the run time by much: 5 "time-steps" now take 250 s compared to 260 s earlier. As a comparison, I can use OpenFOAM's built-in PCG solver + DIC preconditioner to do the same in about 13 s (thirteen). Sure, the convergence tolerances are not exactly the same, but I don't think that's the issue here. I'm trying to do one better than OpenFOAM's existing solver suite, and I think a good first step would be to match it. Technically I'm comparing apples and oranges (PCG + DIC vs. FGMRES + AMG), but I was hoping I could do better. I'm open to changing pretty much anything; one thing I'm considering is sketched below.
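For instance, to make the comparison more apples-to-apples I might switch the PETSc side to CG (the matrix is symmetric) with block Jacobi and ICC(0) on each block, as a rough stand-in for OpenFOAM's PCG + DIC. This is only an untested sketch; treating ICC(0) as the analogue of DIC is my own assumption.

  // Sketch only (untested): a closer analogue of OpenFOAM's PCG + DIC.
  // CG is valid because A is symmetric; ICC(0) inside block Jacobi is a
  // stand-in for DIC, which is an assumption on my part.
  KSPSetType(ksp, KSPCG);
  KSPGetPC(ksp, &pc);
  PCSetType(pc, PCBJACOBI);
  PetscOptionsSetValue("-sub_pc_type", "icc");   // ICC(0) on each diagonal block
  KSPSetFromOptions(ksp);
  KSPSetTolerances(ksp, 1e-5, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT);
  KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);

If even that can't get near the 13 s, it would suggest the gap isn't just the choice of Krylov method and preconditioner.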
>    Send the output from a run with -log_summary so we can see where the
> time is being spent.

Thanks for this. The output is below.

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

/scratch/02504/ganesh10/OpenFOAM/ganesh10-2.1.x/platforms/linux64IccDPOpt/bin/SRFSimpleFoamPETSC on a sandybridge named c415-502.stampede.tacc.utexas.edu with 256 processors, by ganesh10 Thu Jul  9 13:29:19 2015
Using Petsc Release Version 3.5.3, Jan, 31, 2015

                         Max       Max/Min        Avg      Total
Time (sec):           2.473e+02      1.00143   2.472e+02
Objects:              5.210e+02      1.00000   5.210e+02
Flops:                1.256e+08      1.17291   1.184e+08  3.031e+10
Flops/sec:            5.086e+05      1.17366   4.790e+05  1.226e+08
MPI Messages:         3.956e+03     21.50000   1.167e+03  2.986e+05
MPI Message Lengths:  6.769e+06      4.61934   3.787e+03  1.131e+09
MPI Reductions:       3.680e+02      1.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 2.4720e+02 100.0%  3.0311e+10 100.0%  2.986e+05 100.0%  3.787e+03      100.0%  3.670e+02  99.7%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage

              Vector   484            482     184328584     0
      Vector Scatter     1              1          1060     0
              Matrix     3              3      10175808     0
           Index Set     2              2         38884     0
              Viewer     1              0             0     0
       Krylov Solver    15             15        284400     0
      Preconditioner    15             15         16440     0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
Average time for MPI_Barrier(): 1.64032e-05
Average time for zero size MPI_Send(): 0.000232594
#PETSc Option Table entries:
-info blah
-log_summary
-mat_view ::ascii_info
-parallel
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-x=0 -with-pic --with-external-packages-dir=/opt/apps/intel13/mvapich2_1_9/petsc/3.5/externalpackages --with-mpi-compilers=1 --with-mpi-dir=/opt/apps/intel13/mvapich2/1.9 --with-scalar-type=real --with-shared-libraries=1 --with-precision=double --with-hypre=1 --download-hypre --with-ml=1 --download-ml --with-ml=1 --download-ml --with-superlu_dist=1 --download-superlu_dist --with-superlu=1 --download-superlu --with-parmetis=1 --download-parmetis --with-metis=1 --download-metis --with-spai=1 --download-spai --with-mumps=1 --download-mumps --with-parmetis=1 --download-parmetis --with-metis=1 --download-metis --with-scalapack=1 --download-scalapack --with-blacs=1 --download-blacs --with-spooles=1 --download-spooles --with-hdf5=1 --with-hdf5-dir=/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9 --with-debugging=no --with-blas-lapack-dir=/opt/apps/intel/13/composer_xe_2013.2.146/mkl --with-mpiexec=mpirun_rsh --COPTFLAGS= --FOPTFLAGS= --CXXOPTFLAGS=
-----------------------------------------
Libraries compiled on Thu Apr 2 10:06:57 2015 on staff.stampede.tacc.utexas.edu
Machine characteristics: Linux-2.6.32-431.17.1.el6.x86_64-x86_64-with-centos-6.6-Final
Using PETSc directory: /opt/apps/intel13/mvapich2_1_9/petsc/3.5
Using PETSc arch: sandybridge
-----------------------------------------
Using C compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpicc -fPIC -wd1572 ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpif90 -fPIC ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------
Using include paths: -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include -I/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/include -I/opt/apps/intel13/mvapich2/1.9/include

ganesh

>
>    Barry
>
> > On Jul 8, 2015, at 11:11 AM, Ganesh Vijayakumar <[email protected]> wrote:
> >
> > Hello,
> >
> > First of all.. thanks to the PETSc developers and everyone else contributing supporting material on the web.
> >
> > I need to solve a system of equations Ax = b, where A is symmetric, sparse, and unstructured, in parallel, as part of a finite volume solver for CFD. The decomposition is already done. I initialize the matrix as MPIAIJ even though it's symmetric, as that's the only thing I could get to work.
> >
> > MatCreate(PETSC_COMM_WORLD, &A);
> > MatSetType(A, MATMPIAIJ);
> > MatSetSizes(A, nCellsCurProc, nCellsCurProc, nTotalCells, nTotalCells);
> > MatMPIAIJSetPreallocation(A, 10, PETSC_NULL, 5, PETSC_NULL);
> > MatSetUp(A);
> > // Assemble matrix using MatSetValue.. assemble the full matrix even though it's symmetric.
> > MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
> > MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
> >
> > // Create solver
> > KSPCreate(PETSC_COMM_WORLD, &ksp);
> > KSPSetType(ksp, KSPFGMRES);
> > KSPGetPC(ksp, &pc);
> > PCFactorSetShiftType(pc, MAT_SHIFT_POSITIVE_DEFINITE);
> >
> > //PCSetType(pc, PCML);                 // Trilinos ML - multilevel
> >
> > //PCSetType(pc, PCBJACOBI);            // block Jacobi with incomplete LU
> > //PetscOptionsSetValue("-sub_pc_type", "ilu");
> > KSPSetFromOptions(ksp);
> >
> > // PCSetType(pc, PCASM);               // additive Schwarz methods
> > // PCASMSetOverlap(pc, overlap);
> >
> > PCSetType(pc, PCHYPRE);                // hypre
> > PCHYPRESetType(pc, "boomeramg");
> >
> > //PCHYPRESetType(pc, "parasails");
> >
> > KSPSetPC(ksp, pc);
> > KSPSetTolerances(ksp, 1e-13, 1e-13, PETSC_DEFAULT, PETSC_DEFAULT);
> > KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);
> > KSPSetOperators(ksp, A, A);
> > KSPMonitorDefault(ksp, 1, 1, pVoid);
> >
> > KSPSolve(ksp, b, x);
> > VecGetArray(x, &getResultArray);
> > KSPGetResidualNorm(ksp, &endNorm);
> > KSPGetIterationNumber(ksp, &nIterations);
> >
> > KSPDestroy(&ksp);
> >
> > I know that this works and gives me the correct result. However, BoomerAMG is proving to be too slow.. at least the way I use it at 256 processors with around 12 million unknowns. I need your advice on the preconditioner. Am I using it right? Is there anything else I could be doing better?
> >
> > ganesh
> >
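One more thing I plan to try, since the sparsity pattern is fixed and only the values change between the three pressure solves per time-step: keep the Mat and KSP alive across solves instead of creating and destroying them every time, and possibly reuse the BoomerAMG setup for several solves. This is only a sketch of the idea under those assumptions; how often the preconditioner would actually need to be rebuilt is something I'd have to experiment with.

  // Sketch (untested): create once, reuse across solves.
  // Assumes A, b, x keep the same parallel layout and sparsity pattern.
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);
  KSPSetFromOptions(ksp);

  // Then, for every pressure solve in the time loop:
  MatZeroEntries(A);
  // ... refill with MatSetValue(), then MatAssemblyBegin/End as before ...
  KSPSetReusePreconditioner(ksp, PETSC_TRUE);   // skip rebuilding the AMG hierarchy for this solve
  KSPSolve(ksp, b, x);

  // KSPDestroy(&ksp) only once, at the end of the run.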
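And if BoomerAMG stays in the picture, there are a few hypre options I intend to experiment with through the options database. The particular values below are commonly suggested starting points for 3D problems rather than anything verified on this case; treat them as guesses on my part.

  // Sketch: BoomerAMG knobs to experiment with (values are guesses, not tuned).
  PetscOptionsSetValue("-pc_hypre_boomeramg_strong_threshold", "0.5");   // 0.25 is the default; larger is often suggested for 3D
  PetscOptionsSetValue("-pc_hypre_boomeramg_coarsen_type", "HMIS");
  PetscOptionsSetValue("-pc_hypre_boomeramg_interp_type", "ext+i");
  PetscOptionsSetValue("-pc_hypre_boomeramg_agg_nl", "1");               // one level of aggressive coarsening
  // These are only picked up when the PC is configured from the options database,
  // so KSPSetFromOptions() has to run after the PC type is set to hypre/boomeramg.
  KSPSetFromOptions(ksp);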
