> On Jul 9, 2015, at 2:56 PM, Ganesh Vijayakumar <[email protected]> wrote:
> 
> Thanks. I apologize for any stupid assumptions/mistakes.

  You have not made any stupid assumptions/mistakes.

> One thing that became clear to me was to get the machinery working first and 
> improve later. I respond to each of your questions below.
> 
> On Wed, Jul 8, 2015 at 4:27 PM Barry Smith <[email protected]> wrote:
> 
>    Please provide a bit more detail about the operator. Is it a pressure 
> solver for CFD? Cell centered? Does the matrix have a null space of the 
> constant functions?
> 
> The operator... yes, it's a pressure solver for a PISO/SIMPLE-type algorithm. 
> The underlying code is OpenFOAM. 

Make sure to read all the way down.

OK, I understand.

> 
> Regarding the null space of constant functions, let me verify: do you mean 
> that if Ax = b, then A(x + c) = b for any constant vector c, so the solution 
> is only determined up to an additive constant? The basic equation is a 
> Poisson equation and contains only the second derivative, so if the entire 
> solution were offset by a constant it would still satisfy the equations. 
> However, the problem does have Dirichlet boundary conditions on one boundary, 
> so I'm thinking there is no null space.

 Yes, no null space
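
 For reference, if the system did have the constant null space (e.g., all-Neumann 
 boundaries), a minimal sketch of how it would be registered with the solver, 
 using the ksp from your code below (PETSc 3.5 API):

   MatNullSpace nsp;
   /* PETSC_TRUE: the null space contains the constant vector */
   MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nsp);
   KSPSetNullSpace(ksp, nsp);   /* the KSP projects the constant out during the solve */
   MatNullSpaceDestroy(&nsp);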
>  
> Is it the same linear system for each "time-step" in the CFD solver, or 
> different?
> 
> Different. The structure of the matrix doesn't change every time step (for 
> now), however the values do.
> 
>    How many iterations is hypre BoomerAMG typically taking?
> 
>  Around 5 after I relaxed the tolerances (it used to be around 10-13 when I 
> had the tighter tolerances). I solve the system of equations three times per 
> "time-step", as shown here. 
> 
> Mat Object: 256 MPI processes
>   type: mpiaij
>   rows=12254823, cols=12254823
>   total: nonzeros=7.65502e+07, allocated nonzeros=1.83822e+08
>   total number of mallocs used during MatSetValues calls =0
>     not using I-node (on process 0) routines
> simple.corrNonOrtho() = 1 simple.nNonOrthCorr() = 2
>   1 KSP Residual norm 1.000000000000e+00
> End residual = 0.000816181009378
> Number of iterations = 4
> simple.corrNonOrtho() = 2 simple.nNonOrthCorr() = 2
>   1 KSP Residual norm 1.000000000000e+00
> End residual = 0.00118562245298
> Number of iterations = 3
> simple.corrNonOrtho() = 3 simple.nNonOrthCorr() = 2
>   1 KSP Residual norm 1.000000000000e+00
> End residual = 1.15194716042e-06
> Number of iterations = 5
> 
>    This is an extremely tight tolerance; why do you set it so small? 
> KSPSetTolerances(ksp, 1e-13,  1e-13,  PETSC_DEFAULT,  PETSC_DEFAULT) ;
> 
> Well, that was just a first start... I wanted to make sure it worked. But 
> reducing the required tolerance to 1e-5, 1e-5 also doesn't change the run 
> time by much: 5 "time-steps" now take 250s compared to 260s earlier. As a 
> comparison, I can use OpenFOAM's built-in PCG solver + DIC preconditioner

  What is DIC?

> to do the same in about 13s (thirteen). Sure, the convergence tolerances are 
> not quite the same, but I don't think that's the issue here. I'm trying to do 
> one better than OpenFOAM's existing solver suite; I think a good first step 
> would be to match it. Technically I'm comparing apples and oranges by 
> comparing PCG + DIC vs. FGMRES + AMG, but I was hoping I could do better. 
> I'm open to changing pretty much anything.
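> 
> For reference, the relaxed run corresponds to a call along these lines (a 
> sketch of the same KSPSetTolerances call from my code, just with looser 
> values):
> 
>   KSPSetTolerances(ksp, 1e-5 /* rtol */, 1e-5 /* abstol */, PETSC_DEFAULT, PETSC_DEFAULT);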
> 
>    Send the output from a run with -log_summary so we can see where the time 
> is being spent.
> 
> Thanks for this; the output is below.
> 
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
> ************************************************************************************************************************
> 
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> 
> /scratch/02504/ganesh10/OpenFOAM/ganesh10-2.1.x/platforms/linux64IccDPOpt/bin/SRFSimpleFoamPETSC on a sandybridge named c415-502.stampede.tacc.utexas.edu with 256 processors, by ganesh10 Thu Jul  9 13:29:19 2015
> Using Petsc Release Version 3.5.3, Jan, 31, 2015
> 
>                          Max       Max/Min        Avg      Total
> Time (sec):           2.473e+02      1.00143   2.472e+02
> Objects:              5.210e+02      1.00000   5.210e+02
> Flops:                1.256e+08      1.17291   1.184e+08  3.031e+10
> Flops/sec:            5.086e+05      1.17366   4.790e+05  1.226e+08
> MPI Messages:         3.956e+03     21.50000   1.167e+03  2.986e+05
> MPI Message Lengths:  6.769e+06      4.61934   3.787e+03  1.131e+09
> MPI Reductions:       3.680e+02      1.00000
> 
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flops
>                             and VecAXPY() for complex vectors of length N --> 8N flops
> 
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 2.4720e+02 100.0%  3.0311e+10 100.0%  2.986e+05 100.0%  3.787e+03      100.0%  3.670e+02  99.7%
> 
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
Where is all the data? It should list all the events and the time spent in 
each. Did you use PetscOptionsSetValue() to provide -log_summary? That won't 
work; you need to pass it on the command line, in the PETSC_OPTIONS 
environment variable, or in a file called petscrc.
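
As a minimal sketch of why this matters (the launcher and executable name here 
are illustrative, not from your run): the option has to be visible when PETSc 
starts up, for example

  #include <petsc.h>
  int main(int argc, char **argv)
  {
    /* Picks up -log_summary from argv (e.g. mpirun -np 256 ./app -log_summary),
       from the PETSC_OPTIONS environment variable, or from a petscrc file */
    PetscInitialize(&argc, &argv, NULL, NULL);
    /* ... assemble and solve as usual ... */
    PetscFinalize();   /* the -log_summary table is printed here */
    return 0;
  }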


Well, regardless, it is likely that the matrix is well conditioned enough that 
the large setup time of hypre BoomerAMG is killing performance. How many 
iterations is the PCG plus DIC combination taking?

BTW: you can also run with, for example, -ksp_type cg -pc_type jacobi or 
-ksp_type cg -pc_type bjacobi -sub_pc_type icc to get simpler preconditioners 
with much lower setup times. What do you get for those?
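
For reference, a rough sketch of the equivalent calls in code, assuming the 
hard-coded PCSetType(pc, PCHYPRE)/PCHYPRESetType lines are removed so that 
these calls (or the command-line options) actually take effect:

  KSPSetType(ksp, KSPCG);       /* CG is fine here since the matrix is symmetric */
  PCSetType(pc, PCJACOBI);      /* or PCBJACOBI with -sub_pc_type icc for ICC on each block */
  KSPSetFromOptions(ksp);       /* keep this last so command-line options can still override */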

Barry



> ------------------------------------------------------------------------------------------------------------------------
> 
> Memory usage is given in bytes:
> 
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
> 
> --- Event Stage 0: Main Stage
> 
>               Vector   484            482    184328584     0
>       Vector Scatter     1              1         1060     0
>               Matrix     3              3     10175808     0
>            Index Set     2              2        38884     0
>               Viewer     1              0            0     0
>        Krylov Solver    15             15       284400     0
>       Preconditioner    15             15        16440     0
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> Average time for MPI_Barrier(): 1.64032e-05
> Average time for zero size MPI_Send(): 0.000232594
> #PETSc Option Table entries:
> -info blah
> -log_summary
> -mat_view ::ascii_info
> -parallel
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --with-x=0 -with-pic 
> --with-external-packages-dir=/opt/apps/intel13/mvapich2_1_9/petsc/3.5/externalpackages
>  --with-mpi-compilers=1 --with-mpi-dir=/opt/apps/intel13/mvapich2/1.9 
> --with-scalar-type=real --with-shared-libraries=1 --with-precision=double 
> --with-hypre=1 --download-hypre --with-ml=1 --download-ml --with-ml=1 
> --download-ml --with-superlu_dist=1 --download-superlu_dist --with-superlu=1 
> --download-superlu --with-parmetis=1 --download-parmetis --with-metis=1 
> --download-metis --with-spai=1 --download-spai --with-mumps=1 
> --download-mumps --with-parmetis=1 --download-parmetis --with-metis=1 
> --download-metis --with-scalapack=1 --download-scalapack --with-blacs=1 
> --download-blacs --with-spooles=1 --download-spooles --with-hdf5=1 
> --with-hdf5-dir=/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9 
> --with-debugging=no 
> --with-blas-lapack-dir=/opt/apps/intel/13/composer_xe_2013.2.146/mkl 
> --with-mpiexec=mpirun_rsh --COPTFLAGS= --FOPTFLAGS= --CXXOPTFLAGS=
> -----------------------------------------
> Libraries compiled on Thu Apr  2 10:06:57 2015 on staff.stampede.tacc.utexas.edu
> Machine characteristics: Linux-2.6.32-431.17.1.el6.x86_64-x86_64-with-centos-6.6-Final
> Using PETSc directory: /opt/apps/intel13/mvapich2_1_9/petsc/3.5
> Using PETSc arch: sandybridge
> -----------------------------------------
> 
> Using C compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpicc  -fPIC -wd1572   ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpif90  -fPIC    ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
> 
> Using include paths: 
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include 
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include 
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include 
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include 
> -I/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/include 
> -I/opt/apps/intel13/mvapich2/1.9/include
> 
> 
> ganesh
>  
> 
>    Barry
> 
> 
> 
> > On Jul 8, 2015, at 11:11 AM, Ganesh Vijayakumar <[email protected]> 
> > wrote:
> >
> > Hello,
> >
> >  First of all... thanks to the PETSc developers and everyone else 
> > contributing supporting material on the web.
> >
> >  I need to solve a system of equations Ax = b, where A is symmetric, sparse, 
> > and unstructured, in parallel, as part of a finite volume solver for CFD. 
> > The decomposition is already done. I initialize the matrix as MPIAIJ even 
> > though it's symmetric, as that's the only thing I could get to work.
> >
> > MatCreate(PETSC_COMM_WORLD,&A);
> > MatSetType(A,MATMPIAIJ);
> > MatSetSizes(A,nCellsCurProc,nCellsCurProc,nTotalCells,nTotalCells);
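> > // Preallocate: at most ~10 nonzeros per row in the diagonal block, ~5 in the off-diagonal block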
> > MatMPIAIJSetPreallocation(A,10,PETSC_NULL,5,PETSC_NULL);
> > MatSetUp(A) ;
> > // Assemble matrix using MatSetValue.. assemble full matrix even though 
> > it's symmetric.
> > MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY) ;
> > MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY) ;
> > //Create solver
> > KSPCreate(PETSC_COMM_WORLD,&ksp);
> > KSPSetType(ksp,KSPFGMRES);
> > KSPGetPC(ksp, &pc);
> > PCFactorSetShiftType(pc, MAT_SHIFT_POSITIVE_DEFINITE);
> >
> > //PCSetType(pc, PCML); //Trilinos - Multilevel
> >
> > //PCSetType(pc, PCBJACOBI) ; //Block Jacobi (ILU on each block via the option below)
> > //PetscOptionsSetValue("-sub_pc_type", "ilu") ;
> > KSPSetFromOptions(ksp);
> >
> > // PCSetType(pc, PCASM); //Additive Schwarz Methods
> > // PCASMSetOverlap(pc,overlap);
> >
> > PCSetType(pc, PCHYPRE); //HYPRE
> > PCHYPRESetType(pc, "boomeramg");
> >
> > //PCHYPRESetType(pc, "parasails");
> >
> > KSPSetPC(ksp, pc);
> > KSPSetTolerances(ksp, 1e-13,  1e-13,  PETSC_DEFAULT,  PETSC_DEFAULT) ;
> > KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);
> > KSPSetOperators(ksp, A, A);
> > KSPMonitorDefault(ksp, 1, 1, pVoid) ;
> >
> > KSPSolve(ksp,b,x);
> > VecGetArray(x, &getResultArray) ;
> > KSPGetResidualNorm(ksp, &endNorm) ;
> > KSPGetIterationNumber(ksp, &nIterations);
> >
> > KSPDestroy(&ksp);
> >
> >
> >  I know that this works and gives me the correct result. However, BoomerAMG 
> > is proving to be too slow, at least the way I use it at 256 processors with 
> > around 12 million unknowns. I need your advice on the preconditioner. Am I 
> > using it right? Is there anything else I could be doing better?
> >
> > ganesh
> 
