Look at the timing. The symbolic factorization takes 1e-4 seconds and the numeric factorization only 10s, out of 542s total. MatSolve is taking 517s; if you have a performance problem, it is likely there. However, the MatSolve time looks balanced across processes.
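In case it helps, here is a minimal sketch (assuming the PETSc/SLEPc 3.0-era C API, untested against your build) of how the st_ solver shown in the -eps_view output below can be selected from code rather than from the command line. The helper name is illustrative only, and the solver-package string must name a package PETSc was actually configured with:

/* Sketch (untested, PETSc/SLEPc 3.0-era API): pull out the KSP that the
 * spectral transform uses for the shift-and-invert solves and request an
 * LU preconditioner backed by a particular direct-solver package.  The
 * package name "superlu_dist" matches the run below; any other package
 * would have to be built into PETSc first. */
#include "slepceps.h"

PetscErrorCode ConfigureShiftInvertSolver(EPS eps)
{
  ST             st;
  KSP            ksp;
  PC             pc;
  PetscErrorCode ierr;

  ierr = EPSGetST(eps,&st);CHKERRQ(ierr);            /* spectral transform (sinvert) */
  ierr = STGetKSP(st,&ksp);CHKERRQ(ierr);            /* KSP doing the shifted solves */
  ierr = KSPSetType(ksp,KSPPREONLY);CHKERRQ(ierr);   /* direct solve only, as in the log */
  ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
  ierr = PCSetType(pc,PCLU);CHKERRQ(ierr);
  ierr = PCFactorSetMatSolverPackage(pc,"superlu_dist");CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);       /* still honor -st_* command-line options */
  return 0;
}

With the st_ options prefix reported in the output, the same choices should also be reachable at run time, e.g. -st_ksp_type preonly -st_pc_type lu, which is what the log reports being used.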
Matt

On Fri, May 8, 2009 at 10:59 AM, Fredrik Bengzon <fredrik.bengzon at math.umu.se> wrote:

> Hi,
> Here is the output from the KSP and EPS objects, and the log summary.
> / Fredrik
>
> Reading Triangle/Tetgen mesh
> #nodes=19345
> #elements=81895
> #nodes per element=4
> Partitioning mesh with METIS 4.0
> Element distribution (rank | #elements)
> 0 | 19771
> 1 | 20954
> 2 | 20611
> 3 | 20559
> rank 1 has 257 ghost nodes
> rank 0 has 127 ghost nodes
> rank 2 has 143 ghost nodes
> rank 3 has 270 ghost nodes
> Calling 3D Navier-Lame Eigenvalue Solver
> Assembling stiffness and mass matrix
> Solving eigensystem with SLEPc
> KSP Object:(st_)
>   type: preonly
>   maximum iterations=100000, initial guess is zero
>   tolerances: relative=1e-08, absolute=1e-50, divergence=10000
>   left preconditioning
> PC Object:(st_)
>   type: lu
>     LU: out-of-place factorization
>       matrix ordering: natural
>     LU: tolerance for zero pivot 1e-12
> EPS Object:
>   problem type: generalized symmetric eigenvalue problem
>   method: krylovschur
>   extraction type: Rayleigh-Ritz
>   selected portion of the spectrum: largest eigenvalues in magnitude
>   number of eigenvalues (nev): 4
>   number of column vectors (ncv): 19
>   maximum dimension of projected problem (mpd): 19
>   maximum number of iterations: 6108
>   tolerance: 1e-05
>   dimension of user-provided deflation space: 0
> IP Object:
>   orthogonalization method: classical Gram-Schmidt
>   orthogonalization refinement: if needed (eta: 0.707100)
> ST Object:
>   type: sinvert
>   shift: 0
>   Matrices A and B have same nonzero pattern
>   Associated KSP object
>   ------------------------------
>   KSP Object:(st_)
>     type: preonly
>     maximum iterations=100000, initial guess is zero
>     tolerances: relative=1e-08, absolute=1e-50, divergence=10000
>     left preconditioning
>   PC Object:(st_)
>     type: lu
>       LU: out-of-place factorization
>         matrix ordering: natural
>       LU: tolerance for zero pivot 1e-12
>       LU: factor fill ratio needed 0
>         Factored matrix follows
>         Matrix Object:
>           type=mpiaij, rows=58035, cols=58035
>           package used to perform factorization: superlu_dist
>           total: nonzeros=0, allocated nonzeros=116070
>             SuperLU_DIST run parameters:
>               Process grid nprow 2 x npcol 2
>               Equilibrate matrix TRUE
>               Matrix input mode 1
>               Replace tiny pivots TRUE
>               Use iterative refinement FALSE
>               Processors in row 2 col partition 2
>               Row permutation LargeDiag
>               Column permutation PARMETIS
>               Parallel symbolic factorization TRUE
>               Repeated factorization SamePattern
>   linear system matrix = precond matrix:
>   Matrix Object:
>     type=mpiaij, rows=58035, cols=58035
>     total: nonzeros=2223621, allocated nonzeros=2233584
>       using I-node (on process 0) routines: found 4695 nodes, limit used is 5
>   ------------------------------
> Number of iterations in the eigensolver: 1
> Number of requested eigenvalues: 4
> Stopping condition: tol=1e-05, maxit=6108
> Number of converged eigenpairs: 8
>
> Writing binary .vtu file /scratch/fredrik/output/mode-0.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-1.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-2.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-3.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-4.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-5.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-6.vtu
> Writing binary .vtu file /scratch/fredrik/output/mode-7.vtu
>
> ************************************************************************************************************************
> ***    WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document    ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> /home/fredrik/Hakan/cmlfet/a.out on a linux-gnu named medusa1 with 4 processors, by fredrik Fri May 8 17:57:28 2009
> Using Petsc Release Version 3.0.0, Patch 5, Mon Apr 13 09:15:37 CDT 2009
>
>                          Max        Max/Min   Avg        Total
> Time (sec):              5.429e+02  1.00001   5.429e+02
> Objects:                 1.380e+02  1.00000   1.380e+02
> Flops:                   1.053e+08  1.05695   1.028e+08  4.114e+08
> Flops/sec:               1.939e+05  1.05696   1.894e+05  7.577e+05
> Memory:                  5.927e+07  1.03224              2.339e+08
> MPI Messages:            2.880e+02  1.51579   2.535e+02  1.014e+03
> MPI Message Lengths:     4.868e+07  1.08170   1.827e+05  1.853e+08
> MPI Reductions:          1.122e+02  1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                        Avg     %Total     Avg     %Total    counts  %Total     Avg       %Total    counts  %Total
>  0:  Main Stage:   5.4292e+02 100.0%  4.1136e+08 100.0%  1.014e+03 100.0%  1.827e+05    100.0%  3.600e+02  80.2%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>       ##########################################################
>       #                                                        #
>       #                       WARNING!!!                       #
>       #                                                        #
>       #   This code was compiled with a debugging option,      #
>       #   To get timing results run config/configure.py        #
>       #   using --with-debugging=no, the performance will      #
>       #   be generally two or three times faster.              #
>       #                                                        #
>       ##########################################################
>
> Event                Count      Time (sec)      Flops                            --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max      Ratio  Max      Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> STSetUp                1 1.0 1.0467e+01 1.0  0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  2  0  0  0  2   2  0  0  0  2     0
> STApply               28 1.0 5.1775e+02 1.0  3.15e+07 1.1 1.7e+02 4.2e+03 2.8e+01 95 30 17  0  6  95 30 17  0  8     0
> EPSSetUp               1 1.0 1.0482e+01 1.0  0.00e+00 0.0 0.0e+00 0.0e+00 4.6e+01  2  0  0  0 10   2  0  0  0 13     0
> EPSSolve               1 1.0 3.7193e+02 1.0  9.59e+07 1.1 3.5e+02 4.2e+03 9.7e+01 69 91 35  1 22  69 91 35  1 27     1
> IPOrthogonalize       19 1.0 3.4406e-01 1.1  6.75e+07 1.1 2.3e+02 4.2e+03 7.6e+01  0 64 22  1 17   0 64 22  1 21   767
> IPInnerProduct       153 1.0 3.1410e-01 1.0  5.63e+07 1.1 2.3e+02 4.2e+03 3.9e+01  0 53 23  1  9   0 53 23  1 11   700
> IPApplyMatrix         39 1.0 2.4903e-01 1.1  4.38e+07 1.1 2.3e+02 4.2e+03 0.0e+00  0 42 23  1  0   0 42 23  1  0   687
> UpdateVectors          1 1.0 4.2958e-03 1.2  4.51e+06 1.1 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  4107
> VecDot                 1 1.0 5.6815e-04 4.7  2.97e+04 1.1 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0   204
> VecNorm                8 1.0 2.5260e-03 3.2  2.38e+05 1.1 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0  2   368
> VecScale              27 1.0 5.9605e-04 1.1  4.01e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2629
> VecCopy               53 1.0 4.0610e-03 1.4  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet                77 1.0 6.2165e-03 1.1  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY               38 1.0 2.7709e-03 1.7  1.13e+06 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  1592
> VecMAXPY              38 1.0 2.5925e-02 1.1  1.13e+07 1.1 0.0e+00 0.0e+00 0.0e+00  0 11  0  0  0   0 11  0  0  0  1701
> VecAssemblyBegin       5 1.0 9.0070e-03 2.3  0.00e+00 0.0 3.6e+01 2.1e+04 1.5e+01  0  0  4  0  3   0  0  4  0  4     0
> VecAssemblyEnd         5 1.0 3.4809e-04 1.1  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin       73 1.0 8.5931e-03 1.5  0.00e+00 0.0 4.6e+02 8.9e+03 0.0e+00  0  0 45  2  0   0  0 45  2  0     0
> VecScatterEnd         73 1.0 2.2542e-02 2.2  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecReduceArith        76 1.0 3.0838e-02 1.1  1.24e+07 1.1 0.0e+00 0.0e+00 0.0e+00  0 12  0  0  0   0 12  0  0  0  1573
> VecReduceComm         38 1.0 4.8040e-02 2.0  0.00e+00 0.0 0.0e+00 0.0e+00 3.8e+01  0  0  0  0  8   0  0  0  0 11     0
> VecNormalize           8 1.0 2.7280e-03 2.8  3.56e+05 1.1 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  2   0  0  0  0  2   511
> MatMult               67 1.0 4.1397e-01 1.1  7.53e+07 1.1 4.0e+02 4.2e+03 0.0e+00  0 71 40  1  0   0 71 40  1  0   710
> MatSolve              28 1.0 5.1757e+02 1.0  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 95  0  0  0  0  95  0  0  0  0     0
> MatLUFactorSym         1 1.0 3.6097e-04 1.1  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum         1 1.0 1.0464e+01 1.0  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> MatAssemblyBegin       9 1.0 3.3842e-01 46.7 0.00e+00 0.0 5.4e+01 6.0e+04 8.0e+00  0  0  5  2  2   0  0  5  2  2     0
> MatAssemblyEnd         9 1.0 2.3042e-01 1.0  0.00e+00 0.0 3.6e+01 9.4e+02 3.1e+01  0  0  4  0  7   0  0  4  0  9     0
> MatGetRow           5206 1.1 3.1164e-03 1.1  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetSubMatrice       5 1.0 8.7580e-01 1.2  0.00e+00 0.0 1.5e+02 1.1e+06 2.5e+01  0  0 15 88  6   0  0 15 88  7     0
> MatZeroEntries         2 1.0 1.0233e-02 1.1  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatView                2 1.0 1.0149e-03 2.0  0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  1     0
> KSPSetup               1 1.0 2.8610e-06 1.5  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve              28 1.0 5.1758e+02 1.0  0.00e+00 0.0 0.0e+00 0.0e+00 2.8e+01 95  0  0  0  6  95  0  0  0  8     0
> PCSetUp                1 1.0 1.0467e+01 1.0  0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  2  0  0  0  2   2  0  0  0  2     0
> PCApply               28 1.0 5.1757e+02 1.0  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 95  0  0  0  0  95  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory     Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
>  Spectral Transform       1            1             536    0
>  Eigenproblem Solver      1            1             824    0
>  Inner product            1            1             428    0
>  Index Set               38           38         1796776    0
>  IS L to G Mapping        1            1           58700    0
>  Vec                     65           65         5458584    0
>  Vec Scatter              9            9            7092    0
>  Application Order        1            1          155232    0
>  Matrix                  17           16        17715680    0
>  Krylov Solver            1            1             832    0
>  Preconditioner           1            1             744    0
>  Viewer                   2            2            1088    0
>
> ========================================================================================================================
> Average time to get PetscTime(): 1.90735e-07
> Average time for MPI_Barrier(): 5.9557e-05
> Average time for zero size MPI_Send(): 2.97427e-05
> #PETSc Option Table entries:
> -log_summary
> -mat_superlu_dist_parsymbfact
> #End o PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
> Configure run at: Wed May 6 15:14:39 2009
> Configure options: --download-superlu_dist=1 --download-parmetis=1 --with-mpi-dir=/usr/lib/mpich --with-shared=0
> -----------------------------------------
> Libraries compiled on Wed May 6 15:14:49 CEST 2009 on medusa1
> Machine characteristics: Linux medusa1 2.6.18-6-amd64 #1 SMP Fri Dec 12 05:49:32 UTC 2008 x86_64 GNU/Linux
> Using PETSc directory: /home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5
> Using PETSc arch: linux-gnu-c-debug
> -----------------------------------------
> Using C compiler: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3
> Using Fortran compiler: /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g
> -----------------------------------------
> Using include paths: -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/include -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include -I/usr/lib/mpich/include
> ------------------------------------------
> Using C linker: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings -Wno-strict-aliasing -g3
> Using Fortran linker: /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g
> Using libraries: -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc -lX11 -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib -lsuperlu_dist_2.3 -llapack -lblas -lparmetis -lmetis -lm -L/usr/lib/mpich/lib -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 -L/usr/lib64 -L/lib64 -ldl -lmpich -lpthread -lrt -lgcc_s -lg2c -lm -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/lib -lm -ldl -lmpich -lpthread -lrt -lgcc_s -ldl
> ------------------------------------------
>
> real    9m10.616s
> user    0m23.921s
> sys     0m6.944s
>
>
> Satish Balay wrote:
>
>> Just a note about scalability: its a function of the hardware as
>> well.. For proper scalability studies - you'll need a true distributed
>> system with fast network [not SMP nodes..]
>>
>> Satish
>>
>> On Fri, 8 May 2009, Fredrik Bengzon wrote:
>>
>>> Hong,
>>> Thank you for the suggestions, but I have looked at the EPS and KSP objects
>>> and I can not find anything wrong. The problem is that it takes longer to
>>> solve with 4 cpus than with 2 so the scalability seems to be absent when
>>> using superlu_dist. I have stored my mass and stiffness matrix in the mpiaij
>>> format and just passed them on to slepc. When using the petsc iterative
>>> krylov solvers i see 100% workload on all processors but when i switch to
>>> superlu_dist only two cpus seem to do the whole work of LU factoring. I
>>> don't want to use the krylov solver though since it might cause slepc not
>>> to converge.
>>> Regards,
>>> Fredrik
>>>
>>> Hong Zhang wrote:
>>>
>>>> Run your code with '-eps_view -ksp_view' for checking
>>>> which methods are used
>>>> and '-log_summary' to see which operations dominate
>>>> the computation.
>>>>
>>>> You can turn on parallel symbolic factorization
>>>> with '-mat_superlu_dist_parsymbfact'.
>>>>
>>>> Unless you use large num of processors, symbolic factorization
>>>> takes ignorable execution time. The numeric
>>>> factorization usually dominates.
>>>>
>>>> Hong
>>>>
>>>> On Fri, 8 May 2009, Fredrik Bengzon wrote:
>>>>
>>>>> Hi Petsc team,
>>>>> Sorry for posting questions not really concerning the petsc core, but when
>>>>> I run superlu_dist from within slepc I notice that the load balance is
>>>>> poor. It is just fine during assembly (I use Metis to partition my finite
>>>>> element mesh) but when calling the slepc solver it dramatically changes. I
>>>>> use superlu_dist as solver for the eigenvalue iteration. My question is:
>>>>> can this have something to do with the fact that the option 'Parallel
>>>>> symbolic factorization' is set to false? If so, can I change the options
>>>>> to superlu_dist using MatSetOption for instance? Also, does this mean that
>>>>> superlu_dist is not using parmetis to reorder the matrix?
>>>>> Best Regards,
>>>>> Fredrik Bengzon
>>>>>
>>>>
>>>
>>
>

-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
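A follow-up on the MatSetOption question in the quoted thread: in the PETSc 3.0-era SuperLU_DIST interface, package-specific settings such as parallel symbolic factorization are read from the options database when the factorization is set up, not through MatSetOption(). A minimal sketch (illustrative helper name, assumed to run before EPSSetUp()/EPSSolve()):

/* Sketch, untested: SuperLU_DIST run-time parameters come from the PETSc
 * options database, so they can be set in code as well as on the command
 * line.  Only the option already mentioned in this thread is used here. */
#include "petsc.h"

PetscErrorCode EnableParallelSymbolicFactorization(void)
{
  PetscErrorCode ierr;
  /* same effect as running with -mat_superlu_dist_parsymbfact */
  ierr = PetscOptionsSetValue("-mat_superlu_dist_parsymbfact",PETSC_NULL);CHKERRQ(ierr);
  return 0;
}

This is the programmatic equivalent of the -mat_superlu_dist_parsymbfact entry in the option table of the log above; the factored-matrix view confirms it took effect (Parallel symbolic factorization TRUE).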
