Hi again, I ended up using MUMPS, which seems to scale very well, in SLEPc. However, I have another question: how do you sort an MPI vector in PETSc, and can you also get the permutation?
/Fredrik
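What I have in mind is something like the sketch below (untested, and assuming a real-scalar build so PetscScalar can be treated as PetscReal): gather the distributed Vec onto every process with VecScatterCreateToAll() and then compute the sorting permutation locally with PetscSortRealWithPermutation(). The helper name SortMPIVecAllGather is just something I made up for illustration, and of course the all-gather will not scale to very large vectors, so maybe there is a better way.

#include "petscvec.h"

/* Untested sketch: gather the whole parallel Vec onto each process and
   compute the sorting permutation locally.  Assumes a real-scalar build
   (PetscScalar == PetscReal).  SortMPIVecAllGather is an illustrative
   name, not a PETSc routine. */
PetscErrorCode SortMPIVecAllGather(Vec x,PetscInt *perm[])
{
  Vec            xseq;   /* sequential copy of x, duplicated on every rank */
  VecScatter     ctx;
  PetscScalar    *a;
  PetscInt       i,N;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecGetSize(x,&N);CHKERRQ(ierr);

  /* gather the distributed vector onto every process */
  ierr = VecScatterCreateToAll(x,&ctx,&xseq);CHKERRQ(ierr);
  ierr = VecScatterBegin(ctx,x,xseq,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(ctx,x,xseq,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);

  /* the permutation array must be initialized to the identity */
  ierr = PetscMalloc(N*sizeof(PetscInt),perm);CHKERRQ(ierr);
  for (i=0; i<N; i++) (*perm)[i] = i;

  /* after this call x[(*perm)[i]] is in nondecreasing order; the vector
     entries themselves are left in place */
  ierr = VecGetArray(xseq,&a);CHKERRQ(ierr);
  ierr = PetscSortRealWithPermutation(N,(PetscReal*)a,*perm);CHKERRQ(ierr);
  ierr = VecRestoreArray(xseq,&a);CHKERRQ(ierr);

  ierr = VecScatterDestroy(ctx);CHKERRQ(ierr);
  ierr = VecDestroy(xseq);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}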
Barry Smith wrote: > > On May 8, 2009, at 11:03 AM, Matthew Knepley wrote: > >> Look at the timing. The symbolic factorization takes 1e-4 seconds and >> the numeric takes >> only 10s, out of 542s. MatSolve is taking 517s. If you have a >> problem, it is likely there. >> However, the MatSolve looks balanced. > > Something is funky with this. The 28 solves should not be so much > more than the numeric factorization. > Perhaps it is worth saving the matrix and reporting this as a > performance bug to Sherrie. > > Barry > >> >> >> Matt >> >> On Fri, May 8, 2009 at 10:59 AM, Fredrik Bengzon >> <fredrik.bengzon at math.umu.se> wrote: >> Hi, >> Here is the output from the KSP and EPS objects, and the log summary. >> / Fredrik >> >> >> Reading Triangle/Tetgen mesh >> #nodes=19345 >> #elements=81895 >> #nodes per element=4 >> Partitioning mesh with METIS 4.0 >> Element distribution (rank | #elements) >> 0 | 19771 >> 1 | 20954 >> 2 | 20611 >> 3 | 20559 >> rank 1 has 257 ghost nodes >> rank 0 has 127 ghost nodes >> rank 2 has 143 ghost nodes >> rank 3 has 270 ghost nodes >> Calling 3D Navier-Lame Eigenvalue Solver >> Assembling stiffness and mass matrix >> Solving eigensystem with SLEPc >> KSP Object:(st_) >> type: preonly >> maximum iterations=100000, initial guess is zero >> tolerances: relative=1e-08, absolute=1e-50, divergence=10000 >> left preconditioning >> PC Object:(st_) >> type: lu >> LU: out-of-place factorization >> matrix ordering: natural >> LU: tolerance for zero pivot 1e-12 >> EPS Object: >> problem type: generalized symmetric eigenvalue problem >> method: krylovschur >> extraction type: Rayleigh-Ritz >> selected portion of the spectrum: largest eigenvalues in magnitude >> number of eigenvalues (nev): 4 >> number of column vectors (ncv): 19 >> maximum dimension of projected problem (mpd): 19 >> maximum number of iterations: 6108 >> tolerance: 1e-05 >> dimension of user-provided deflation space: 0 >> IP Object: >> orthogonalization method: classical Gram-Schmidt >> orthogonalization refinement: if needed (eta: 0.707100) >> ST Object: >> type: sinvert >> shift: 0 >> Matrices A and B have same nonzero pattern >> Associated KSP object >> ------------------------------ >> KSP Object:(st_) >> type: preonly >> maximum iterations=100000, initial guess is zero >> tolerances: relative=1e-08, absolute=1e-50, divergence=10000 >> left preconditioning >> PC Object:(st_) >> type: lu >> LU: out-of-place factorization >> matrix ordering: natural >> LU: tolerance for zero pivot 1e-12 >> LU: factor fill ratio needed 0 >> Factored matrix follows >> Matrix Object: >> type=mpiaij, rows=58035, cols=58035 >> package used to perform factorization: superlu_dist >> total: nonzeros=0, allocated nonzeros=116070 >> SuperLU_DIST run parameters: >> Process grid nprow 2 x npcol 2 >> Equilibrate matrix TRUE >> Matrix input mode 1 >> Replace tiny pivots TRUE >> Use iterative refinement FALSE >> Processors in row 2 col partition 2 >> Row permutation LargeDiag >> Column permutation PARMETIS >> Parallel symbolic factorization TRUE >> Repeated factorization SamePattern >> linear system matrix = precond matrix: >> Matrix Object: >> type=mpiaij, rows=58035, cols=58035 >> total: nonzeros=2223621, allocated nonzeros=2233584 >> using I-node (on process 0) routines: found 4695 nodes, >> limit used is 5 >> ------------------------------ >> Number of iterations in the eigensolver: 1 >> Number of requested eigenvalues: 4 >> Stopping condition: tol=1e-05, maxit=6108 >> Number of converged eigenpairs: 8 >> >> Writing binary .vtu file 
/scratch/fredrik/output/mode-0.vtu >> Writing binary .vtu file /scratch/fredrik/output/mode-1.vtu >> Writing binary .vtu file /scratch/fredrik/output/mode-2.vtu >> Writing binary .vtu file /scratch/fredrik/output/mode-3.vtu >> Writing binary .vtu file /scratch/fredrik/output/mode-4.vtu >> Writing binary .vtu file /scratch/fredrik/output/mode-5.vtu >> Writing binary .vtu file /scratch/fredrik/output/mode-6.vtu >> Writing binary .vtu file /scratch/fredrik/output/mode-7.vtu >> ************************************************************************************************************************ >> >> >> *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript >> -r -fCourier9' to print this document *** >> ************************************************************************************************************************ >> >> >> >> ---------------------------------------------- PETSc Performance >> Summary: ---------------------------------------------- >> >> /home/fredrik/Hakan/cmlfet/a.out on a linux-gnu named medusa1 with 4 >> processors, by fredrik Fri May 8 17:57:28 2009 >> Using Petsc Release Version 3.0.0, Patch 5, Mon Apr 13 09:15:37 CDT 2009 >> >> Max Max/Min Avg Total >> Time (sec): 5.429e+02 1.00001 5.429e+02 >> Objects: 1.380e+02 1.00000 1.380e+02 >> Flops: 1.053e+08 1.05695 1.028e+08 4.114e+08 >> Flops/sec: 1.939e+05 1.05696 1.894e+05 7.577e+05 >> Memory: 5.927e+07 1.03224 2.339e+08 >> MPI Messages: 2.880e+02 1.51579 2.535e+02 1.014e+03 >> MPI Message Lengths: 4.868e+07 1.08170 1.827e+05 1.853e+08 >> MPI Reductions: 1.122e+02 1.00000 >> >> Flop counting convention: 1 flop = 1 real number operation of type >> (multiply/divide/add/subtract) >> e.g., VecAXPY() for real vectors of length >> N --> 2N flops >> and VecAXPY() for complex vectors of length >> N --> 8N flops >> >> Summary of Stages: ----- Time ------ ----- Flops ----- --- >> Messages --- -- Message Lengths -- -- Reductions -- >> Avg %Total Avg %Total counts >> %Total Avg %Total counts %Total >> 0: Main Stage: 5.4292e+02 100.0% 4.1136e+08 100.0% 1.014e+03 >> 100.0% 1.827e+05 100.0% 3.600e+02 80.2% >> >> ------------------------------------------------------------------------------------------------------------------------ >> >> >> See the 'Profiling' chapter of the users' manual for details on >> interpreting output. >> Phase summary info: >> Count: number of times phase was executed >> Time and Flops: Max - maximum over all processors >> Ratio - ratio of maximum to minimum over all processors >> Mess: number of messages sent >> Avg. len: average message length >> Reduct: number of global reductions >> Global: entire computation >> Stage: stages of a computation. Set stages with PetscLogStagePush() >> and PetscLogStagePop(). >> %T - percent time in this phase %F - percent flops in >> this phase >> %M - percent messages in this phase %L - percent message >> lengths in this phase >> %R - percent reductions in this phase >> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time >> over all processors) >> ------------------------------------------------------------------------------------------------------------------------ >> >> >> >> >> ########################################################## >> # # >> # WARNING!!! # >> # # >> # This code was compiled with a debugging option, # >> # To get timing results run config/configure.py # >> # using --with-debugging=no, the performance will # >> # be generally two or three times faster. 
# >> # # >> ########################################################## >> >> >> Event Count Time (sec) >> Flops --- Global --- --- Stage --- Total >> Max Ratio Max Ratio Max Ratio Mess Avg >> len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s >> ------------------------------------------------------------------------------------------------------------------------ >> >> >> >> --- Event Stage 0: Main Stage >> >> STSetUp 1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00 >> 0.0e+00 8.0e+00 2 0 0 0 2 2 0 0 0 2 0 >> STApply 28 1.0 5.1775e+02 1.0 3.15e+07 1.1 1.7e+02 >> 4.2e+03 2.8e+01 95 30 17 0 6 95 30 17 0 8 0 >> EPSSetUp 1 1.0 1.0482e+01 1.0 0.00e+00 0.0 0.0e+00 >> 0.0e+00 4.6e+01 2 0 0 0 10 2 0 0 0 13 0 >> EPSSolve 1 1.0 3.7193e+02 1.0 9.59e+07 1.1 3.5e+02 >> 4.2e+03 9.7e+01 69 91 35 1 22 69 91 35 1 27 1 >> IPOrthogonalize 19 1.0 3.4406e-01 1.1 6.75e+07 1.1 2.3e+02 >> 4.2e+03 7.6e+01 0 64 22 1 17 0 64 22 1 21 767 >> IPInnerProduct 153 1.0 3.1410e-01 1.0 5.63e+07 1.1 2.3e+02 >> 4.2e+03 3.9e+01 0 53 23 1 9 0 53 23 1 11 700 >> IPApplyMatrix 39 1.0 2.4903e-01 1.1 4.38e+07 1.1 2.3e+02 >> 4.2e+03 0.0e+00 0 42 23 1 0 0 42 23 1 0 687 >> UpdateVectors 1 1.0 4.2958e-03 1.2 4.51e+06 1.1 0.0e+00 >> 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 4107 >> VecDot 1 1.0 5.6815e-04 4.7 2.97e+04 1.1 0.0e+00 >> 0.0e+00 1.0e+00 0 0 0 0 0 0 0 0 0 0 204 >> VecNorm 8 1.0 2.5260e-03 3.2 2.38e+05 1.1 0.0e+00 >> 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 368 >> VecScale 27 1.0 5.9605e-04 1.1 4.01e+05 1.1 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2629 >> VecCopy 53 1.0 4.0610e-03 1.4 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecSet 77 1.0 6.2165e-03 1.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecAXPY 38 1.0 2.7709e-03 1.7 1.13e+06 1.1 0.0e+00 >> 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 1592 >> VecMAXPY 38 1.0 2.5925e-02 1.1 1.13e+07 1.1 0.0e+00 >> 0.0e+00 0.0e+00 0 11 0 0 0 0 11 0 0 0 1701 >> VecAssemblyBegin 5 1.0 9.0070e-03 2.3 0.00e+00 0.0 3.6e+01 >> 2.1e+04 1.5e+01 0 0 4 0 3 0 0 4 0 4 0 >> VecAssemblyEnd 5 1.0 3.4809e-04 1.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecScatterBegin 73 1.0 8.5931e-03 1.5 0.00e+00 0.0 4.6e+02 >> 8.9e+03 0.0e+00 0 0 45 2 0 0 0 45 2 0 0 >> VecScatterEnd 73 1.0 2.2542e-02 2.2 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecReduceArith 76 1.0 3.0838e-02 1.1 1.24e+07 1.1 0.0e+00 >> 0.0e+00 0.0e+00 0 12 0 0 0 0 12 0 0 0 1573 >> VecReduceComm 38 1.0 4.8040e-02 2.0 0.00e+00 0.0 0.0e+00 >> 0.0e+00 3.8e+01 0 0 0 0 8 0 0 0 0 11 0 >> VecNormalize 8 1.0 2.7280e-03 2.8 3.56e+05 1.1 0.0e+00 >> 0.0e+00 8.0e+00 0 0 0 0 2 0 0 0 0 2 511 >> MatMult 67 1.0 4.1397e-01 1.1 7.53e+07 1.1 4.0e+02 >> 4.2e+03 0.0e+00 0 71 40 1 0 0 71 40 1 0 710 >> MatSolve 28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 95 0 0 0 0 95 0 0 0 0 0 >> MatLUFactorSym 1 1.0 3.6097e-04 1.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatLUFactorNum 1 1.0 1.0464e+01 1.0 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 >> MatAssemblyBegin 9 1.0 3.3842e-0146.7 0.00e+00 0.0 5.4e+01 >> 6.0e+04 8.0e+00 0 0 5 2 2 0 0 5 2 2 0 >> MatAssemblyEnd 9 1.0 2.3042e-01 1.0 0.00e+00 0.0 3.6e+01 >> 9.4e+02 3.1e+01 0 0 4 0 7 0 0 4 0 9 0 >> MatGetRow 5206 1.1 3.1164e-03 1.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatGetSubMatrice 5 1.0 8.7580e-01 1.2 0.00e+00 0.0 1.5e+02 >> 1.1e+06 2.5e+01 0 0 15 88 6 0 0 15 88 7 0 >> MatZeroEntries 2 1.0 1.0233e-02 1.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> 
MatView 2 1.0 1.0149e-03 2.0 0.00e+00 0.0 0.0e+00 >> 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 1 0 >> KSPSetup 1 1.0 2.8610e-06 1.5 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> KSPSolve 28 1.0 5.1758e+02 1.0 0.00e+00 0.0 0.0e+00 >> 0.0e+00 2.8e+01 95 0 0 0 6 95 0 0 0 8 0 >> PCSetUp 1 1.0 1.0467e+01 1.0 0.00e+00 0.0 0.0e+00 >> 0.0e+00 8.0e+00 2 0 0 0 2 2 0 0 0 2 0 >> PCApply 28 1.0 5.1757e+02 1.0 0.00e+00 0.0 0.0e+00 >> 0.0e+00 0.0e+00 95 0 0 0 0 95 0 0 0 0 0 >> ------------------------------------------------------------------------------------------------------------------------ >> >> >> >> Memory usage is given in bytes: >> >> Object Type Creations Destructions Memory Descendants' >> Mem. >> >> --- Event Stage 0: Main Stage >> >> Spectral Transform 1 1 536 0 >> Eigenproblem Solver 1 1 824 0 >> Inner product 1 1 428 0 >> Index Set 38 38 1796776 0 >> IS L to G Mapping 1 1 58700 0 >> Vec 65 65 5458584 0 >> Vec Scatter 9 9 7092 0 >> Application Order 1 1 155232 0 >> Matrix 17 16 17715680 0 >> Krylov Solver 1 1 832 0 >> Preconditioner 1 1 744 0 >> Viewer 2 2 1088 0 >> ======================================================================================================================== >> >> >> Average time to get PetscTime(): 1.90735e-07 >> Average time for MPI_Barrier(): 5.9557e-05 >> Average time for zero size MPI_Send(): 2.97427e-05 >> #PETSc Option Table entries: >> -log_summary >> -mat_superlu_dist_parsymbfact >> #End o PETSc Option Table entries >> Compiled without FORTRAN kernels >> Compiled with full precision matrices (default) >> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 >> sizeof(PetscScalar) 8 >> Configure run at: Wed May 6 15:14:39 2009 >> Configure options: --download-superlu_dist=1 --download-parmetis=1 >> --with-mpi-dir=/usr/lib/mpich --with-shared=0 >> ----------------------------------------- >> Libraries compiled on Wed May 6 15:14:49 CEST 2009 on medusa1 >> Machine characteristics: Linux medusa1 2.6.18-6-amd64 #1 SMP Fri Dec >> 12 05:49:32 UTC 2008 x86_64 GNU/Linux >> Using PETSc directory: >> /home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5 >> Using PETSc arch: linux-gnu-c-debug >> ----------------------------------------- >> Using C compiler: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings >> -Wno-strict-aliasing -g3 Using Fortran compiler: >> /usr/lib/mpich/bin/mpif77 -Wall -Wno-unused-variable -g >> ----------------------------------------- >> Using include paths: >> -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include >> >> -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/include >> -I/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/include >> >> -I/usr/lib/mpich/include ------------------------------------------ >> Using C linker: /usr/lib/mpich/bin/mpicc -Wall -Wwrite-strings >> -Wno-strict-aliasing -g3 >> Using Fortran linker: /usr/lib/mpich/bin/mpif77 -Wall >> -Wno-unused-variable -g Using libraries: >> -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib >> >> -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib >> -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec >> -lpetsc -lX11 >> -Wl,-rpath,/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib >> >> -L/home/fredrik/Hakan/cmlfet/external/petsc-3.0.0-p5/linux-gnu-c-debug/lib >> -lsuperlu_dist_2.3 -llapack -lblas -lparmetis -lmetis -lm >> -L/usr/lib/mpich/lib -L/usr/lib/gcc/x86_64-linux-gnu/4.1.2 >> -L/usr/lib64 -L/lib64 -ldl -lmpich -lpthread 
-lrt -lgcc_s -lg2c -lm >> -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/lib -lm -ldl -lmpich >> -lpthread -lrt -lgcc_s -ldl >> ------------------------------------------ >> >> real 9m10.616s >> user 0m23.921s >> sys 0m6.944s >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Satish Balay wrote: >> Just a note about scalability: its a function of the hardware as >> well.. For proper scalability studies - you'll need a true distributed >> system with fast network [not SMP nodes..] >> >> Satish >> >> On Fri, 8 May 2009, Fredrik Bengzon wrote: >> >> >> Hong, >> Thank you for the suggestions, but I have looked at the EPS and KSP >> objects >> and I can not find anything wrong. The problem is that it takes >> longer to >> solve with 4 cpus than with 2 so the scalability seems to be absent >> when using >> superlu_dist. I have stored my mass and stiffness matrix in the >> mpiaij format >> and just passed them on to slepc. When using the petsc iterative krylov >> solvers i see 100% workload on all processors but when i switch to >> superlu_dist only two cpus seem to do the whole work of LU factoring. >> I don't >> want to use the krylov solver though since it might cause slepc not to >> converge. >> Regards, >> Fredrik >> >> Hong Zhang wrote: >> >> Run your code with '-eps_view -ksp_view' for checking >> which methods are used >> and '-log_summary' to see which operations dominate >> the computation. >> >> You can turn on parallel symbolic factorization >> with '-mat_superlu_dist_parsymbfact'. >> >> Unless you use large num of processors, symbolic factorization >> takes ignorable execution time. The numeric >> factorization usually dominates. >> >> Hong >> >> On Fri, 8 May 2009, Fredrik Bengzon wrote: >> >> >> Hi Petsc team, >> Sorry for posting questions not really concerning the petsc core, but >> when >> I run superlu_dist from within slepc I notice that the load balance is >> poor. It is just fine during assembly (I use Metis to partition my >> finite >> element mesh) but when calling the slepc solver it dramatically >> changes. I >> use superlu_dist as solver for the eigenvalue iteration. My question is: >> can this have something to do with the fact that the option 'Parallel >> symbolic factorization' is set to false? If so, can I change the options >> to superlu_dist using MatSetOption for instance? Also, does this mean >> that >> superlu_dist is not using parmetis to reorder the matrix? >> Best Regards, >> Fredrik Bengzon >> >> >> >> >> >> >> >> >> >> >> >> --What most experimenters take for granted before they begin their >> experiments is infinitely more interesting than any results to which >> their experiments lead. >> -- Norbert Wiener > >
