Michele Rosso <mro...@uci.edu> writes: > Hi Jed, > > I attached the output for both the runs you suggested. At the beginning > of each file I included the options I used. > > On a side note, I tried to run with a grid of 256^3 (exactly as before) > but with less levels, i.e. 3 instead of 4 or 5. > My system stops the run because of an Out Of Memory condition. It is > really odd since I have not changed anything except > - pc_mg_levels. I cannot send you any output since there is none. Do > you have any guess where the problem comes from?
The selected algorithm does a direct solve on the coarse grid. Each time you reduce the number of levels, the coarse grid size grows by a factor of 8. Going from 5 to 3 levels is going from a 16^3 coarse grid to a 64^3 coarse grid. Applying a direct solver to the latter ends up using a lot of memory. I think this is not worth bothering with and it might even be (slightly) faster to use 6 levels. That is not where the time is being spent. > -log_summary -ksp_monitor -ksp_view -ksp_converged_reason -pc_type mg > -pc_mg_galerkin -pc_mg_levels 5 -mg_levels_ksp_type richardson > -mg_levels_ksp_max_it 1 > > > > 0 KSP Residual norm 3.653965664551e-05 > 1 KSP Residual norm 1.910638846094e-06 > 2 KSP Residual norm 8.690440116045e-08 > 3 KSP Residual norm 3.732213639394e-09 > 4 KSP Residual norm 1.964855338020e-10 This converges well. > Max Max/Min Avg Total > Time (sec): 4.048e+00 1.00012 4.048e+00 > Objects: 2.490e+02 1.00000 2.490e+02 > Flops: 2.663e+08 1.00000 2.663e+08 2.130e+09 > Flops/sec: 6.579e+07 1.00012 6.579e+07 5.263e+08 > MPI Messages: 6.820e+02 1.00000 6.820e+02 5.456e+03 > MPI Message Lengths: 8.245e+06 1.00000 1.209e+04 6.596e+07 > MPI Reductions: 4.580e+02 1.00000 > VecTDot 12 1.0 2.9428e-02 1.2 6.29e+06 1.0 0.0e+00 0.0e+00 > 1.2e+01 1 2 0 0 3 1 2 0 0 3 1710 > VecNorm 9 1.0 1.0796e-02 1.2 4.72e+06 1.0 0.0e+00 0.0e+00 > 9.0e+00 0 2 0 0 2 0 2 0 0 2 3497 > VecScale 24 1.0 2.4652e-04 1.1 1.99e+05 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 6442 > VecCopy 3 1.0 5.0740e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 116 1.0 1.4349e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 12 1.0 2.8027e-02 1.0 6.29e+06 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 2 0 0 0 1 2 0 0 0 1796 > VecAYPX 29 1.0 3.0655e-02 1.4 4.16e+06 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 2 0 0 0 1 2 0 0 0 1085 > VecScatterBegin 123 1.0 3.5391e-02 1.1 0.00e+00 0.0 3.5e+03 1.2e+04 > 0.0e+00 1 0 65 66 0 1 0 65 66 0 0 > VecScatterEnd 123 1.0 2.5395e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatMult 31 1.0 2.3556e-01 1.0 5.62e+07 1.0 1.0e+03 2.3e+04 > 0.0e+00 6 21 19 36 0 6 21 19 36 0 1908 > MatMultAdd 24 1.0 5.9044e-02 1.0 1.21e+07 1.0 5.8e+02 2.8e+03 > 0.0e+00 1 5 11 2 0 1 5 11 2 0 1644 > MatMultTranspose 28 1.0 7.4601e-02 1.1 1.42e+07 1.0 6.7e+02 2.8e+03 > 0.0e+00 2 5 12 3 0 2 5 12 3 0 1518 > MatSolve 6 1.0 3.8311e-03 1.0 1.44e+06 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 3006 > MatSOR 48 1.0 5.8050e-01 1.0 1.01e+08 1.0 8.6e+02 1.5e+04 > 4.8e+01 14 38 16 19 10 14 38 16 19 11 1390 Most of the solve time is in MatSOR and MatMult. That's expected since the subdomains are pretty big. > MatLUFactorSym 1 1.0 3.0620e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 1 0 0 0 0 1 0 > MatLUFactorNum 1 1.0 2.4665e-02 1.0 1.95e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 7 0 0 0 1 7 0 0 0 6329 > MatAssemblyBegin 20 1.0 2.4351e-02 6.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.2e+01 0 0 0 0 5 0 0 0 0 5 0 > MatAssemblyEnd 20 1.0 1.3176e-01 1.0 0.00e+00 0.0 5.6e+02 2.1e+03 > 7.2e+01 3 0 10 2 16 3 0 10 2 16 0 > MatGetRowIJ 1 1.0 1.1516e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 4.1008e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatView 16 1.3 1.0209e-03 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.2e+01 0 0 0 0 3 0 0 0 0 3 0 > MatPtAP 4 1.0 6.4001e-01 1.0 4.06e+07 1.0 1.1e+03 1.7e+04 > 1.0e+02 16 15 21 30 22 16 15 21 30 22 507 MatPtAP dominates the setup time. For profiling, you could register a stage (PetscLogStageRegister) and time the setup separately from the solve. > MatPtAPSymbolic 4 1.0 3.7003e-01 1.0 0.00e+00 0.0 7.2e+02 2.0e+04 > 6.0e+01 9 0 13 22 13 9 0 13 22 13 0 > MatPtAPNumeric 4 1.0 2.7004e-01 1.0 4.06e+07 1.0 4.2e+02 1.2e+04 > 4.0e+01 7 15 8 8 9 7 15 8 8 9 1202 > MatGetRedundant 1 1.0 7.9393e-04 1.0 0.00e+00 0.0 1.7e+02 7.1e+03 > 4.0e+00 0 0 3 2 1 0 0 3 2 1 0 > MatGetLocalMat 4 1.0 3.9521e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 8.0e+00 1 0 0 0 2 1 0 0 0 2 0 > MatGetBrAoCol 4 1.0 1.7719e-02 1.0 0.00e+00 0.0 4.3e+02 2.7e+04 > 8.0e+00 0 0 8 18 2 0 0 8 18 2 0 > MatGetSymTrans 8 1.0 1.3007e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSetUp 7 1.0 1.3097e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.4e+01 0 0 0 0 5 0 0 0 0 5 0 > KSPSolve 2 1.0 1.0450e+00 1.0 2.04e+08 1.0 3.4e+03 1.2e+04 > 7.5e+01 26 77 62 60 16 26 77 62 60 16 1563 > PCSetUp 1 1.0 8.6248e-01 1.0 6.21e+07 1.0 1.9e+03 1.1e+04 > 3.2e+02 21 23 35 32 69 21 23 35 32 69 576 > PCApply 6 1.0 8.4384e-01 1.0 1.61e+08 1.0 3.2e+03 9.0e+03 > 4.8e+01 21 60 59 44 10 21 60 59 44 11 1523 Do you know why there are 6 PCApply events? With four iterations of the Krylov method, there should be only 5 events. Oh, it looks like you do two solves. Is one of those with a different system? Recall that the old KSPSolve time was over 3.35 seconds. > -log_summary -ksp_monitor -ksp_view -ksp_converged_reason -pc_type mg > -pc_mg_galerkin -pc_mg_levels 5 -mg_levels_ksp_type richardson > -mg_levels_ksp_max_it 1 -pc_mg_type full > > 0 KSP Residual norm 3.654533581988e-05 > 1 KSP Residual norm 8.730776244351e-07 > 2 KSP Residual norm 3.474626061661e-08 > 3 KSP Residual norm 1.813665557493e-09 This converges slightly faster, but ends up not paying off. > Time (sec): 4.261e+00 1.00012 4.261e+00 > Objects: 2.950e+02 1.00000 2.950e+02 > Flops: 3.322e+08 1.00000 3.322e+08 2.658e+09 > Flops/sec: 7.797e+07 1.00012 7.796e+07 6.237e+08 > MPI Messages: 1.442e+03 1.00000 1.442e+03 1.154e+04 > MPI Message Lengths: 1.018e+07 1.00000 7.057e+03 8.141e+07 > MPI Reductions: 5.460e+02 1.00000 More messages, more work, etc., so not better. > KSPSolve 2 1.0 1.2287e+00 1.0 2.70e+08 1.0 9.5e+03 5.8e+03 > 1.6e+02 29 81 82 68 30 29 81 82 68 30 1758 > PCSetUp 1 1.0 8.6414e-01 1.0 6.21e+07 1.0 1.9e+03 1.1e+04 > 3.2e+02 20 19 17 26 58 20 19 17 26 58 575 > PCApply 5 1.0 1.0571e+00 1.0 2.33e+08 1.0 9.3e+03 4.9e+03 > 1.4e+02 24 70 81 56 26 24 70 81 56 26 1764 It's still entirely possible that you can make Full MG beat V-cycles, especially if you only need to converge up to discretization error. By my figures, your good solver takes 12 work units to converge well below discretization error (after Galerkin setup, but maybe you only need to do that once?). If you only need to equal truncation error, this can be brought down to about 5 (probably at best a 2x speedup in parallel). This would involve a high-order (cubic) FMG prolongation. Alternatively, you can speed up the implementation (and significantly reduce memory usage) by creating geometric coarse levels and a matrix-free implementation of MatSOR and MatMult. (The matrices are great for experimenting, but if this solver is mission critical and still a bottleneck, the matrix is an inefficient way to represent the operator since it has very low arithmetic intensity/requires a lot of memory bandwidth.) I predict you can probably speed up the solve by perhaps another factor of 2 with a good matrix-free FMG implementation. Do you want to go down this path?
pgpQRGdKhpzmu.pgp
Description: PGP signature