Yikes, it looks like we have been off the list this whole time. I am not the only PETSc developer, nor the only person who knows about PETSc!

These folks are seeing some strange behavior with GAMG going from 1 to 2 cores, using lots of memory, but one question they have, which I don't understand either, is this:

>> Yea, my interpretation of these methods is also that "PetscMemoryGetMaximumUsage" should be >= "PetscMallocGetMaximumUsage".
>> But you are seeing the opposite.

We are using PETSc main and have found a case where memory consumption explodes in parallel. Also, we see a non-negligible difference between PetscMemoryGetMaximumUsage() and PetscMallocGetMaximumUsage(). Running in serial through /usr/bin/time, the maximum resident set size matches the PetscMallocGetMaximumUsage() result. I would have expected it to match PetscMemoryGetMaximumUsage() instead.

                   PetscMemoryGetMaximumUsage  PetscMallocGetMaximumUsage  Time
Serial + Option 1  4.8 GB                      7.4 GB                      112 sec
2 core + Option 1  15.2 GB                     45.5 GB                     150 sec
Serial + Option 2  3.1 GB                      3.8 GB                      167 sec
2 core + Option 2  13.1 GB                     17.4 GB                     89 sec
Serial + Option 3  4.7 GB                      5.2 GB                      693 sec
2 core + Option 3  23.2 GB                     26.4 GB                     411 sec
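For reference, a minimal sketch of how these two counters are read; note that PetscMemorySetGetMaximumUsage() has to be called early or the resident-set high-water mark is not tracked, and PetscMallocGetMaximumUsage() only counts memory obtained through PetscMalloc():

    #include <petscsys.h>

    int main(int argc, char **argv)
    {
      PetscLogDouble mem_max, malloc_max;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      /* Enable tracking of the resident-set high-water mark; without this,
         PetscMemoryGetMaximumUsage() cannot report a meaningful maximum. */
      PetscCall(PetscMemorySetGetMaximumUsage());

      /* ... assemble and solve here ... */

      PetscCall(PetscMemoryGetMaximumUsage(&mem_max));    /* OS view: peak resident set size */
      PetscCall(PetscMallocGetMaximumUsage(&malloc_max)); /* PETSc view: peak PetscMalloc'd  */
      PetscCall(PetscPrintf(PETSC_COMM_WORLD, "peak RSS %g bytes, peak PetscMalloc %g bytes\n", mem_max, malloc_max));
      PetscCall(PetscFinalize());
      return 0;
    }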
On Thu, Apr 18, 2024 at 4:13 PM Mark Adams <[email protected]> wrote:

The next thing you might try is not using the null space argument. Hypre does not use it, but GAMG does.
You could also run with -malloc_view to see some info on mallocs. It is probably in the Mat objects.
You can also run with "-info", grep on GAMG in the output, and send that.

Mark

On Thu, Apr 18, 2024 at 12:03 PM Ashish Patel <[email protected]> wrote:

Hi Mark,

Thanks for your response and suggestion. With hypre, both memory and time look good; here is the data for that:

                   PetscMemoryGetMaximumUsage  PetscMallocGetMaximumUsage  Time
Serial + Option 4  5.55 GB                     5.17 GB                     15.7 sec
2 core + Option 4  5.85 GB                     4.69 GB                     21.9 sec

Option 4
mpirun -n _ ./ex1 -A_name matrix.dat -b_name vector.dat -n_name _null_space.dat -num_near_nullspace 6 -ksp_type cg -pc_type hypre -pc_hypre_boomeramg_strong_threshold 0.9 -ksp_view -log_view -log_view_memory -info :pc

I am also attaching a standalone program to reproduce these options, along with a link to the matrix, RHS, and near null spaces (serial.tar 2.xz, https://ansys-my.sharepoint.com/:u:/p/ashish_patel/EbUM5Ahp-epNi4xDxR9mnN0B1dceuVzGhVXQQYJzI5Py2g), if you would like to try them as well. Please let me know if you have trouble accessing the link.

Ashish
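The attached program itself is not part of this archive; a minimal sketch of a driver taking the same options might look as follows. The per-vector file naming for the near null space is a guess, and MatNullSpaceCreate() expects the vectors to be orthonormal:

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      Mat          A;
      Vec          b, x, nsp[6];
      KSP          ksp;
      MatNullSpace nearnull;
      PetscViewer  vwr;
      char         Aname[PETSC_MAX_PATH_LEN], bname[PETSC_MAX_PATH_LEN];
      char         nname[PETSC_MAX_PATH_LEN], fname[PETSC_MAX_PATH_LEN];
      PetscInt     i, nns = 6;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      PetscCall(PetscOptionsGetString(NULL, NULL, "-A_name", Aname, sizeof(Aname), NULL));
      PetscCall(PetscOptionsGetString(NULL, NULL, "-b_name", bname, sizeof(bname), NULL));
      PetscCall(PetscOptionsGetString(NULL, NULL, "-n_name", nname, sizeof(nname), NULL));
      PetscCall(PetscOptionsGetInt(NULL, NULL, "-num_near_nullspace", &nns, NULL));
      PetscCheck(nns <= 6, PETSC_COMM_WORLD, PETSC_ERR_SUP, "sketch supports at most 6 vectors");

      /* Load the matrix and right-hand side from PETSc binary files */
      PetscCall(PetscViewerBinaryOpen(PETSC_COMM_WORLD, Aname, FILE_MODE_READ, &vwr));
      PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
      PetscCall(MatLoad(A, vwr));
      PetscCall(PetscViewerDestroy(&vwr));
      PetscCall(PetscViewerBinaryOpen(PETSC_COMM_WORLD, bname, FILE_MODE_READ, &vwr));
      PetscCall(VecCreate(PETSC_COMM_WORLD, &b));
      PetscCall(VecLoad(b, vwr));
      PetscCall(PetscViewerDestroy(&vwr));

      /* Load the near-nullspace vectors (the "0_null_space.dat", ... naming
         is a guess) and attach them to the matrix for GAMG */
      for (i = 0; i < nns; i++) {
        PetscCall(PetscSNPrintf(fname, sizeof(fname), "%d%s", (int)i, nname));
        PetscCall(PetscViewerBinaryOpen(PETSC_COMM_WORLD, fname, FILE_MODE_READ, &vwr));
        PetscCall(VecCreate(PETSC_COMM_WORLD, &nsp[i]));
        PetscCall(VecLoad(nsp[i], vwr));
        PetscCall(PetscViewerDestroy(&vwr));
      }
      PetscCall(MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_FALSE, nns, nsp, &nearnull));
      PetscCall(MatSetNearNullSpace(A, nearnull));
      PetscCall(MatNullSpaceDestroy(&nearnull));

      /* Solver configuration (-ksp_type, -pc_type, ...) comes from the
         command lines quoted in this thread */
      PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
      PetscCall(KSPSetOperators(ksp, A, A));
      PetscCall(KSPSetFromOptions(ksp));
      PetscCall(VecDuplicate(b, &x));
      PetscCall(KSPSolve(ksp, b, x));

      for (i = 0; i < nns; i++) PetscCall(VecDestroy(&nsp[i]));
      PetscCall(VecDestroy(&x));
      PetscCall(VecDestroy(&b));
      PetscCall(MatDestroy(&A));
      PetscCall(KSPDestroy(&ksp));
      PetscCall(PetscFinalize());
      return 0;
    }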
------------------------------
From: Mark Adams <[email protected]>
Sent: Wednesday, April 17, 2024 7:52 PM
To: Jeremy Theler (External) <[email protected]>
Cc: Ashish Patel <[email protected]>; Scott McClennan <[email protected]>
Subject: Re: About recent changes in GAMG

On Wed, Apr 17, 2024 at 7:20 AM Jeremy Theler (External) <[email protected]> wrote:

> Hey Mark. Long time no see! How are things going over there?
>
> We are using PETSc main and have found a case where memory consumption explodes in parallel. Also, we see a non-negligible difference between PetscMemoryGetMaximumUsage() and PetscMallocGetMaximumUsage(). Running in serial through /usr/bin/time, the maximum resident set size matches the PetscMallocGetMaximumUsage() result. I would have expected it to match PetscMemoryGetMaximumUsage() instead.

Yea, my interpretation of these methods is also that "Memory" should be >= "Malloc". But you are seeing the opposite.

I don't have any idea what is going on with your big memory penalty going from 1 to 2 cores on this test, but the first thing to do is to try other solvers and see how they behave. Hypre in particular would be a good thing to try because it is a similar algorithm.

Mark

> The matrix size is around 1 million. We can share it with you if you want, along with the RHS, the 6 near-nullspace vectors, and a modified ex1.c which will read these files and show the following behavior.
>
> Observations using the latest main for an elastic matrix with a block size of 3 (after removing bonded glue-like DOFs with direct elimination) and with the near null space provided:
>
> - Big memory penalty going from serial to parallel (2 cores).
> - Big difference between PetscMemoryGetMaximumUsage and PetscMallocGetMaximumUsage, why?
> - The memory penalty decreases with -pc_gamg_aggressive_square_graph false (option 2).
> - The difference between PetscMemoryGetMaximumUsage and PetscMallocGetMaximumUsage shrinks when -pc_gamg_threshold is increased from 0 to 0.01 (option 3), although the solve time increases a lot.
>
>                    PetscMemoryGetMaximumUsage  PetscMallocGetMaximumUsage  Time
> Serial + Option 1  4.8 GB                      7.4 GB                      112 sec
> 2 core + Option 1  15.2 GB                     45.5 GB                     150 sec
> Serial + Option 2  3.1 GB                      3.8 GB                      167 sec
> 2 core + Option 2  13.1 GB                     17.4 GB                     89 sec
> Serial + Option 3  4.7 GB                      5.2 GB                      693 sec
> 2 core + Option 3  23.2 GB                     26.4 GB                     411 sec
>
> Option 1
> mpirun -n _ ./ex1 -A_name matrix.dat -b_name vector.dat -n_name _null_space.dat -num_near_nullspace 6 -ksp_type cg -pc_type gamg -pc_gamg_coarse_eq_limit 1000 -ksp_view -log_view -log_view_memory -pc_gamg_aggressive_square_graph true -pc_gamg_threshold 0.0 -info :pc
>
> Option 2
> mpirun -n _ ./ex1 -A_name matrix.dat -b_name vector.dat -n_name _null_space.dat -num_near_nullspace 6 -ksp_type cg -pc_type gamg -pc_gamg_coarse_eq_limit 1000 -ksp_view -log_view -log_view_memory -pc_gamg_aggressive_square_graph false -pc_gamg_threshold 0.0 -info :pc
>
> Option 3
> mpirun -n _ ./ex1 -A_name matrix.dat -b_name vector.dat -n_name _null_space.dat -num_near_nullspace 6 -ksp_type cg -pc_type gamg -pc_gamg_coarse_eq_limit 1000 -ksp_view -log_view -log_view_memory -pc_gamg_aggressive_square_graph true -pc_gamg_threshold 0.01 -info :pc

------------------------------
From: Mark Adams <[email protected]>
Sent: Tuesday, November 14, 2023 11:28 AM
To: Jeremy Theler (External) <[email protected]>
Cc: Ashish Patel <[email protected]>
Subject: Re: About recent changes in GAMG

Sounds good.

I think the non-square-graph "aggressive" coarsening is the only issue that I see, and you can fix it by using:

-mat_coarsen_type mis

As an aside, '-pc_gamg_aggressive_square_graph' should do it also; you can use both, and they will be ignored by earlier versions (a sketch of setting both in code follows this message).

If you see a difference, then the first thing to do is run with '-info :pc' and send that to me (you can grep on 'GAMG' and send just that, if you like, to reduce the data).

Thanks,
Mark
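A minimal sketch of pinning these options in code, assuming they are set before KSPSetFromOptions(); the option names are the ones quoted above:

    #include <petscksp.h>

    /* Sketch: set the coarsening options from this thread programmatically,
       so every run picks them up regardless of the command line.  Options
       unknown to older PETSc versions are simply left unused (they show up
       under -options_left). */
    static PetscErrorCode SetCoarseningOptions(KSP ksp)
    {
      PetscFunctionBeginUser;
      PetscCall(PetscOptionsSetValue(NULL, "-mat_coarsen_type", "mis"));
      PetscCall(PetscOptionsSetValue(NULL, "-pc_gamg_aggressive_square_graph", "true"));
      PetscCall(KSPSetFromOptions(ksp)); /* must come after the options are set */
      PetscFunctionReturn(PETSC_SUCCESS);
    }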
On Tue, Nov 14, 2023 at 8:49 AM Jeremy Theler (External) <[email protected]> wrote:

Hi Mark.

Thanks for reaching out. For now, we are going to stick to 3.19 for our production code, because the changes in 3.20 impact our tests in different ways (some of them perform better, some perform worse).
I have now switched to another task, investigating structural elements in DMPlex.
I'll go back to analyzing the new changes in GAMG in a couple of weeks, so we can then decide whether to upgrade to 3.20 or wait for 3.21.

Thanks for your work and your kindness.
--
jeremy

------------------------------
From: Mark Adams <[email protected]>
Sent: Tuesday, November 14, 2023 9:35 AM
To: Jeremy Theler (External) <[email protected]>
Cc: Ashish Patel <[email protected]>
Subject: Re: About recent changes in GAMG

Hi Jeremy,

Just following up.
I appreciate your digging into performance regressions in GAMG.
AMG is really a pain sometimes, and we want GAMG to be solid, at least for mainstream options, so your efforts are appreciated.
So feel free to start this discussion up.

Thanks,
Mark

On Wed, Oct 25, 2023 at 9:52 PM Jeremy Theler (External) <[email protected]> wrote:

Dear Mark,

Thanks for the follow-up, and sorry for the delay; I'm taking some days off. I'll be back at full throttle next week, so we can continue the discussion about these changes in GAMG.

Regards,
Jeremy

------------------------------
From: Mark Adams <[email protected]>
Sent: Wednesday, October 18, 2023 9:15 AM
To: Jeremy Theler (External) <[email protected]>; PETSc users list <[email protected]>
Cc: Ashish Patel <[email protected]>
Subject: Re: About recent changes in GAMG

Hi Jeremy,

I hope you don't mind putting this on the list (w/o data), but this is documentation and you are the second user who found regressions. Sorry for the churn.

There is a lot here, so we can iterate, but here is a pass at your questions.

*** Using MIS-2 instead of the square graph was motivated by setup cost/performance, but on GPUs, with some recent fixes in Kokkos (in a branch), the square graph seems OK.
My experience was that the square graph is better in terms of quality, and we have a power user, like you all, who found this also.
So I switched the default back to the square graph.

Interesting that you found that MIS-2 (the new method) could be faster, but it might be because the two methods coarsen at different rates, and that can make a big difference.
(The way to test would be to adjust parameters to get similar coarsening rates, but I digress.)
It's hard to understand the differences between these two methods in terms of aggregate quality, so we need to just experiment and have options.

*** As for your thermal problem: there was a complaint that the eigen estimates for the Chebyshev smoother were not recomputed for nonlinear problems, so I added an option to do that and turned it on by default.
Use '-pc_gamg_recompute_esteig false' to get back to the original behavior. (I should have turned it off by default.)

Now, if your problem is symmetric and you use CG to compute the eigen estimates, there should be no difference.
If you use CG to compute the eigen estimates in GAMG (and have GAMG give them to cheby, the default), note that when the eigen estimates are recomputed, the cheby eigen estimator is used, and it will use GMRES by default unless you set the SPD property on your matrix.
So if you set '-pc_gamg_esteig_ksp_type cg', you want to also set '-mg_levels_esteig_ksp_type cg' (verify with -ksp_view and -options_left). CG is a much better estimator for SPD.
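A minimal sketch of that combination, assuming the matrix really is SPD (MatSetOption() with MAT_SPD is the standard way to declare the property; the esteig option names are the ones above):

    #include <petscksp.h>

    /* Sketch: declare the operator SPD so eigen estimators can take the
       CG-friendly path, and request CG for both eigen-estimate solves
       named above.  Verify with -ksp_view and -options_left. */
    static PetscErrorCode UseCGEigenEstimates(Mat A)
    {
      PetscFunctionBeginUser;
      PetscCall(MatSetOption(A, MAT_SPD, PETSC_TRUE));
      PetscCall(PetscOptionsSetValue(NULL, "-pc_gamg_esteig_ksp_type", "cg"));
      PetscCall(PetscOptionsSetValue(NULL, "-mg_levels_esteig_ksp_type", "cg"));
      PetscFunctionReturn(PETSC_SUCCESS);
    }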
And I found that the cheby eigen estimator uses a LAPACK eigenvalue method to compute the eigen bounds, while GAMG uses a singular-value method.
The two give very different results on the lid-driven cavity test (ex19). The eigenvalue estimate is lower, which is safer, but not optimal if it is too low.
I have a branch to have cheby use the singular-value method, but I don't plan on merging it (enough churn, and I don't understand these differences).

*** '-pc_gamg_low_memory_threshold_filter false' recovers the old filtering method.
This is the default now because there is a bug in the (new) low-memory filter.
This bug is very rare and catastrophic.
We are working on it and will turn it on by default when it's fixed.
This does not affect the semantics of the solver, just work and memory complexity.

*** As for tet4 vs tet10, I would guess that tet4 wants more aggressive coarsening.
The default is to do aggressive coarsening on one (1) level; you might want more levels for tet4.
And the new MIS-k coarsening can use any k (the default is 2) with '-mat_coarsen_misk_distance k' (e.g., k=3); see the sketch at the end of this thread.
I have not added hooks for a more complex schedule that specifies the method on each level.

Thanks,
Mark

On Tue, Oct 17, 2023 at 9:33 PM Jeremy Theler (External) <[email protected]> wrote:

Hey Mark,

Regarding the changes in the coarsening algorithm in 3.20 with respect to 3.19: in general, we see that for some problems the MIS strategy gives overall performance that is slightly better, and for some others slightly worse, than the "baseline" from 3.19.
We also saw that the current main has switched back to the old square-graph coarsening algorithm by default, which again is better in some cases and worse in others than 3.19 without any extra command-line options.

Now, what seems weird to us is that we have a test case, a heat conduction problem with radiation boundary conditions (so it is nonlinear) using tet10, where we see

1. that in parallel, v3.20 is way worse than v3.19, although the memory usage is similar;
2. that petsc main (with no extra flags, just the defaults) recovers the 3.19 performance, but memory usage is significantly larger.

I tried using the -pc_gamg_low_memory_threshold_filter flag and the results were the same.

Find attached the log and SNES views of 3.19, 3.20, and main using 4 MPI ranks.
Is there any explanation for these two points we are seeing?
Another weird finding is that if we use tet4 instead of tet10, v3.20 is only 10% slower than the other two, and main does not need more memory than the other two.

BTW, I have dozens of other log view outputs comparing 3.19, 3.20, and main, should you be interested.

Let me know if it is better to move this discussion to the PETSc mailing list.

Regards,
jeremy theler
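To make the MIS-k advice from the October 18 message concrete, a minimal sketch; the distance option is quoted in that message, while the coarsen type name "misk" is an assumption about how the MIS-k coarsener is selected:

    #include <petscksp.h>

    /* Sketch: distance-3 MIS-k coarsening, as suggested for tet4 meshes.
       The type name "misk" is an assumption; '-mat_coarsen_misk_distance'
       is the option quoted in the thread. */
    static PetscErrorCode UseMISkCoarsening(KSP ksp)
    {
      PetscFunctionBeginUser;
      PetscCall(PetscOptionsSetValue(NULL, "-mat_coarsen_type", "misk"));
      PetscCall(PetscOptionsSetValue(NULL, "-mat_coarsen_misk_distance", "3")); /* k = 3 */
      PetscCall(KSPSetFromOptions(ksp));
      PetscFunctionReturn(PETSC_SUCCESS);
    }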
