Should we have some simple selection of default algorithms based on problem size and number of processes? For example, if more than 1000 processes are used, switch to the scalable version, etc.? How would we decide on the parameter values?

Barry
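One possible shape for such a heuristic, purely as a sketch: the 1000-rank threshold is an arbitrary placeholder (choosing it is exactly the open question above), the two options are the ones Dave suggests below, and nothing here is an existing PETSc interface.

    #include <petscsys.h>

    /* Sketch only: switch to the memory-scalable PtAP variant on large communicators. */
    static PetscErrorCode SetPtAPDefaultsByCommSize(MPI_Comm comm)
    {
      PetscMPIInt    size;
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = MPI_Comm_size(comm,&size);CHKERRQ(ierr);
      if (size > 1000) {  /* placeholder threshold */
        /* do not explicitly assemble R = P^T */
        ierr = PetscOptionsSetValue(NULL,"-matrap","0");CHKERRQ(ierr);
        /* select the memory-scalable P^T.A.P implementation */
        ierr = PetscOptionsSetValue(NULL,"-matptap_scalable",NULL);CHKERRQ(ierr);
      }
      PetscFunctionReturn(0);
    }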
> On Sep 15, 2016, at 5:35 AM, Dave May <dave.mayhe...@gmail.com> wrote:
>
> Hi all,
>
> I think the only unexpected memory usage I can see is associated with the call to MatPtAP().
> Here is something you can try immediately. Run your code with the additional options -matrap 0 -matptap_scalable
>
> I didn't realize this before, but the default behaviour of MatPtAP in parallel is actually to explicitly form the transpose of P (i.e. assemble R = P^T) and then compute R.A.P. You don't want to do this. The option -matrap 0 resolves this issue.
>
> The implementation of P^T.A.P has two variants. The scalable implementation (with respect to memory usage) is selected via the second option -matptap_scalable.
>
> Try it out - I see a significant memory reduction using these options for particular mesh sizes / partitions.
>
> I've attached a cleaned up version of the code you sent me. There were a number of memory leaks and other issues. The main points being
> * You should call DMDAVecGetArrayF90() before VecAssembly{Begin,End}
> * You should call PetscFinalize(), otherwise the option -log_summary (-log_view) will not display anything once the program has completed.
>
> Thanks,
> Dave
>
> On 15 September 2016 at 08:03, Hengjie Wang <hengj...@uci.edu> wrote:
> Hi Dave,
>
> Sorry, I should have put more comments in to explain the code. The number of processes in each dimension is the same: Px = Py = Pz = P. So is the domain size. So if you want to run the code on a 512^3 grid with 16^3 cores, you need to set "-N 512 -P 16" on the command line. I added more comments and also fixed an error in the attached code. (The error only affects the accuracy of the solution, not the memory usage.)
>
> Thank you.
> Frank
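For concreteness, the reproduction run Frank describes, combined with the options Dave suggests above, would be launched roughly like this. The executable name is only a guess based on the attached test_ksp.F90, and the launcher (mpiexec here) will differ by machine; 4096 = 16^3 ranks.

    mpiexec -n 4096 ./test_ksp -N 512 -P 16 -matrap 0 -matptap_scalable -log_view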
> On 9/14/2016 9:05 PM, Dave May wrote:
>> On Thursday, 15 September 2016, Dave May <dave.mayhe...@gmail.com> wrote:
>> On Thursday, 15 September 2016, frank <hengj...@uci.edu> wrote:
>> Hi,
>>
>> I wrote a simple code to reproduce the error. I hope this can help to diagnose the problem. The code just solves a 3D Poisson equation.
>>
>> Why is the stencil width a runtime parameter?? And why is the default value 2? For 7-pnt FD Laplace, you only need a stencil width of 1.
>>
>> Was this choice made to mimic something in the real application code?
>>
>> Please ignore - I misunderstood your usage of the param set by -P
>>
>> I ran the code on a 1024^3 mesh. The process partition is 32 * 32 * 32. That's when I reproduced the OOM error. Each core has about 2G memory.
>> I also ran the code on a 512^3 mesh with 16 * 16 * 16 processes. The ksp solver works fine.
>> I attached the code, ksp_view_pre's output and my petsc option file.
>>
>> Thank you.
>> Frank
>>
>> On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>> Hi Barry,
>>>
>>> I checked. On the supercomputer, I had the option "-ksp_view_pre" but it is not in the file I sent you. I am sorry for the confusion.
>>>
>>> Regards,
>>> Frank
>>>
>>> On Friday, September 9, 2016, Barry Smith <bsm...@mcs.anl.gov> wrote:
>>> > On Sep 9, 2016, at 3:11 PM, frank <hengj...@uci.edu> wrote:
>>> >
>>> > Hi Barry,
>>> >
>>> > I think the first KSP view output is from -ksp_view_pre. Before I submitted the test, I was not sure whether there would be an OOM error or not. So I added both -ksp_view_pre and -ksp_view.
>>>
>>> But the options file you sent specifically does NOT list the -ksp_view_pre so how could it be from that?
>>>
>>> Sorry to be pedantic, but I've spent too much time in the past trying to debug from incorrect information and want to make sure that the information I have is correct before thinking. Please recheck exactly what happened. Rerun with the exact input file you emailed if that is needed.
>>>
>>> Barry
>>>
>>> > Frank
>>> >
>>> > On 09/09/2016 12:38 PM, Barry Smith wrote:
>>> >> Why does ksp_view2.txt have two KSP views in it while ksp_view1.txt has only one KSPView in it? Did you run two different solves in the second case but not in the first?
>>> >>
>>> >> Barry
>>> >>
>>> >>> On Sep 9, 2016, at 10:56 AM, frank <hengj...@uci.edu> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> I want to continue digging into the memory problem here. I did find a workaround in the past, which is to use fewer cores per node so that each core has 8G memory. However, this is inefficient and expensive. I hope to locate the place that uses the most memory.
>>> >>>
>>> >>> Here is a brief summary of the tests I did in the past:
>>> >>>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12
>>> >>> Maximum (over computational time) process memory: total 7.0727e+08
>>> >>> Current process memory: total 7.0727e+08
>>> >>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11
>>> >>> Current space PetscMalloc()ed: total 1.8275e+09
>>> >>>
>>> >>>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24
>>> >>> Maximum (over computational time) process memory: total 5.9431e+09
>>> >>> Current process memory: total 5.9431e+09
>>> >>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12
>>> >>> Current space PetscMalloc()ed: total 5.4844e+09
>>> >>>
>>> >>>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24
>>> >>> The OOM (Out Of Memory) killer of the supercomputer terminated the job during "KSPSolve".
>>> >>>
>>> >>> I attached the output of ksp_view (the third test's output is from ksp_view_pre), memory_view and also the petsc options.
>>> >>>
>>> >>> In all the tests, each core can access about 2G memory. In test3, there are 4223139840 non-zeros in the matrix. They will consume about 1.74 MB per process, using double precision. Considering some extra memory used to store the integer indices, 2G memory should still be more than enough.
>>> >>>
>>> >>> Is there a way to find out which part of KSPSolve uses the most memory? Thank you so much.
>>> >>>
>>> >>> BTW, there are 4 options that remain unused and I don't understand why they are omitted:
>>> >>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly
>>> >>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
>>> >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>>> >>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson
>>> >>>
>>> >>> Regards,
>>> >>> Frank
>>> >>>
>>> >>> On 07/13/2016 05:47 PM, Dave May wrote:
>>> >>>> On 14 July 2016 at 01:07, frank <hengj...@uci.edu> wrote:
>>> >>>> Hi Dave,
>>> >>>>
>>> >>>> Sorry for the late reply. Thank you so much for your detailed reply.
>>> >>>>
>>> >>>> I have a question about the estimation of the memory usage. There are 4223139840 allocated non-zeros and 18432 MPI processes. Double precision is used. So the memory per process is:
>>> >>>> 4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74M ?
>>> >>>> Did I do something wrong here? Because this seems too small.
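Spelling that arithmetic out with the thread's own numbers (assuming a perfectly even distribution of non-zeros over the ranks):

    4223139840 nonzeros / 18432 ranks           = 229,120 nonzeros per rank
    229,120 * 8 bytes (double-precision values) ~ 1.8 MB per rank (the 1.74M figure above, in MiB)
    229,120 * 4 bytes (32-bit column indices)   ~ 0.9 MB per rank

So the fine-grid matrix itself accounts for only a few MB per rank.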
>>> >>>> No - I totally f***ed it up. You are correct. That'll teach me for fumbling around with my iPhone calculator and not using my brain. (Note that to convert to MB just divide by 1e6, not 1024^2 - although I apparently cannot convert between units correctly....)
>>> >>>>
>>> >>>> From the PETSc objects associated with the solver, it looks like it _should_ run with 2 GB per MPI rank. Sorry for my mistake. Possibilities are: somewhere in your usage of PETSc you've introduced a memory leak; PETSc is doing a huge over-allocation (e.g. as per our discussion of MatPtAP); or in your application code there are other objects you have forgotten to log the memory for.
>>> >>>>
>>> >>>> I am running this job on Blue Waters. I am using the 7-point FD stencil in 3D.
>>> >>>>
>>> >>>> I thought so on both counts.
>>> >>>>
>>> >>>> I apologize that I made a stupid mistake in computing the memory per core. With my settings, each core can access only 2G of memory on average, instead of the 8G I mentioned in my previous email. I re-ran the job with 8G of memory per core on average and there is no "Out Of Memory" error. I will do more tests to see if there is still some memory issue.
>>> >>>>
>>> >>>> Ok. I'd still like to know where the memory was being used since my estimates were off.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Dave
>>> >>>>
>>> >>>> Regards,
>>> >>>> Frank
>>> >>>>
>>> >>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>> >>>>> Hi Frank,
>>> >>>>>
>>> >>>>> On 11 July 2016 at 19:14, frank <hengj...@uci.edu> wrote:
>>> >>>>> Hi Dave,
>>> >>>>>
>>> >>>>> I re-ran the test using bjacobi as the preconditioner on the coarse mesh of telescope. The grid is 3072*256*768 and the process mesh is 96*8*24. The petsc option file is attached.
>>> >>>>> I still got the "Out Of Memory" error. The error occurred before the linear solver finished one step, so I don't have the full info from ksp_view. The info from ksp_view_pre is attached.
>>> >>>>>
>>> >>>>> Okay - that is essentially useless (sorry)
>>> >>>>>
>>> >>>>> It seems to me that the error occurred when the decomposition was going to be changed.
>>> >>>>>
>>> >>>>> Based on what information?
>>> >>>>> Running with -info would give us more clues, but will create a ton of output. Please try running the case which failed with -info.
>>> >>>>>
>>> >>>>> I had another test with a grid of 1536*128*384 and the same process mesh as above. There was no error. The ksp_view info is attached for comparison.
>>> >>>>> Thank you.
>>> >>>>>
>>> >>>>> [3] Here is my crude estimate of your memory usage. I'll target the biggest memory hogs only, to get an order of magnitude estimate.
>>> >>>>>
>>> >>>>> * The fine grid operator contains 4223139840 non-zeros --> 1.8 GB per MPI rank assuming double precision. The indices for the AIJ could amount to another 0.3 GB (assuming 32-bit integers).
>>> >>>>>
>>> >>>>> * You use 5 levels of coarsening, so the other operators should represent (collectively)
>>> >>>>> 2.1/8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300 MB per MPI rank on the communicator with 18432 ranks.
>>> >>>>> The coarse grid should consume ~ 0.5 MB per MPI rank on the communicator with 18432 ranks.
>>> >>>>> * You use a reduction factor of 64, making the new communicator with 288 MPI ranks.
>>> >>>>> PCTelescope will first gather a temporary matrix associated with your coarse level operator, assuming a comm size of 288, living on the comm with size 18432. This matrix will require approximately 0.5 * 64 = 32 MB per core on the 288 ranks. This matrix is then used to form a new MPIAIJ matrix on the subcomm, thus requiring another 32 MB per rank. The temporary matrix is then destroyed.
>>> >>>>>
>>> >>>>> * Because a DMDA is detected, a permutation matrix is assembled. This requires 2 doubles per point in the DMDA. Your coarse DMDA contains 92 x 16 x 48 points. Thus the permutation matrix will require < 1 MB per MPI rank on the sub-comm.
>>> >>>>>
>>> >>>>> * Lastly, the matrix is permuted. This uses MatPtAP(), but the resulting operator will have the same memory footprint as the unpermuted matrix (32 MB). At any stage in PCTelescope, only 2 operators of size 32 MB are held in memory when the DMDA is provided.
>>> >>>>>
>>> >>>>> From my rough estimates, the worst-case memory footprint for any given core, given your options, is approximately
>>> >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB
>>> >>>>> This is way below 8 GB.
>>> >>>>>
>>> >>>>> Note this estimate completely ignores:
>>> >>>>> (1) the memory required for the restriction operator,
>>> >>>>> (2) the potential growth in the number of non-zeros per row due to Galerkin coarsening (I wish -ksp_view_pre reported the output from MatView so we could see the number of non-zeros required by the coarse level operators),
>>> >>>>> (3) all temporary vectors required by the CG solver, and those required by the smoothers,
>>> >>>>> (4) internal memory allocated by MatPtAP,
>>> >>>>> (5) memory associated with the IS's used within PCTelescope.
>>> >>>>>
>>> >>>>> So either I am completely off in my estimates, or you have not carefully estimated the memory usage of your application code. Hopefully others might examine/correct my rough estimates.
>>> >>>>>
>>> >>>>> Since I don't have your code I cannot assess the latter. Since I don't have access to the same machine you are running on, I think we need to take a step back.
>>> >>>>>
>>> >>>>> [1] What machine are you running on? Send me a URL if it's available.
>>> >>>>>
>>> >>>>> [2] What discretization are you using? (I am guessing a scalar 7-point FD stencil.)
>>> >>>>> If it's a 7-point FD stencil, we should be able to examine the memory usage of your solver configuration using a standard, lightweight existing PETSc example, run on your machine at the same scale. This would hopefully enable us to correctly evaluate the actual memory usage required by the solver configuration you are using.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Dave
>>> >>>>>
>>> >>>>> Frank
>>> >>>>>
>>> >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>> >>>>>> On Saturday, 9 July 2016, frank <hengj...@uci.edu> wrote:
>>> >>>>>> Hi Barry and Dave,
>>> >>>>>>
>>> >>>>>> Thank both of you for the advice.
>>> >>>>>>
>>> >>>>>> @Barry
>>> >>>>>> I made a mistake in the file names in the last email. I attached the correct files this time.
>>> >>>>>> For all three tests, 'Telescope' is used as the coarse preconditioner.
>>> >>>>>>
>>> >>>>>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12
>>> >>>>>> Part of the memory usage: Vector 125 124 3971904 0. Matrix 101 101 9462372 0.
>>> >>>>>>
>>> >>>>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24
>>> >>>>>> Part of the memory usage: Vector 125 124 681672 0. Matrix 101 101 1462180 0.
>>> >>>>>>
>>> >>>>>> In theory, the memory usage in Test1 should be 8 times that of Test2. In my case, it is about 6 times.
>>> >>>>>>
>>> >>>>>> == Test3: Grid: 3072*256*768, Process Mesh: 96*8*24. Sub-domain per process: 32*32*32
>>> >>>>>> Here I get the out of memory error.
>>> >>>>>>
>>> >>>>>> I tried to use -mg_coarse jacobi. In this way, I don't need to set -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly, right? The linear solver didn't work in this case. PETSc output some errors.
>>> >>>>>>
>>> >>>>>> @Dave
>>> >>>>>> In test3, I use only one instance of 'Telescope'. On the coarse mesh of 'Telescope', I used LU as the preconditioner instead of SVD. If I set the levels correctly, then on the last coarse mesh of MG, where it calls 'Telescope', the sub-domain per process is 2*2*2. On the last coarse mesh of 'Telescope', there is only one grid point per process. I still got the OOM error. The detailed petsc option file is attached.
>>> >>>>>>
>>> >>>>>> Do you understand the expected memory usage for the particular parallel LU implementation you are using? I don't (seriously). Replace LU with bjacobi and re-run this test. My point about solver debugging is still valid.
>>> >>>>>>
>>> >>>>>> And please send the result of KSPView so we can see what is actually used in the computations.
>>> >>>>>>
>>> >>>>>> Thanks
>>> >>>>>> Dave
>>> >>>>>>
>>> >>>>>> Thank you so much.
>>> >>>>>>
>>> >>>>>> Frank
>>> >>>>>>
>>> >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>> >>>>>> On Jul 6, 2016, at 4:19 PM, frank <hengj...@uci.edu> wrote:
>>> >>>>>>
>>> >>>>>> Hi Barry,
>>> >>>>>>
>>> >>>>>> Thank you for your advice.
>>> >>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768 and the process mesh is 96*8*24. The linear solver is 'cg', the preconditioner is 'mg', and 'telescope' is used as the preconditioner on the coarse mesh. The system gives me the "Out of Memory" error before the linear system is completely solved. The info from '-ksp_view_pre' is attached. It seems to me that the error occurs when it reaches the coarse mesh.
>>> >>>>>>
>>> >>>>>> The 2nd test uses a grid of 1536*128*384 and the process mesh is 96*8*24. The 3rd test uses the same grid but a different process mesh, 48*4*12.
>>> >>>>>> Are you sure this is right? The total matrix and vector memory usage goes from the 2nd test
>>> >>>>>> Vector 384 383 8,193,712 0.
>>> >>>>>> Matrix 103 103 11,508,688 0.
>>> >>>>>> to the 3rd test
>>> >>>>>> Vector 384 383 1,590,520 0.
>>> >>>>>> Matrix 103 103 3,508,664 0.
>>> >>>>>> that is, the memory usage got smaller, but if you have only 1/8th the processes and the same grid it should have gotten about 8 times bigger. Did you maybe cut the grid by a factor of 8 also? If so, that still doesn't explain it, because the memory usage changed by a factor of 5-something for the vectors and 3-something for the matrices.
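For reference, the two factors Barry mentions work out, from the numbers quoted above, to:

    vectors:   8,193,712 / 1,590,520 ~ 5.2
    matrices: 11,508,688 / 3,508,664 ~ 3.3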
>>> >>>>>> The linear solver and petsc options in the 2nd and 3rd tests are the same as in the 1st test. The linear solver works fine in both tests.
>>> >>>>>> I attached the memory usage of the 2nd and 3rd tests. The memory info is from the option '-log_summary'. I tried to use '-memory_info' as you suggested, but in my case PETSc treated it as an unused option. It output nothing about the memory. Do I need to add something to my code so I can use '-memory_info'?
>>> >>>>>>
>>> >>>>>> Sorry, my mistake: the option is -memory_view
>>> >>>>>>
>>> >>>>>> Can you run the one case with -memory_view and -mg_coarse jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to see how much memory is used without the telescope? Also run case 2 the same way.
>>> >>>>>>
>>> >>>>>> Barry
>>> >>>>>>
>>> >>>>>> In both tests the memory usage is not large.
>>> >>>>>>
>>> >>>>>> It seems to me that it might be the 'telescope' preconditioner that allocated a lot of memory and caused the error in the 1st test. Is there a way to show how much memory it allocated?
>>> >>>>>>
>>> >>>>>> Frank
>>> >>>>>>
>>> >>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>> >>>>>> Frank,
>>> >>>>>>
>>> >>>>>> You can run with -ksp_view_pre to have it "view" the KSP before the solve, so hopefully it gets that far.
>>> >>>>>>
>>> >>>>>> Please run the problem that does fit with -memory_info; when the problem completes it will show the "high water mark" for PETSc allocated memory and total memory used. We first want to look at these numbers to see if it is using more memory than you expect. You could also run with, say, half the grid spacing to see how the memory usage scales with the increase in grid points. Make the runs also with -log_view and send all the output from these options.
>>> >>>>>>
>>> >>>>>> Barry
>>> >>>>>>
>>> >>>>>> On Jul 5, 2016, at 5:23 PM, frank <hengj...@uci.edu> wrote:
>>> >>>>>>
>>> >>>>>> Hi,
>>> >>>>>>
>>> >>>>>> I am using the CG ksp solver and multigrid preconditioner to solve a linear system in parallel. I chose to use 'Telescope' as the preconditioner on the coarse mesh for its good performance. The petsc options file is attached.
>>> >>>>>>
>>> >>>>>> The domain is a 3D box. It works well when the grid is 1536*128*384 and the process mesh is 96*8*24. When I double the size of the grid and keep the same process mesh and petsc options, I get an "out of memory" error from the super-cluster I am using. Each process has access to at least 8G memory, which should be more than enough for my application. I am sure that all the other parts of my code (except the linear solver) do not use much memory. So I suspect there is something wrong with the linear solver. The error occurs before the linear system is completely solved, so I don't have the info from ksp_view. I am not able to reproduce the error with a smaller problem either. In addition, I tried to use block jacobi as the preconditioner with the same grid and the same decomposition.
>>> >>>>>> The linear solver runs extremely slowly, but there is no memory error.
>>> >>>>>>
>>> >>>>>> How can I diagnose what exactly causes the error?
>>> >>>>>> Thank you so much.
>>> >>>>>>
>>> >>>>>> Frank
>>> >>>>>> <petsc_options.txt>
>>> >>>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>> >>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
> <test_ksp.F90>
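For reference, an options file with the overall structure discussed in this thread (CG + geometric MG, with PCTelescope wrapping the coarse level) might look roughly like the sketch below. This is not the file Frank attached: the reduction factor comes from Dave's estimate, the level count is illustrative, the -mg_coarse_telescope_* entries are the four options reported as unused above, and the remaining options are the ones suggested by Barry and Dave.

    # outer solve: CG + geometric multigrid
    -ksp_type cg
    -pc_type mg
    -pc_mg_levels 5                               # illustrative; not confirmed in the thread
    # coarse level: gather onto a smaller communicator with PCTelescope
    -mg_coarse_pc_type telescope
    -mg_coarse_pc_telescope_reduction_factor 64
    # solver run by telescope on the reduced communicator
    -mg_coarse_telescope_pc_type mg
    -mg_coarse_telescope_mg_coarse_ksp_type preonly
    -mg_coarse_telescope_mg_coarse_pc_type bjacobi
    -mg_coarse_telescope_mg_levels_ksp_type richardson
    -mg_coarse_telescope_mg_levels_ksp_max_it 1
    # memory-scalable PtAP (Dave's suggestion) and diagnostics
    -matrap 0
    -matptap_scalable
    -memory_view
    -log_view
    -ksp_view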