Hi Barry,

I checked. On the supercomputer, I had the option "-ksp_view_pre", but it is not in the file I sent you. I am sorry for the confusion.
Regards,
Frank

On Friday, September 9, 2016, Barry Smith <bsm...@mcs.anl.gov> wrote:
>
>> On Sep 9, 2016, at 3:11 PM, frank <hengj...@uci.edu> wrote:
>>
>> Hi Barry,
>>
>> I think the first KSP view output is from -ksp_view_pre. Before I submitted the test, I was not sure whether there would be an OOM error or not, so I added both -ksp_view_pre and -ksp_view.
>
>    But the options file you sent specifically does NOT list -ksp_view_pre, so how could it be from that?
>
>    Sorry to be pedantic, but I've spent too much time in the past trying to debug from incorrect information and want to make sure that the information I have is correct before thinking. Please recheck exactly what happened. Rerun with the exact input file you emailed if that is needed.
>
>    Barry
>
>> Frank
>>
>> On 09/09/2016 12:38 PM, Barry Smith wrote:
>>> Why does ksp_view2.txt have two KSP views in it while ksp_view1.txt has only one KSP view in it? Did you run two different solves in the second case but not the first?
>>>
>>>    Barry
>>>
>>>> On Sep 9, 2016, at 10:56 AM, frank <hengj...@uci.edu> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I want to continue digging into the memory problem here.
>>>> I did find a workaround in the past, which is to use fewer cores per node so that each core has 8 GB of memory. However, this is inefficient and expensive, so I hope to locate the place that uses the most memory.
>>>>
>>>> Here is a brief summary of the tests I did in the past:
>>>>
>>>> Test1: Mesh 1536*128*384 | Process mesh 48*4*12
>>>> Maximum (over computational time) process memory:        total 7.0727e+08
>>>> Current process memory:                                  total 7.0727e+08
>>>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11
>>>> Current space PetscMalloc()ed:                           total 1.8275e+09
>>>>
>>>> Test2: Mesh 1536*128*384 | Process mesh 96*8*24
>>>> Maximum (over computational time) process memory:        total 5.9431e+09
>>>> Current process memory:                                  total 5.9431e+09
>>>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12
>>>> Current space PetscMalloc()ed:                           total 5.4844e+09
>>>>
>>>> Test3: Mesh 3072*256*768 | Process mesh 96*8*24
>>>> The OOM (Out Of Memory) killer of the supercomputer terminated the job during "KSPSolve".
>>>>
>>>> I attached the output of ksp_view (the third test's output is from ksp_view_pre), memory_view, and also the petsc options.
>>>>
>>>> In all the tests, each core can access about 2 GB of memory. In Test3 there are 4223139840 non-zeros in the matrix, which will consume about 1.74 MB per process in double precision. Even considering some extra memory used to store the integer indices, 2 GB should still be far more than enough.
>>>>
>>>> Is there a way to find out which part of KSPSolve uses the most memory?
>>>> Thank you so much.
>>>>
>>>> BTW, there are 4 options that remain unused and I don't understand why they are omitted:
>>>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly
>>>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
>>>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>>>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson
>>>>
>>>> Regards,
>>>> Frank
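One way to approach Frank's question about which part of KSPSolve uses the most memory is to bracket the solve (or any suspect phase) with PETSc's memory-query routines. Below is a minimal sketch, not a definitive recipe: the KSP "ksp" and vectors "b" and "x" stand for whatever the application already has, and it simply localizes to the solve what -memory_view reports only once, at PetscFinalize().

    #include <petscksp.h>

    PetscErrorCode ierr;
    PetscLogDouble rss0, rss1, mal0, mal1;

    /* Call once after PetscInitialize() so that
       PetscMemoryGetMaximumUsage() records a high-water mark. */
    ierr = PetscMemorySetGetMaximumUsage();CHKERRQ(ierr);

    ierr = PetscMemoryGetCurrentUsage(&rss0);CHKERRQ(ierr);  /* process resident memory */
    ierr = PetscMallocGetCurrentUsage(&mal0);CHKERRQ(ierr);  /* bytes obtained via PetscMalloc() */

    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

    ierr = PetscMemoryGetCurrentUsage(&rss1);CHKERRQ(ierr);
    ierr = PetscMallocGetCurrentUsage(&mal1);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD,
                       "KSPSolve: process memory grew %g bytes, PetscMalloc()ed space grew %g bytes\n",
                       rss1 - rss0, mal1 - mal0);CHKERRQ(ierr);

Moving the same pair of probes around KSPSetUp() versus the solve itself would separate setup memory (where the Galerkin coarse operators are built) from iteration memory.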
>>>>
>>>> On 07/13/2016 05:47 PM, Dave May wrote:
>>>>>
>>>>> On 14 July 2016 at 01:07, frank <hengj...@uci.edu> wrote:
>>>>> Hi Dave,
>>>>>
>>>>> Sorry for the late reply.
>>>>> Thank you so much for your detailed reply.
>>>>>
>>>>> I have a question about the estimation of the memory usage. There are 4223139840 allocated non-zeros and 18432 MPI processes. Double precision is used. So the memory per process is:
>>>>> 4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74M?
>>>>> Did I do something wrong here? Because this seems too small.
>>>>>
>>>>> No - I totally f***ed it up. You are correct. That'll teach me for fumbling around with my iPhone calculator and not using my brain. (Note that to convert to MB just divide by 1e6, not 1024^2 - although I apparently cannot convert between units correctly....)
>>>>>
>>>>> From the PETSc objects associated with the solver, it looks like it _should_ run with 2 GB per MPI rank. Sorry for my mistake. Possibilities are: somewhere in your usage of PETSc you've introduced a memory leak; PETSc is doing a huge over-allocation (e.g. as per our discussion of MatPtAP); or in your application code there are other objects you have forgotten to log the memory for.
>>>>>
>>>>> I am running this job on Blue Waters.
>>>>> I am using the 7-point FD stencil in 3D.
>>>>>
>>>>> I thought so on both counts.
>>>>>
>>>>> I apologize that I made a stupid mistake in computing the memory per core. With my settings, each core can access only 2 GB of memory on average, instead of the 8 GB I mentioned in my previous email. I re-ran the job with 8 GB of memory per core on average and there was no "Out Of Memory" error. I will do more tests to see if there is still a memory issue.
>>>>>
>>>>> Ok. I'd still like to know where the memory was being used since my estimates were off.
>>>>>
>>>>> Thanks,
>>>>> Dave
>>>>>
>>>>> Regards,
>>>>> Frank
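For reference, the arithmetic in this exchange checks out both ways; a few lines of C reproduce the figures (the constants are the non-zero and rank counts quoted above):

    #include <stdio.h>

    int main(void) {
      const double nnz   = 4223139840.0;       /* allocated non-zeros, whole matrix */
      const double ranks = 18432.0;            /* MPI processes: 96*8*24 */
      const double bytes = nnz * 8.0 / ranks;  /* 8-byte doubles, values only */
      printf("%.2f MiB per rank\n", bytes / (1024.0 * 1024.0)); /* ~1.75, dividing by 1024^2 */
      printf("%.2f MB per rank\n", bytes / 1.0e6);              /* ~1.83, dividing by 1e6 */
      /* AIJ also stores one 32-bit column index per non-zero: */
      printf("%.2f MB per rank for indices\n", nnz * 4.0 / ranks / 1.0e6); /* ~0.92 */
      return 0;
    }

So the fine-grid matrix itself is only a few MB per rank at this scale; whatever exhausts the 2 GB must come from elsewhere.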
>>>>>
>>>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>>>>> Hi Frank,
>>>>>>
>>>>>> On 11 July 2016 at 19:14, frank <hengj...@uci.edu> wrote:
>>>>>> Hi Dave,
>>>>>>
>>>>>> I re-ran the test using bjacobi as the preconditioner on the coarse mesh of telescope. The grid is 3072*256*768 and the process mesh is 96*8*24. The petsc options file is attached.
>>>>>> I still got the "Out Of Memory" error. The error occurred before the linear solver finished one step, so I don't have the full info from ksp_view. The info from ksp_view_pre is attached.
>>>>>>
>>>>>> Okay - that is essentially useless (sorry).
>>>>>>
>>>>>> It seems to me that the error occurred when the decomposition was going to be changed.
>>>>>>
>>>>>> Based on what information?
>>>>>> Running with -info would give us more clues, but will create a ton of output.
>>>>>> Please try running the case which failed with -info.
>>>>>>
>>>>>> I had another test with a grid of 1536*128*384 and the same process mesh as above. There was no error. The ksp_view info is attached for comparison.
>>>>>> Thank you.
>>>>>>
>>>>>> [3] Here is my crude estimate of your memory usage.
>>>>>> I'll target the biggest memory hogs only, to get an order-of-magnitude estimate.
>>>>>>
>>>>>> * The fine-grid operator contains 4223139840 non-zeros --> 1.8 GB per MPI rank assuming double precision.
>>>>>> The indices for the AIJ could amount to another 0.3 GB (assuming 32-bit integers).
>>>>>>
>>>>>> * You use 5 levels of coarsening, so the other operators should represent (collectively)
>>>>>> 2.1/8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300 MB per MPI rank on the communicator with 18432 ranks.
>>>>>> The coarse grid should consume ~ 0.5 MB per MPI rank on the communicator with 18432 ranks.
>>>>>>
>>>>>> * You use a reduction factor of 64, making the new communicator have 288 MPI ranks.
>>>>>> PCTelescope will first gather a temporary matrix associated with your coarse-level operator, assuming a comm size of 288, living on the comm with size 18432.
>>>>>> This matrix will require approximately 0.5 * 64 = 32 MB per core on the 288 ranks.
>>>>>> This matrix is then used to form a new MPIAIJ matrix on the subcomm, thus requiring another 32 MB per rank.
>>>>>> The temporary matrix is then destroyed.
>>>>>>
>>>>>> * Because a DMDA is detected, a permutation matrix is assembled.
>>>>>> This requires 2 doubles per point in the DMDA.
>>>>>> Your coarse DMDA contains 92 x 16 x 48 points.
>>>>>> Thus the permutation matrix will require < 1 MB per MPI rank on the sub-comm.
>>>>>>
>>>>>> * Lastly, the matrix is permuted. This uses MatPtAP(), but the resulting operator will have the same memory footprint as the unpermuted matrix (32 MB). At any stage in PCTelescope, only 2 operators of size 32 MB are held in memory when the DMDA is provided.
>>>>>>
>>>>>> From my rough estimates, the worst-case memory footprint for any given core, given your options, is approximately
>>>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB
>>>>>> This is way below 8 GB.
>>>>>>
>>>>>> Note this estimate completely ignores:
>>>>>> (1) the memory required for the restriction operator,
>>>>>> (2) the potential growth in the number of non-zeros per row due to Galerkin coarsening (I wish -ksp_view_pre reported the output from MatView so we could see the number of non-zeros required by the coarse-level operators),
>>>>>> (3) all temporary vectors required by the CG solver, and those required by the smoothers,
>>>>>> (4) internal memory allocated by MatPtAP,
>>>>>> (5) memory associated with the ISs used within PCTelescope.
>>>>>>
>>>>>> So either I am completely off in my estimates, or you have not carefully estimated the memory usage of your application code. Hopefully others might examine/correct my rough estimates.
>>>>>>
>>>>>> Since I don't have your code, I cannot assess the latter.
>>>>>> Since I don't have access to the same machine you are running on, I think we need to take a step back.
>>>>>>
>>>>>> [1] What machine are you running on? Send me a URL if it's available.
>>>>>>
>>>>>> [2] What discretization are you using? (I am guessing a scalar 7-point FD stencil.)
>>>>>> If it's a 7-point FD stencil, we should be able to examine the memory usage of your solver configuration using a standard, lightweight, existing PETSc example, run on your machine at the same scale.
>>>>>> This would hopefully enable us to correctly evaluate the actual memory usage required by the solver configuration you are using.
>>>>>>
>>>>>> Thanks,
>>>>>> Dave
>>>>>>
>>>>>> Frank
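Dave's running total can be checked mechanically; the sketch below just encodes the per-rank figures from his walk-through (all values are his estimates, in MB):

    #include <stdio.h>

    int main(void) {
      /* Per-rank worst-case figures from the estimate above, in MB. */
      const double fine_values  = 1800.0; /* fine-grid non-zeros in double precision */
      const double fine_indices = 300.0;  /* 32-bit AIJ indices */
      const double mg_levels    = 300.0;  /* 5 coarsened levels: ~2.1 GB * (1/8 + 1/8^2 + ...) */
      const double gathered     = 32.0;   /* telescope gather: 0.5 MB * reduction factor 64 */
      const double subcomm_mat  = 32.0;   /* re-assembled MPIAIJ on the 288-rank subcomm */
      const double permutation  = 1.0;    /* DMDA permutation matrix, < 1 MB */
      printf("worst case: %.0f MB per rank\n", fine_values + fine_indices
             + mg_levels + gathered + subcomm_mat + permutation); /* prints 2465 MB */
      return 0;
    }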
>>>>>>
>>>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>>>>>>
>>>>>>> On Saturday, 9 July 2016, frank <hengj...@uci.edu> wrote:
>>>>>>> Hi Barry and Dave,
>>>>>>>
>>>>>>> Thank you both for the advice.
>>>>>>>
>>>>>>> @Barry
>>>>>>> I made a mistake in the file names in my last email. I attached the correct files this time.
>>>>>>> For all three tests, 'Telescope' is used as the coarse preconditioner.
>>>>>>>
>>>>>>> == Test1: Grid: 1536*128*384, Process mesh: 48*4*12
>>>>>>> Part of the memory usage:
>>>>>>> Vector   125   124   3971904   0.
>>>>>>> Matrix   101   101   9462372   0.
>>>>>>>
>>>>>>> == Test2: Grid: 1536*128*384, Process mesh: 96*8*24
>>>>>>> Part of the memory usage:
>>>>>>> Vector   125   124   681672    0.
>>>>>>> Matrix   101   101   1462180   0.
>>>>>>>
>>>>>>> In theory, the memory usage in Test1 should be 8 times that of Test2. In my case, it is about 6 times.
>>>>>>>
>>>>>>> == Test3: Grid: 3072*256*768, Process mesh: 96*8*24. Sub-domain per process: 32*32*32
>>>>>>> Here I get the out-of-memory error.
>>>>>>>
>>>>>>> I tried to use -mg_coarse jacobi. In this way, I don't need to set -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly, right?
>>>>>>> The linear solver didn't work in this case; PETSc output some errors.
>>>>>>>
>>>>>>> @Dave
>>>>>>> In Test3 I use only one instance of 'Telescope'. On the coarse mesh of 'Telescope', I used LU as the preconditioner instead of SVD.
>>>>>>> If I set the levels correctly, then on the last coarse mesh of MG, where it calls 'Telescope', the sub-domain per process is 2*2*2.
>>>>>>> On the last coarse mesh of 'Telescope', there is only one grid point per process.
>>>>>>> I still got the OOM error. The detailed petsc options file is attached.
>>>>>>>
>>>>>>> Do you understand the expected memory usage for the particular parallel LU implementation you are using? I don't (seriously). Replace LU with bjacobi and re-run this test. My point about solver debugging is still valid.
>>>>>>>
>>>>>>> And please send the result of KSPView so we can see what is actually used in the computations.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dave
>>>>>>>
>>>>>>> Thank you so much.
>>>>>>>
>>>>>>> Frank
>>>>>>>
>>>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>>>>>> On Jul 6, 2016, at 4:19 PM, frank <hengj...@uci.edu> wrote:
>>>>>>>
>>>>>>> Hi Barry,
>>>>>>>
>>>>>>> Thank you for your advice.
>>>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768 and the process mesh is 96*8*24.
>>>>>>> The linear solver is 'cg', the preconditioner is 'mg', and 'telescope' is used as the preconditioner on the coarse mesh.
>>>>>>> The system gives me the "Out of Memory" error before the linear system is completely solved.
>>>>>>> The info from '-ksp_view_pre' is attached. It seems to me that the error occurs when it reaches the coarse mesh.
>>>>>>>
>>>>>>> The 2nd test uses a grid of 1536*128*384 and a process mesh of 96*8*24. The 3rd test uses the same grid but a different process mesh, 48*4*12.
>>>>>>>
>>>>>>> Are you sure this is right? The total matrix and vector memory usage goes from the 2nd test
>>>>>>> Vector   384   383   8,193,712    0.
>>>>>>> Matrix   103   103   11,508,688   0.
>>>>>>> to the 3rd test
>>>>>>> Vector   384   383   1,590,520   0.
>>>>>>> Matrix   103   103   3,508,664   0.
>>>>>>> That is, the memory usage got smaller, but if you have only 1/8th the processes and the same grid it should have gotten about 8 times bigger. Did you maybe cut the grid by a factor of 8 also? If so, that still doesn't explain it, because the memory usage changed by a factor of 5-something for the vectors and 3-something for the matrices.
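The factors Barry quotes follow directly from the byte counts above: with the same grid on 1/8th as many processes, each rank holds 8 times as many grid points, so the 3rd test's per-rank usage should be about 8 times larger than the 2nd test's, not roughly 5 and 3 times smaller. For reference:

    #include <stdio.h>

    int main(void) {
      /* Byte counts from the -log_summary output quoted above. */
      printf("vector ratio, 2nd/3rd test: %.2f\n", 8193712.0 / 1590520.0);  /* ~5.15 */
      printf("matrix ratio, 2nd/3rd test: %.2f\n", 11508688.0 / 3508664.0); /* ~3.28 */
      return 0;
    }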
>>>>>>>
>>>>>>> The linear solver and petsc options in the 2nd and 3rd tests are the same as in the 1st test. The linear solver works fine in both tests.
>>>>>>> I attached the memory usage of the 2nd and 3rd tests. The memory info is from the option '-log_summary'. I tried to use '-memory_info' as you suggested, but in my case PETSc treated it as an unused option and output nothing about the memory. Do I need to add something to my code so I can use '-memory_info'?
>>>>>>>
>>>>>>> Sorry, my mistake: the option is -memory_view.
>>>>>>>
>>>>>>> Can you run the one case with -memory_view and -mg_coarse jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to see how much memory is used without the telescope? Also run case 2 the same way.
>>>>>>>
>>>>>>> Barry
>>>>>>>
>>>>>>> In both tests the memory usage is not large.
>>>>>>>
>>>>>>> It seems to me that it might be the 'telescope' preconditioner that allocated a lot of memory and caused the error in the 1st test.
>>>>>>> Is there a way to show how much memory it allocated?
>>>>>>>
>>>>>>> Frank
>>>>>>>
>>>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>>>>>> Frank,
>>>>>>>
>>>>>>> You can run with -ksp_view_pre to have it "view" the KSP before the solve, so hopefully it gets that far.
>>>>>>>
>>>>>>> Please run the problem that does fit with -memory_info; when the problem completes it will show the "high water mark" for PETSc allocated memory and total memory used. We first want to look at these numbers to see if it is using more memory than you expect. You could also run with, say, half the grid spacing to see how the memory usage scales with the increase in grid points. Make the runs also with -log_view and send all the output from these options.
>>>>>>>
>>>>>>> Barry
>>>>>>>
>>>>>>> On Jul 5, 2016, at 5:23 PM, frank <hengj...@uci.edu> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am using the CG ksp solver and the multigrid preconditioner to solve a linear system in parallel.
>>>>>>> I chose to use 'Telescope' as the preconditioner on the coarse mesh for its good performance.
>>>>>>> The petsc options file is attached.
>>>>>>>
>>>>>>> The domain is a 3D box.
>>>>>>> It works well when the grid is 1536*128*384 and the process mesh is 96*8*24. When I double the size of the grid and keep the same process mesh and petsc options, I get an "out of memory" error from the super-cluster I am using.
>>>>>>> Each process has access to at least 8 GB of memory, which should be more than enough for my application. I am sure that all the other parts of my code (except the linear solver) do not use much memory, so I suspect there is something wrong with the linear solver.
>>>>>>> The error occurs before the linear system is completely solved, so I don't have the info from ksp_view. I am not able to reproduce the error with a smaller problem either.
>>>>>>> In addition, I tried to use block Jacobi as the preconditioner with the same grid and the same decomposition. The linear solver runs extremely slowly, but there is no memory error.
>>>>>>>
>>>>>>> How can I diagnose what exactly causes the error?
>>>>>>> Thank you so much.
>>>>>>>
>>>>>>> Frank
>>>>>>> <petsc_options.txt>
>>>>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>>>
>>>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
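Putting Barry's suggestion together, the diagnostic run he asks for would use an options file along these lines. This is only a sketch: the cg/mg lines stand in for whatever the real petsc_options.txt already contains, with the telescope coarse solve replaced by plain Jacobi and the iteration count capped:

    -ksp_type cg
    -pc_type mg
    -mg_coarse_ksp_type preonly
    -mg_coarse_pc_type jacobi
    -ksp_max_it 1
    -ksp_view_pre
    -memory_view
    -log_view

Comparing the -memory_view high-water marks of this run against the telescope run should show how much of the memory is attributable to the telescope setup.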