On Friday, 7 October 2016, frank <[email protected]> wrote:
> Dear all,
>
> Thank you so much for the advice.
>
> All setup is done in the first solve.
>
>
>> ** The time for 1st solve does not scale.
>> In practice, I am solving a variable coefficient Poisson equation. I
>> need to build the matrix every time step. Therefore, each step is similar
>> to the 1st solve which does not scale. Is there a way I can improve the
>> performance?
>>
>
>> You could use rediscretization instead of Galerkin to produce the coarse
>> operators.
>>
>
> Yes I can think of one option for improved performance, but I cannot tell
> whether it will be beneficial because the logging isn't sufficiently fine
> grained (and there is no easy way to get the info out of petsc).
>
> I use PtAP to repartition the matrix; this could be consuming most of the
> setup time in Telescope with your run. Such a repartitioning could be avoided
> if you provided a method to create the operator on the coarse levels (what
> Matt is suggesting). However, this requires you to be able to define your
> coefficients on the coarse grid. This will most likely reduce setup time,
> but your coarse grid operators (now re-discretized) are likely to be less
> effective than those generated via Galerkin coarsening.
>
>
> Please correct me if I understand this incorrectly: I can define my own
> restriction function and pass it to petsc instead of using PtAP.
> If so, what's the interface to do that?
>
You need to provide a method via KSPSetComputeOperators to your outer KSP

http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/KSP/KSPSetComputeOperators.html

This method will get propagated through telescope to the KSP running in the sub-comm.

Note that this functionality is currently not supported for Fortran. I need to make a small modification to telescope to enable Fortran support.

Thanks
Dave

>
>
> Also, you use CG/MG when FMG by itself would probably be faster. Your
>> smoother is likely not strong enough, and you
>> should use something like V(2,2). There is a lot of tuning that is
>> possible, but difficult to automate.
>>
>
> Matt's completely correct.
> If we could automate this in a meaningful manner, we would have done so.
>
>
> I am not as familiar with multigrid as you guys. It would be very kind if
> you could be more specific.
> What does V(2,2) stand for? Is there some strong smoother built into petsc
> that I can try?
>
>
> Another thing, the vector assembly and scatter take more time as I
> increased the cores#:
>
> cores#            count  4096      8192      16384     32768     65536
> VecAssemblyBegin  298    2.91E+00  2.87E+00  8.59E+00  2.75E+01  2.21E+03
> VecAssemblyEnd    298    3.37E-03  1.78E-03  1.78E-03  5.13E-03  1.99E-03
> VecScatterBegin   76303  3.82E+00  3.01E+00  2.54E+00  4.40E+00  1.32E+00
> VecScatterEnd     76303  3.09E+01  1.47E+01  2.23E+01  2.96E+01  2.10E+01
>
> The above data is produced by solving a constant coefficient Poisson
> equation with different rhs for 100 steps.
> As you can see, the time of VecAssemblyBegin increases dramatically from
> 32K cores to 65K.
> With 65K cores, it took more time to assemble the rhs than to solve the
> equation. Is there a way to improve this?
>
>
> Thank you.
>
> Regards,
> Frank
>
>>>
>>> On 10/04/2016 12:56 PM, Dave May wrote:
>>>
>>> On Tuesday, 4 October 2016, frank <[email protected]> wrote:
>>>
>>>> Hi,
>>>> This question is a follow-up of the thread "Question about memory usage
>>>> in Multigrid preconditioner".
>>>> I used to have the "Out of Memory (OOM)" problem when using the
>>>> CG+Telescope MG solver with 32768 cores. Adding the "-matrap 0;
>>>> -matptap_scalable" options did solve that problem.
>>>>
>>>> Then I tested the scalability by solving a 3d Poisson eqn for 1 step. I
>>>> used one sub-communicator in all the tests. The differences between the
>>>> petsc options in those tests are: 1 the pc_telescope_reduction_factor; 2
>>>> the number of multigrid levels in the up/down solver. The function
>>>> "ksp_solve" is timed. It is kind of slow and doesn't scale at all.
>>>>
>>>> Test1: 512^3 grid points
>>>> Core#   telescope_reduction_factor   MG levels# for up/down solver   Time for KSPSolve (s)
>>>> 512     8                            4 / 3                           6.2466
>>>> 4096    64                           5 / 3                           0.9361
>>>> 32768   64                           4 / 3                           4.8914
>>>>
>>>> Test2: 1024^3 grid points
>>>> Core#   telescope_reduction_factor   MG levels# for up/down solver   Time for KSPSolve (s)
>>>> 4096    64                           5 / 4                           3.4139
>>>> 8192    128                          5 / 4                           2.4196
>>>> 16384   32                           5 / 3                           5.4150
>>>> 32768   64                           5 / 3                           5.6067
>>>> 65536   128                          5 / 3                           6.5219
>>>>
>>>
>>> You have to be very careful how you interpret these numbers. Your solver
>>> contains nested calls to KSPSolve, and unfortunately as a result the
>>> numbers you report include setup time. This will remain true even if you
>>> call KSPSetUp on the outermost KSP.
>>>
>>> Your email concerns scalability of the solver application, so let's
>>> focus on that issue.
>>>
>>> The only way to clearly separate setup from solve time is to perform two
>>> identical solves. The second solve will not require any setup.
>>> You should monitor the second solve via a new PetscLogStage.
>>>
>>> This was what I did in the telescope paper. It was the only way to
>>> understand the setup cost (and scaling) cf the solve time (and scaling).
>>>
>>> Thanks
>>> Dave
>>>
>>>> I guess I didn't set the MG levels properly. What would be the
>>>> efficient way to arrange the MG levels?
>>>> Also which preconditioner at the coarse mesh of the 2nd communicator
>>>> should I use to improve the performance?
>>>>
>>>> I attached the test code and the petsc options file for the 1024^3 cube
>>>> with 32768 cores.
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Frank
>>>>
>>>> On 09/15/2016 03:35 AM, Dave May wrote:
>>>>
>>>> Hi all,
>>>>
>>>> The only unexpected memory usage I can see is associated with the
>>>> call to MatPtAP().
>>>> Here is something you can try immediately.
>>>> Run your code with the additional options
>>>> -matrap 0 -matptap_scalable
>>>>
>>>> I didn't realize this before, but the default behaviour of MatPtAP in
>>>> parallel is actually to explicitly form the transpose of P (e.g.
>>>> assemble R = P^T) and then compute R.A.P.
>>>> You don't want to do this. The option -matrap 0 resolves this issue.
>>>>
>>>> The implementation of P^T.A.P has two variants.
>>>> The scalable implementation (with respect to memory usage) is selected
>>>> via the second option -matptap_scalable.
>>>>
>>>> Try it out - I see a significant memory reduction using these options
>>>> for particular mesh sizes / partitions.
>>>>
>>>> I've attached a cleaned up version of the code you sent me.
>>>> There were a number of memory leaks and other issues.
>>>> The main points being
>>>> * You should call DMDAVecGetArrayF90() before VecAssembly{Begin,End}
>>>> * You should call PetscFinalize(), otherwise the option -log_summary
>>>> (-log_view) will not display anything once the program has completed.
>>>> >>>> >>>> Thanks, >>>> Dave >>>> >>>> >>>> On 15 September 2016 at 08:03, Hengjie Wang <[email protected]> wrote: >>>> >>>>> Hi Dave, >>>>> >>>>> Sorry, I should have put more comment to explain the code. >>>>> The number of process in each dimension is the same: Px = Py=Pz=P. So >>>>> is the domain size. >>>>> So if the you want to run the code for a 512^3 grid points on 16^3 >>>>> cores, you need to set "-N 512 -P 16" in the command line. >>>>> I add more comments and also fix an error in the attached code. ( The >>>>> error only effects the accuracy of solution but not the memory usage. ) >>>>> >>>>> Thank you. >>>>> Frank >>>>> >>>>> >>>>> On 9/14/2016 9:05 PM, Dave May wrote: >>>>> >>>>> >>>>> >>>>> On Thursday, 15 September 2016, Dave May <[email protected]> >>>>> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Thursday, 15 September 2016, frank <[email protected]> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I write a simple code to re-produce the error. I hope this can help >>>>>>> to diagnose the problem. >>>>>>> The code just solves a 3d poisson equation. >>>>>>> >>>>>> >>>>>> Why is the stencil width a runtime parameter?? And why is the default >>>>>> value 2? For 7-pnt FD Laplace, you only need a stencil width of 1. >>>>>> >>>>>> Was this choice made to mimic something in the real application code? >>>>>> >>>>> >>>>> Please ignore - I misunderstood your usage of the param set by -P >>>>> >>>>> >>>>>> >>>>>> >>>>>>> >>>>>>> I run the code on a 1024^3 mesh. The process partition is 32 * 32 * >>>>>>> 32. That's when I re-produce the OOM error. Each core has about 2G >>>>>>> memory. >>>>>>> I also run the code on a 512^3 mesh with 16 * 16 * 16 processes. The >>>>>>> ksp solver works fine. >>>>>>> I attached the code, ksp_view_pre's output and my petsc option file. >>>>>>> >>>>>>> Thank you. >>>>>>> Frank >>>>>>> >>>>>>> On 09/09/2016 06:38 PM, Hengjie Wang wrote: >>>>>>> >>>>>>> Hi Barry, >>>>>>> >>>>>>> I checked. 
On the supercomputer, I had the option "-ksp_view_pre" >>>>>>> but it is not in file I sent you. I am sorry for the confusion. >>>>>>> >>>>>>> Regards, >>>>>>> Frank >>>>>>> >>>>>>> On Friday, September 9, 2016, Barry Smith <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> > On Sep 9, 2016, at 3:11 PM, frank <[email protected]> wrote: >>>>>>>> > >>>>>>>> > Hi Barry, >>>>>>>> > >>>>>>>> > I think the first KSP view output is from -ksp_view_pre. Before I >>>>>>>> submitted the test, I was not sure whether there would be OOM error or >>>>>>>> not. >>>>>>>> So I added both -ksp_view_pre and -ksp_view. >>>>>>>> >>>>>>>> But the options file you sent specifically does NOT list the >>>>>>>> -ksp_view_pre so how could it be from that? >>>>>>>> >>>>>>>> Sorry to be pedantic but I've spent too much time in the past >>>>>>>> trying to debug from incorrect information and want to make sure that >>>>>>>> the >>>>>>>> information I have is correct before thinking. Please recheck exactly >>>>>>>> what >>>>>>>> happened. Rerun with the exact input file you emailed if that is >>>>>>>> needed. >>>>>>>> >>>>>>>> Barry >>>>>>>> >>>>>>>> > >>>>>>>> > Frank >>>>>>>> > >>>>>>>> > >>>>>>>> > On 09/09/2016 12:38 PM, Barry Smith wrote: >>>>>>>> >> Why does ksp_view2.txt have two KSP views in it while >>>>>>>> ksp_view1.txt has only one KSPView in it? Did you run two different >>>>>>>> solves >>>>>>>> in the 2 case but not the one? >>>>>>>> >> >>>>>>>> >> Barry >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >>> On Sep 9, 2016, at 10:56 AM, frank <[email protected]> wrote: >>>>>>>> >>> >>>>>>>> >>> Hi, >>>>>>>> >>> >>>>>>>> >>> I want to continue digging into the memory problem here. >>>>>>>> >>> I did find a work around in the past, which is to use less >>>>>>>> cores per node so that each core has 8G memory. However this is >>>>>>>> deficient >>>>>>>> and expensive. I hope to locate the place that uses the most memory. 
>>>>>>>> >>>
>>>>>>>> >>> Here is a brief summary of the tests I did in the past:
>>>>>>>> >>>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12
>>>>>>>> >>> Maximum (over computational time) process memory:        total 7.0727e+08
>>>>>>>> >>> Current process memory:                                  total 7.0727e+08
>>>>>>>> >>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11
>>>>>>>> >>> Current space PetscMalloc()ed:                           total 1.8275e+09
>>>>>>>> >>>
>>>>>>>> >>>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24
>>>>>>>> >>> Maximum (over computational time) process memory:        total 5.9431e+09
>>>>>>>> >>> Current process memory:                                  total 5.9431e+09
>>>>>>>> >>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12
>>>>>>>> >>> Current space PetscMalloc()ed:                           total 5.4844e+09
>>>>>>>> >>>
>>>>>>>> >>>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24
>>>>>>>> >>> The OOM (Out Of Memory) killer of the supercomputer terminated
>>>>>>>> >>> the job during "KSPSolve".
>>>>>>>> >>>
>>>>>>>> >>> I attached the output of ksp_view (the third test's output is
>>>>>>>> >>> from ksp_view_pre), memory_view and also the petsc options.
>>>>>>>> >>>
>>>>>>>> >>> In all the tests, each core can access about 2G memory. In
>>>>>>>> >>> test3, there are 4223139840 non-zeros in the matrix. This will consume
>>>>>>>> >>> about 1.74M per process, using double precision. Considering some
>>>>>>>> >>> extra memory used to store integer indices, 2G memory should still be
>>>>>>>> >>> more than enough.
>>>>>>>> >>>
>>>>>>>> >>> Is there a way to find out which part of KSPSolve uses the most
>>>>>>>> >>> memory?
>>>>>>>> >>> Thank you so much.
>>>>>>>> >>>
>>>>>>>> >>> BTW, there are 4 options that remain unused and I don't understand
>>>>>>>> >>> why they are omitted:
>>>>>>>> >>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly
>>>>>>>> >>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
>>>>>>>> >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>>>>>>>> >>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>> Regards,
>>>>>>>> >>> Frank
>>>>>>>> >>>
>>>>>>>> >>> On 07/13/2016 05:47 PM, Dave May wrote:
>>>>>>>> >>>>
>>>>>>>> >>>> On 14 July 2016 at 01:07, frank <[email protected]> wrote:
>>>>>>>> >>>> Hi Dave,
>>>>>>>> >>>>
>>>>>>>> >>>> Sorry for the late reply.
>>>>>>>> >>>> Thank you so much for your detailed reply.
>>>>>>>> >>>>
>>>>>>>> >>>> I have a question about the estimation of the memory usage.
>>>>>>>> >>>> There are 4223139840 allocated non-zeros and 18432 MPI processes.
>>>>>>>> >>>> Double precision is used. So the memory per process is:
>>>>>>>> >>>> 4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74M ?
>>>>>>>> >>>> Did I do sth wrong here? Because this seems too small.
>>>>>>>> >>>>
>>>>>>>> >>>> No - I totally f***ed it up. You are correct. That'll teach me
>>>>>>>> >>>> for fumbling around with my iphone calculator and not using my brain.
>>>>>>>> >>>> (Note that to convert to MB just divide by 1e6, not 1024^2 - although I
>>>>>>>> >>>> apparently cannot convert between units correctly....)
>>>>>>>> >>>>
>>>>>>>> >>>> From the PETSc objects associated with the solver, it looks
>>>>>>>> >>>> like it _should_ run with 2GB per MPI rank. Sorry for my mistake.
>>>>>>>> >>>> Possibilities are: somewhere in your usage of PETSc you've introduced a
>>>>>>>> >>>> memory leak; PETSc is doing a huge over allocation (e.g. as per our
>>>>>>>> >>>> discussion of MatPtAP); or in your application code there are other
>>>>>>>> >>>> objects you have forgotten to log the memory for.
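The per-process figure discussed above can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the non-zero count and rank count quoted in the thread. The variable names are illustrative, and the 4-byte column index (a 32-bit integer index) is an assumption, in line with the "32 bit integers" remark elsewhere in the thread.

```python
# Figures quoted in the thread.
nonzeros = 4_223_139_840   # allocated non-zeros in the fine-grid operator
ranks = 18_432             # MPI processes (process mesh 96 * 8 * 24)
bytes_per_value = 8        # double precision

nnz_per_rank = nonzeros // ranks
value_bytes = nnz_per_rank * bytes_per_value

# Assumption: one 32-bit (4-byte) column index per stored non-zero.
index_bytes = nnz_per_rank * 4

value_mb = value_bytes / 1e6        # decimal megabytes (divide by 1e6)
value_mib = value_bytes / 1024**2   # binary mebibytes (divide by 1024^2)

print(f"non-zeros per rank : {nnz_per_rank}")
print(f"matrix values      : {value_mb:.2f} MB ({value_mib:.2f} MiB)")
print(f"column indices     : {index_bytes / 1e6:.2f} MB (assumed 32-bit)")
```

Whichever unit convention is used, the fine-grid matrix values come to under 2 MB per rank, far below the roughly 2G available per core, which is why the memory use has to be sought elsewhere.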
>>>>>>>> >>>> >>>>>>>> >>>> >>>>>>>> >>>> >>>>>>>> >>>> I am running this job on Bluewater >>>>>>>> >>>> I am using the 7 points FD stencil in 3D. >>>>>>>> >>>> >>>>>>>> >>>> I thought so on both counts. >>>>>>>> >>>> >>>>>>>> >>>> I apologize that I made a stupid mistake in computing the >>>>>>>> memory per core. My settings render each core can access only 2G >>>>>>>> memory on >>>>>>>> average instead of 8G which I mentioned in previous email. I re-run >>>>>>>> the job >>>>>>>> with 8G memory per core on average and there is no "Out Of Memory" >>>>>>>> error. I >>>>>>>> would do more test to see if there is still some memory issue. >>>>>>>> >>>> >>>>>>>> >>>> Ok. I'd still like to know where the memory was being used >>>>>>>> since my estimates were off. >>>>>>>> >>>> >>>>>>>> >>>> >>>>>>>> >>>> Thanks, >>>>>>>> >>>> Dave >>>>>>>> >>>> >>>>>>>> >>>> Regards, >>>>>>>> >>>> Frank >>>>>>>> >>>> >>>>>>>> >>>> >>>>>>>> >>>> >>>>>>>> >>>> On 07/11/2016 01:18 PM, Dave May wrote: >>>>>>>> >>>>> Hi Frank, >>>>>>>> >>>>> >>>>>>>> >>>>> >>>>>>>> >>>>> On 11 July 2016 at 19:14, frank <[email protected]> wrote: >>>>>>>> >>>>> Hi Dave, >>>>>>>> >>>>> >>>>>>>> >>>>> I re-run the test using bjacobi as the preconditioner on the >>>>>>>> coarse mesh of telescope. The Grid is 3072*256*768 and process mesh is >>>>>>>> 96*8*24. The petsc option file is attached. >>>>>>>> >>>>> I still got the "Out Of Memory" error. The error occurred >>>>>>>> before the linear solver finished one step. So I don't have the full >>>>>>>> info >>>>>>>> from ksp_view. The info from ksp_view_pre is attached. >>>>>>>> >>>>> >>>>>>>> >>>>> Okay - that is essentially useless (sorry) >>>>>>>> >>>>> >>>>>>>> >>>>> It seems to me that the error occurred when the decomposition >>>>>>>> was going to be changed. >>>>>>>> >>>>> >>>>>>>> >>>>> Based on what information? >>>>>>>> >>>>> Running with -info would give us more clues, but will create >>>>>>>> a ton of output. 
>>>>>>>> >>>>> Please try running the case which failed with -info >>>>>>>> >>>>> I had another test with a grid of 1536*128*384 and the same >>>>>>>> process mesh as above. There was no error. The ksp_view info is >>>>>>>> attached >>>>>>>> for comparison. >>>>>>>> >>>>> Thank you. >>>>>>>> >>>>> >>>>>>>> >>>>> >>>>>>>> >>>>> [3] Here is my crude estimate of your memory usage. >>>>>>>> >>>>> I'll target the biggest memory hogs only to get an order of >>>>>>>> magnitude estimate >>>>>>>> >>>>> >>>>>>>> >>>>> * The Fine grid operator contains 4223139840 non-zeros --> >>>>>>>> 1.8 GB per MPI rank assuming double precision. >>>>>>>> >>>>> The indices for the AIJ could amount to another 0.3 GB >>>>>>>> (assuming 32 bit integers) >>>>>>>> >>>>> >>>>>>>> >>>>> * You use 5 levels of coarsening, so the other operators >>>>>>>> should represent (collectively) >>>>>>>> >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300 MB per MPI rank >>>>>>>> on the communicator with 18432 ranks. >>>>>>>> >>>>> The coarse grid should consume ~ 0.5 MB per MPI rank on the >>>>>>>> communicator with 18432 ranks. >>>>>>>> >>>>> >>>>>>>> >>>>> * You use a reduction factor of 64, making the new >>>>>>>> communicator with 288 MPI ranks. >>>>>>>> >>>>> PCTelescope will first gather a temporary matrix associated >>>>>>>> with your coarse level operator assuming a comm size of 288 living on >>>>>>>> the >>>>>>>> comm with size 18432. >>>>>>>> >>>>> This matrix will require approximately 0.5 * 64 = 32 MB per >>>>>>>> core on the 288 ranks. >>>>>>>> >>>>> This matrix is then used to form a new MPIAIJ matrix on the >>>>>>>> subcomm, thus require another 32 MB per rank. >>>>>>>> >>>>> The temporary matrix is now destroyed. >>>>>>>> >>>>> >>>>>>>> >>>>> * Because a DMDA is detected, a permutation matrix is >>>>>>>> assembled. >>>>>>>> >>>>> This requires 2 doubles per point in the DMDA. >>>>>>>> >>>>> Your coarse DMDA contains 92 x 16 x 48 points. 
>>>>>>>> >>>>> Thus the permutation matrix will require < 1 MB per MPI rank >>>>>>>> on the sub-comm. >>>>>>>> >>>>> >>>>>>>> >>>>> * Lastly, the matrix is permuted. This uses MatPtAP(), but >>>>>>>> the resulting operator will have the same memory footprint as the >>>>>>>> unpermuted matrix (32 MB). At any stage in PCTelescope, only 2 >>>>>>>> operators of >>>>>>>> size 32 MB are held in memory when the DMDA is provided. >>>>>>>> >>>>> >>>>>>>> >>>>> From my rough estimates, the worst case memory foot print for >>>>>>>> any given core, given your options is approximately >>>>>>>> >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB >>>>>>>> >>>>> This is way below 8 GB. >>>>>>>> >>>>> >>>>>>>> >>>>> Note this estimate completely ignores: >>>>>>>> >>>>> (1) the memory required for the restriction operator, >>>>>>>> >>>>> (2) the potential growth in the number of non-zeros per row >>>>>>>> due to Galerkin coarsening (I wished -ksp_view_pre reported the output >>>>>>>> from >>>>>>>> MatView so we could see the number of non-zeros required by the coarse >>>>>>>> level operators) >>>>>>>> >>>>> (3) all temporary vectors required by the CG solver, and >>>>>>>> those required by the smoothers. >>>>>>>> >>>>> (4) internal memory allocated by MatPtAP >>>>>>>> >>>>> (5) memory associated with IS's used within PCTelescope >>>>>>>> >>>>> >>>>>>>> >>>>> So either I am completely off in my estimates, or you have >>>>>>>> not carefully estimated the memory usage of your application code. >>>>>>>> Hopefully others might examine/correct my rough estimates >>>>>>>> >>>>> >>>>>>>> >>>>> Since I don't have your code I cannot access the latter. >>>>>>>> >>>>> Since I don't have access to the same machine you are running >>>>>>>> on, I think we need to take a step back. >>>>>>>> >>>>> >>>>>>>> >>>>> [1] What machine are you running on? Send me a URL if its >>>>>>>> available >>>>>>>> >>>>> >>>>>>>> >>>>> [2] What discretization are you using? 
(I am guessing a >>>>>>>> scalar 7 point FD stencil) >>>>>>>> >>>>> If it's a 7 point FD stencil, we should be able to examine >>>>>>>> the memory usage of your solver configuration using a standard, light >>>>>>>> weight existing PETSc example, run on your machine at the same scale. >>>>>>>> >>>>> This would hopefully enable us to correctly evaluate the >>>>>>>> actual memory usage required by the solver configuration you are using. >>>>>>>> >>>>> >>>>>>>> >>>>> Thanks, >>>>>>>> >>>>> Dave >>>>>>>> >>>>> >>>>>>>> >>>>> >>>>>>>> >>>>> Frank >>>>>>>> >>>>> >>>>>>>> >>>>> >>>>>>>> >>>>> >>>>>>>> >>>>> >>>>>>>> >>>>> On 07/08/2016 10:38 PM, Dave May wrote: >>>>>>>> >>>>>> >>>>>>>> >>>>>> On Saturday, 9 July 2016, frank <[email protected]> wrote: >>>>>>>> >>>>>> Hi Barry and Dave, >>>>>>>> >>>>>> >>>>>>>> >>>>>> Thank both of you for the advice. >>>>>>>> >>>>>> >>>>>>>> >>>>>> @Barry >>>>>>>> >>>>>> I made a mistake in the file names in last email. I attached >>>>>>>> the correct files this time. >>>>>>>> >>>>>> For all the three tests, 'Telescope' is used as the coarse >>>>>>>> preconditioner. >>>>>>>> >>>>>> >>>>>>>> >>>>>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12 >>>>>>>> >>>>>> Part of the memory usage: Vector 125 124 >>>>>>>> 3971904 0. >>>>>>>> >>>>>> Matrix 101 >>>>>>>> 101 9462372 0 >>>>>>>> >>>>>> >>>>>>>> >>>>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24 >>>>>>>> >>>>>> Part of the memory usage: Vector 125 124 >>>>>>>> 681672 0. >>>>>>>> >>>>>> Matrix 101 >>>>>>>> 101 1462180 0. >>>>>>>> >>>>>> >>>>>>>> >>>>>> In theory, the memory usage in Test1 should be 8 times of >>>>>>>> Test2. In my case, it is about 6 times. >>>>>>>> >>>>>> >>>>>>>> >>>>>> == Test3: Grid: 3072*256*768, Process Mesh: 96*8*24. >>>>>>>> Sub-domain per process: 32*32*32 >>>>>>>> >>>>>> Here I get the out of memory error. >>>>>>>> >>>>>> >>>>>>>> >>>>>> I tried to use -mg_coarse jacobi. 
In this way, I don't need >>>>>>>> to set -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly, right? >>>>>>>> >>>>>> The linear solver didn't work in this case. Petsc output >>>>>>>> some errors. >>>>>>>> >>>>>> >>>>>>>> >>>>>> @Dave >>>>>>>> >>>>>> In test3, I use only one instance of 'Telescope'. On the >>>>>>>> coarse mesh of 'Telescope', I used LU as the preconditioner instead of >>>>>>>> SVD. >>>>>>>> >>>>>> If my set the levels correctly, then on the last coarse mesh >>>>>>>> of MG where it calls 'Telescope', the sub-domain per process is 2*2*2. >>>>>>>> >>>>>> On the last coarse mesh of 'Telescope', there is only one >>>>>>>> grid point per process. >>>>>>>> >>>>>> I still got the OOM error. The detailed petsc option file is >>>>>>>> attached. >>>>>>>> >>>>>> >>>>>>>> >>>>>> Do you understand the expected memory usage for the >>>>>>>> particular parallel LU implementation you are using? I don't >>>>>>>> (seriously). >>>>>>>> Replace LU with bjacobi and re-run this test. My point about solver >>>>>>>> debugging is still valid. >>>>>>>> >>>>>> >>>>>>>> >>>>>> And please send the result of KSPView so we can see what is >>>>>>>> actually used in the computations >>>>>>>> >>>>>> >>>>>>>> >>>>>> Thanks >>>>>>>> >>>>>> Dave >>>>>>>> >>>>>> >>>>>>>> >>>>>> >>>>>>>> >>>>>> Thank you so much. >>>>>>>> >>>>>> >>>>>>>> >>>>>> Frank >>>>>>>> >>>>>> >>>>>>>> >>>>>> >>>>>>>> >>>>>> >>>>>>>> >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote: >>>>>>>> >>>>>> On Jul 6, 2016, at 4:19 PM, frank <[email protected]> wrote: >>>>>>>> >>>>>> >>>>>>>> >>>>>> Hi Barry, >>>>>>>> >>>>>> >>>>>>>> >>>>>> Thank you for you advice. >>>>>>>> >>>>>> I tried three test. In the 1st test, the grid is >>>>>>>> 3072*256*768 and the process mesh is 96*8*24. >>>>>>>> >>>>>> The linear solver is 'cg' the preconditioner is 'mg' and >>>>>>>> 'telescope' is used as the preconditioner at the coarse mesh. 
>>>>>>>> >>>>>> The system gives me the "Out of Memory" error before the >>>>>>>> linear system is completely solved. >>>>>>>> >>>>>> The info from '-ksp_view_pre' is attached. I seems to me >>>>>>>> that the error occurs when it reaches the coarse mesh. >>>>>>>> >>>>>> >>>>>>>> >>>>>> The 2nd test uses a grid of 1536*128*384 and process mesh is >>>>>>>> 96*8*24. The 3rd test uses >>>>>>>> the >>>>>>>> same grid but a different process mesh 48*4*12. >>>>>>>> >>>>>> Are you sure this is right? The total matrix and vector >>>>>>>> memory usage goes from 2nd test >>>>>>>> >>>>>> Vector 384 383 8,193,712 >>>>>>>> 0. >>>>>>>> >>>>>> Matrix 103 103 11,508,688 >>>>>>>> 0. >>>>>>>> >>>>>> to 3rd test >>>>>>>> >>>>>> Vector 384 383 1,590,520 >>>>>>>> 0. >>>>>>>> >>>>>> Matrix 103 103 3,508,664 >>>>>>>> 0. >>>>>>>> >>>>>> that is the memory usage got smaller but if you have only >>>>>>>> 1/8th the processes and the same grid it should have gotten about 8 >>>>>>>> times >>>>>>>> bigger. Did you maybe cut the grid by a factor of 8 also? If so that >>>>>>>> still >>>>>>>> doesn't explain it because the memory usage changed by a factor of 5 >>>>>>>> something for the vectors and 3 something for the matrices. >>>>>>>> >>>>>> >>>>>>>> >>>>>> >>>>>>>> >>>>>> The linear solver and petsc options in 2nd and 3rd tests are >>>>>>>> the same in 1st test. The linear solver works fine in both test. >>>>>>>> >>>>>> I attached the memory usage of the 2nd and 3rd tests. The >>>>>>>> memory info is from the option '-log_summary'. I tried to use >>>>>>>> '-momery_info' as you suggested, but in my case petsc treated it as an >>>>>>>> unused option. It output nothing about the memory. Do I need to add >>>>>>>> sth to >>>>>>>> my code so I can use '-memory_info'? >>>>>>>> >>>>>> Sorry, my mistake the option is -memory_view >>>>>>>> >>>>>> >>>>>>>> >>>>>> Can you run the one case with -memory_view and -mg_coarse >>>>>>>> jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to >>>>>>> >>>>>>>
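The factors mentioned in the message above ("5 something for the vectors and 3 something for the matrices") follow from the -log_summary figures quoted there. A minimal sketch of that arithmetic in Python; the variable names are illustrative, and the numbers are taken as printed, presumably in bytes:

```python
# Memory reported under "Vector" and "Matrix" for the two runs,
# exactly as quoted in the message above.
vec_2nd, vec_3rd = 8_193_712, 1_590_520
mat_2nd, mat_3rd = 11_508_688, 3_508_664

vec_ratio = vec_2nd / vec_3rd
mat_ratio = mat_2nd / mat_3rd

print(f"vector memory ratio (2nd test / 3rd test): {vec_ratio:.2f}")
print(f"matrix memory ratio (2nd test / 3rd test): {mat_ratio:.2f}")
```

Neither ratio is close to 8, the factor expected when the same grid is spread over 8 times as many or as few processes, which is the inconsistency being pointed out.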
