Re: [petsc-users] Performance of the Telescope Multigrid Preconditioner

Dave May Fri, 07 Oct 2016 21:52:38 -0700

On Friday, 7 October 2016, frank <[email protected]> wrote:

> Dear all,
>
> Thank you so much for the advice.
>
> All setup is done in the first solve.
>
>
>> ** The time for 1st solve does not scale.
>>     In practice, I am solving a variable coefficient  Poisson equation. I
>> need to build the matrix every time step. Therefore, each step is similar
>> to the 1st solve which does not scale. Is there a way I can improve the
>> performance?
>>
>
>> You could use rediscretization instead of Galerkin to produce the coarse
>> operators.
>>
>
> Yes I can think of one option for improved performance, but I cannot tell
> whether it will be beneficial because the logging isn't sufficiently fine
> grained (and there is no easy way to get the info out of petsc).
>
> I use PtAP to repartition the matrix, this could be consuming most of the
> setup time in Telescope with your run. Such a repartitioning could be avoided
> if you provided a method to create the operator on the coarse levels (what
> Matt is suggesting). However, this requires you to be able to define your
> coefficients on the coarse grid. This will most likely reduce setup time,
> but your coarse grid operators (now re-discretized) are likely to be less
> effective than those generated via Galerkin coarsening.
>
>
> Please correct me if I understand this incorrectly:   I can define my own
> restriction function and pass it to petsc instead of using PtAP.
> If so, what's the interface to do that?
>


You need to provide a method via KSPSetComputeOperators on your
outer KSP

http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/KSP/KSPSetComputeOperators.html

This method will get propagated through telescope to the KSP running in the
sub-comm.

Note that this functionality is currently not supported for Fortran. I need
to make a small modification to telescope to enable Fortran support.

Thanks
  Dave


>
>
>
> Also, you use CG/MG when FMG by itself would probably be faster. Your
>> smoother is likely not strong enough, and you
>> should use something like V(2,2). There is a lot of tuning that is
>> possible, but difficult to automate.
>>
>
> Matt's completely correct.
> If we could automate this in a meaningful manner, we would have done so.
>
>
> I am not as familiar with multigrid as you guys. It would be very kind if
> you could be more specific.
> What does V(2,2) stand for? Is there some strong smoother built into PETSc
> that I can try?
>
>
> Another thing: the vector assembly and scatter take more time as I
> increase the number of cores:
>
> cores#              count    4096        8192        16384       32768       65536
> VecAssemblyBegin    298      2.91E+00    2.87E+00    8.59E+00    2.75E+01    2.21E+03
> VecAssemblyEnd      298      3.37E-03    1.78E-03    1.78E-03    5.13E-03    1.99E-03
> VecScatterBegin     76303    3.82E+00    3.01E+00    2.54E+00    4.40E+00    1.32E+00
> VecScatterEnd       76303    3.09E+01    1.47E+01    2.23E+01    2.96E+01    2.10E+01
>
> The above data is produced by solving a constant-coefficient Poisson
> equation with different rhs for 100 steps.
> As you can see, the time of VecAssemblyBegin increases dramatically from
> 32K cores to 65K.
> With 65K cores, it took more time to assemble the rhs than to solve the
> equation.   Is there a way to improve this?
>
>
> Thank you.
>
> Regards,
> Frank
>
>>>
>>>
>>>
>>>
>>> On 10/04/2016 12:56 PM, Dave May wrote:
>>>
>>>
>>>
>>> On Tuesday, 4 October 2016, frank <[email protected]> wrote:
>>>
>>>> Hi,
>>>> This question is a follow-up to the thread "Question about memory usage
>>>> in Multigrid preconditioner".
>>>> I used to have the "Out of Memory (OOM)" problem when using the
>>>> CG+Telescope MG solver with 32768 cores. Adding the "-matrap 0
>>>> -matptap_scalable" options did solve that problem.
>>>>
>>>> Then I tested the scalability by solving a 3d Poisson eqn for 1 step. I
>>>> used one sub-communicator in all the tests. The differences between the
>>>> petsc options in those tests are: (1) the pc_telescope_reduction_factor;
>>>> (2) the number of multigrid levels in the up/down solver. The function
>>>> "ksp_solve" is timed. It is kind of slow and doesn't scale at all.
>>>>
>>>> Test1: 512^3 grid points
>>>> Core#    telescope_reduction_factor    MG levels# for up/down solver    Time for KSPSolve (s)
>>>> 512      8                             4 / 3                            6.2466
>>>> 4096     64                            5 / 3                            0.9361
>>>> 32768    64                            4 / 3                            4.8914
>>>>
>>>> Test2: 1024^3 grid points
>>>> Core#    telescope_reduction_factor    MG levels# for up/down solver    Time for KSPSolve (s)
>>>> 4096     64                            5 / 4                            3.4139
>>>> 8192     128                           5 / 4                            2.4196
>>>> 16384    32                            5 / 3                            5.4150
>>>> 32768    64                            5 / 3                            5.6067
>>>> 65536    128                           5 / 3                            6.5219
>>>>
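As a rough reading aid, the Test2 timings above can be turned into strong-scaling speedup and parallel efficiency relative to the 4096-core run. This is only a sketch: the function name `strong_scaling` is made up for illustration, and because the runs differ in reduction factor and level count and the timings include setup, the numbers indicate a trend rather than a clean measurement.

```python
# Strong-scaling summary for the Test2 (1024^3) timings quoted above.
# Speedup and efficiency are measured against the smallest core count.

def strong_scaling(timings):
    """Return {cores: (speedup, efficiency)} relative to the smallest run."""
    base_cores = min(timings)
    base_time = timings[base_cores]
    result = {}
    for cores, seconds in sorted(timings.items()):
        speedup = base_time / seconds
        ideal = cores / base_cores          # perfect strong scaling
        result[cores] = (speedup, speedup / ideal)
    return result

# Core count -> reported KSPSolve time in seconds (Test2 in the message above).
test2 = {4096: 3.4139, 8192: 2.4196, 16384: 5.4150, 32768: 5.6067, 65536: 6.5219}

if __name__ == "__main__":
    for cores, (speedup, eff) in strong_scaling(test2).items():
        print(f"{cores:6d} cores: speedup {speedup:5.2f}, efficiency {eff:6.1%}")
```

Only the 8192-core run is faster than the 4096-core baseline; every larger run is slower, which is the "doesn't scale" observation in numeric form.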
>>>
>>> You have to be very careful how you interpret these numbers. Your solver
>>> contains nested calls to KSPSolve, and unfortunately as a result the
>>> numbers you report include setup time. This will remain true even if you
>>> call KSPSetUp on the outermost KSP.
>>>
>>> Your email concerns scalability of the solver application, so let's
>>> focus on that issue.
>>>
>>> The only way to clearly separate setup from solve time is to perform two
>>> identical solves. The second solve will not require any setup. You should
>>> monitor the second solve via a new PetscStage.
>>>
>>> This was what I did in the telescope paper. It was the only way to
>>> understand the setup cost (and scaling) cf the solve time (and scaling).
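The idea can be illustrated without PETSc. The sketch below uses a toy solver class (`ToySolver`, entirely hypothetical and not a PETSc API) that performs its expensive setup lazily inside the first solve; timing two identical solves separately attributes the setup cost to the first one only. In a real PETSc code the second solve would be monitored in its own log stage rather than with `time.perf_counter`.

```python
import time

class ToySolver:
    """Hypothetical stand-in for a solver whose setup runs inside the first solve."""

    def __init__(self, n):
        self.n = n
        self.setup_calls = 0
        self._diag = None  # built lazily, like a preconditioner

    def _setup(self):
        self.setup_calls += 1
        # Pretend this is expensive (e.g. building coarse operators).
        self._diag = [2.0 + (i % 3) for i in range(self.n)]

    def solve(self, rhs):
        if self._diag is None:      # setup happens only on the first call
            self._setup()
        return [b / d for b, d in zip(rhs, self._diag)]

def timed_solves(solver, rhs, repeats=2):
    """Run identical solves and return the wall time of each one."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        solver.solve(rhs)
        times.append(time.perf_counter() - start)
    return times

if __name__ == "__main__":
    solver = ToySolver(200_000)
    first, second = timed_solves(solver, [1.0] * solver.n)
    print(f"solve 1 (setup + solve): {first:.4f} s")
    print(f"solve 2 (solve only):    {second:.4f} s")
```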
>>>
>>> Thanks
>>>   Dave
>>>
>>>
>>>
>>>> I guess I didn't set the MG levels properly. What would be the
>>>> efficient way to arrange the MG levels?
>>>> Also which preconditioner at the coarse mesh of the 2nd communicator
>>>> should I use to improve the performance?
>>>>
>>>> I attached the test code and the petsc options file for the 1024^3 cube
>>>> with 32768 cores.
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Frank
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 09/15/2016 03:35 AM, Dave May wrote:
>>>>
>>>> Hi all,
>>>>
>>>> The only unexpected memory usage I can see is associated with the
>>>> call to MatPtAP().
>>>> Here is something you can try immediately.
>>>> Run your code with the additional options
>>>>   -matrap 0 -matptap_scalable
>>>>
>>>> I didn't realize this before, but the default behaviour of MatPtAP in
>>>> parallel is actually to explicitly form the transpose of P (e.g.
>>>> assemble R = P^T) and then compute R.A.P.
>>>> You don't want to do this. The option -matrap 0 resolves this issue.
>>>>
>>>> The implementation of P^T.A.P has two variants.
>>>> The scalable implementation (with respect to memory usage) is selected
>>>> via the second option -matptap_scalable.
>>>>
>>>> Try it out - I see a significant memory reduction using these options
>>>> for particular mesh sizes / partitions.
>>>>
>>>> I've attached a cleaned up version of the code you sent me.
>>>> There were a number of memory leaks and other issues.
>>>> The main points being
>>>>   * You should call DMDAVecGetArrayF90() before VecAssembly{Begin,End}
>>>>   * You should call PetscFinalize(), otherwise the option -log_summary
>>>> (-log_view) will not display anything once the program has completed.
>>>>
>>>>
>>>> Thanks,
>>>>   Dave
>>>>
>>>>
>>>> On 15 September 2016 at 08:03, Hengjie Wang <[email protected]> wrote:
>>>>
>>>>> Hi Dave,
>>>>>
>>>>> Sorry, I should have put more comments to explain the code.
>>>>> The number of processes in each dimension is the same: Px = Py = Pz = P. So
>>>>> is the domain size.
>>>>> So if you want to run the code for 512^3 grid points on 16^3
>>>>> cores, you need to set "-N 512 -P 16" in the command line.
>>>>> I added more comments and also fixed an error in the attached code. (The
>>>>> error only affects the accuracy of the solution but not the memory usage.)
>>>>>
>>>>> Thank you.
>>>>> Frank
>>>>>
>>>>>
>>>>> On 9/14/2016 9:05 PM, Dave May wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thursday, 15 September 2016, Dave May <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Thursday, 15 September 2016, frank <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I wrote a simple code to reproduce the error. I hope this can help
>>>>>>> to diagnose the problem.
>>>>>>> The code just solves a 3d Poisson equation.
>>>>>>>
>>>>>>
>>>>>> Why is the stencil width a runtime parameter?? And why is the default
>>>>>> value 2? For 7-pnt FD Laplace, you only need a stencil width of 1.
>>>>>>
>>>>>> Was this choice made to mimic something in the real application code?
>>>>>>
>>>>>
>>>>> Please ignore - I misunderstood your usage of the param set by -P
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I ran the code on a 1024^3 mesh. The process partition is 32 * 32 *
>>>>>>> 32. That's when I reproduced the OOM error. Each core has about 2G
>>>>>>> memory.
>>>>>>> I also ran the code on a 512^3 mesh with 16 * 16 * 16 processes. The
>>>>>>> ksp solver works fine.
>>>>>>> I attached the code, ksp_view_pre's output and my petsc option file.
>>>>>>>
>>>>>>> Thank you.
>>>>>>> Frank
>>>>>>>
>>>>>>> On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>>>>>>
>>>>>>> Hi Barry,
>>>>>>>
>>>>>>> I checked. On the supercomputer, I had the option "-ksp_view_pre"
>>>>>>> but it is not in the file I sent you. I am sorry for the confusion.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Frank
>>>>>>>
>>>>>>> On Friday, September 9, 2016, Barry Smith <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> > On Sep 9, 2016, at 3:11 PM, frank <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> > Hi Barry,
>>>>>>>> >
>>>>>>>> > I think the first KSP view output is from -ksp_view_pre. Before I
>>>>>>>> submitted the test, I was not sure whether there would be OOM error or 
>>>>>>>> not.
>>>>>>>> So I added both -ksp_view_pre and -ksp_view.
>>>>>>>>
>>>>>>>>   But the options file you sent specifically does NOT list the
>>>>>>>> -ksp_view_pre so how could it be from that?
>>>>>>>>
>>>>>>>>    Sorry to be pedantic but I've spent too much time in the past
>>>>>>>> trying to debug from incorrect information and want to make sure that 
>>>>>>>> the
>>>>>>>> information I have is correct before thinking. Please recheck exactly 
>>>>>>>> what
>>>>>>>> happened. Rerun with the exact input file you emailed if that is 
>>>>>>>> needed.
>>>>>>>>
>>>>>>>>    Barry
>>>>>>>>
>>>>>>>> >
>>>>>>>> > Frank
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On 09/09/2016 12:38 PM, Barry Smith wrote:
>>>>>>>> >>   Why does ksp_view2.txt have two KSP views in it while
>>>>>>>> ksp_view1.txt has only one KSPView in it? Did you run two different 
>>>>>>>> solves
>>>>>>>> in the 2 case but not the one?
>>>>>>>> >>
>>>>>>>> >>   Barry
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>> On Sep 9, 2016, at 10:56 AM, frank <[email protected]> wrote:
>>>>>>>> >>>
>>>>>>>> >>> Hi,
>>>>>>>> >>>
>>>>>>>> >>> I want to continue digging into the memory problem here.
>>>>>>>> >>> I did find a workaround in the past, which is to use fewer
>>>>>>>> cores per node so that each core has 8G memory. However this is
>>>>>>>> inefficient and expensive. I hope to locate the place that uses the
>>>>>>>> most memory.
>>>>>>>> >>>
>>>>>>>> >>> Here is a brief summary of the tests I did in the past:
>>>>>>>> >>>> Test1:   Mesh 1536*128*384  |  Process Mesh 48*4*12
>>>>>>>> >>> Maximum (over computational time) process memory:         total 7.0727e+08
>>>>>>>> >>> Current process memory:                                   total 7.0727e+08
>>>>>>>> >>> Maximum (over computational time) space PetscMalloc()ed:  total 6.3908e+11
>>>>>>>> >>> Current space PetscMalloc()ed:                            total 1.8275e+09
>>>>>>>> >>>
>>>>>>>> >>>> Test2:    Mesh 1536*128*384  |  Process Mesh 96*8*24
>>>>>>>> >>> Maximum (over computational time) process memory:         total 5.9431e+09
>>>>>>>> >>> Current process memory:                                   total 5.9431e+09
>>>>>>>> >>> Maximum (over computational time) space PetscMalloc()ed:  total 5.3202e+12
>>>>>>>> >>> Current space PetscMalloc()ed:                            total 5.4844e+09
>>>>>>>> >>>
>>>>>>>> >>>> Test3:    Mesh 3072*256*768  |  Process Mesh 96*8*24
>>>>>>>> >>>     OOM (Out Of Memory) killer of the supercomputer terminated
>>>>>>>> the job during "KSPSolve".
>>>>>>>> >>>
>>>>>>>> >>> I attached the output of ksp_view( the third test's output is
>>>>>>>> from ksp_view_pre ), memory_view and also the petsc options.
>>>>>>>> >>>
>>>>>>>> >>> In all the tests, each core can access about 2G memory. In
>>>>>>>> test3, there are 4223139840 non-zeros in the matrix. This will consume
>>>>>>>> about 1.74M per process, using double precision. Considering some extra
>>>>>>>> memory used to store integer indices, 2G memory should still be more
>>>>>>>> than enough.
>>>>>>>> >>>
>>>>>>>> >>> Is there a way to find out which part of KSPSolve uses the most
>>>>>>>> memory?
>>>>>>>> >>> Thank you so much.
>>>>>>>> >>>
>>>>>>>> >>> BTW, there are 4 options that remain unused and I don't understand
>>>>>>>> why they are omitted:
>>>>>>>> >>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly
>>>>>>>> >>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
>>>>>>>> >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>>>>>>>> >>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>> Regards,
>>>>>>>> >>> Frank
>>>>>>>> >>>
>>>>>>>> >>> On 07/13/2016 05:47 PM, Dave May wrote:
>>>>>>>> >>>>
>>>>>>>> >>>> On 14 July 2016 at 01:07, frank <[email protected]> wrote:
>>>>>>>> >>>> Hi Dave,
>>>>>>>> >>>>
>>>>>>>> >>>> Sorry for the late reply.
>>>>>>>> >>>> Thank you so much for your detailed reply.
>>>>>>>> >>>>
>>>>>>>> >>>> I have a question about the estimation of the memory usage.
>>>>>>>> There are 4223139840 allocated non-zeros and 18432 MPI processes.
>>>>>>>> Double precision is used. So the memory per process is:
>>>>>>>> >>>>   4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74M ?
>>>>>>>> >>>> Did I do something wrong here? Because this seems too small.
>>>>>>>> >>>>
>>>>>>>> >>>> No - I totally f***ed it up. You are correct. That'll teach me
>>>>>>>> for fumbling around with my iphone calculator and not using my brain. 
>>>>>>>> (Note
>>>>>>>> that to convert to MB just divide by 1e6, not 1024^2 - although I
>>>>>>>> apparently cannot convert between units correctly....)
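The per-rank figure agreed on above can be checked in a few lines. This is a sketch under the same assumptions as the message: 8-byte double-precision values and non-zeros spread evenly over the ranks. The index estimate additionally assumes one 32-bit column index per non-zero, following the 32-bit-integer remark elsewhere in the thread. The function name is illustrative.

```python
# Per-rank storage for the matrix values discussed above.
NNZ = 4223139840      # allocated non-zeros reported in the thread
RANKS = 18432         # MPI processes (96 * 8 * 24)

def bytes_per_rank(nnz, ranks, bytes_per_entry):
    """Average bytes per rank if entries are spread evenly."""
    return nnz * bytes_per_entry / ranks

values = bytes_per_rank(NNZ, RANKS, 8)   # double-precision values
indices = bytes_per_rank(NNZ, RANKS, 4)  # 32-bit column indices (assumption)

if __name__ == "__main__":
    print(f"values : {values / 1024**2:.2f} MiB = {values / 1e6:.2f} MB per rank")
    print(f"indices: {indices / 1024**2:.2f} MiB = {indices / 1e6:.2f} MB per rank")
```

Dividing by 1024^2 or by 1e6 changes the figure only slightly; either way the matrix values need under 2 MB per rank.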
>>>>>>>> >>>>
>>>>>>>> >>>> From the PETSc objects associated with the solver, It looks
>>>>>>>> like it _should_ run with 2GB per MPI rank. Sorry for my mistake.
>>>>>>>> Possibilities are: somewhere in your usage of PETSc you've introduced a
>>>>>>>> memory leak; PETSc is doing a huge over allocation (e.g. as per our
>>>>>>>> discussion of MatPtAP); or in your application code there are other 
>>>>>>>> objects
>>>>>>>> you have forgotten to log the memory for.
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>> I am running this job on Blue Waters.
>>>>>>>> >>>> I am using the 7-point FD stencil in 3D.
>>>>>>>> >>>>
>>>>>>>> >>>> I thought so on both counts.
>>>>>>>> >>>>
>>>>>>>> >>>> I apologize that I made a stupid mistake in computing the
>>>>>>>> memory per core. With my settings each core can access only 2G memory
>>>>>>>> on average instead of the 8G which I mentioned in my previous email. I
>>>>>>>> re-ran the job with 8G memory per core on average and there is no
>>>>>>>> "Out Of Memory" error. I will do more tests to see if there is still
>>>>>>>> some memory issue.
>>>>>>>> >>>>
>>>>>>>> >>>> Ok. I'd still like to know where the memory was being used
>>>>>>>> since my estimates were off.
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>> Thanks,
>>>>>>>> >>>>   Dave
>>>>>>>> >>>>
>>>>>>>> >>>> Regards,
>>>>>>>> >>>> Frank
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>>>>>>> >>>>> Hi Frank,
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>> On 11 July 2016 at 19:14, frank <[email protected]> wrote:
>>>>>>>> >>>>> Hi Dave,
>>>>>>>> >>>>>
>>>>>>>> >>>>> I re-ran the test using bjacobi as the preconditioner on the
>>>>>>>> coarse mesh of telescope. The grid is 3072*256*768 and the process mesh
>>>>>>>> is 96*8*24. The petsc option file is attached.
>>>>>>>> >>>>> I still got the "Out Of Memory" error. The error occurred
>>>>>>>> before the linear solver finished one step. So I don't have the full
>>>>>>>> info from ksp_view. The info from ksp_view_pre is attached.
>>>>>>>> >>>>>
>>>>>>>> >>>>> Okay - that is essentially useless (sorry)
>>>>>>>> >>>>>
>>>>>>>> >>>>> It seems to me that the error occurred when the decomposition
>>>>>>>> was going to be changed.
>>>>>>>> >>>>>
>>>>>>>> >>>>> Based on what information?
>>>>>>>> >>>>> Running with -info would give us more clues, but will create
>>>>>>>> a ton of output.
>>>>>>>> >>>>> Please try running the case which failed with -info
>>>>>>>> >>>>>  I had another test with a grid of 1536*128*384 and the same
>>>>>>>> process mesh as above. There was no error. The ksp_view info is 
>>>>>>>> attached
>>>>>>>> for comparison.
>>>>>>>> >>>>> Thank you.
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>> [3] Here is my crude estimate of your memory usage.
>>>>>>>> >>>>> I'll target the biggest memory hogs only to get an order of
>>>>>>>> magnitude estimate
>>>>>>>> >>>>>
>>>>>>>> >>>>> * The Fine grid operator contains 4223139840 non-zeros -->
>>>>>>>> 1.8 GB per MPI rank assuming double precision.
>>>>>>>> >>>>> The indices for the AIJ could amount to another 0.3 GB
>>>>>>>> (assuming 32 bit integers)
>>>>>>>> >>>>>
>>>>>>>> >>>>> * You use 5 levels of coarsening, so the other operators
>>>>>>>> should represent (collectively)
>>>>>>>> >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4  ~ 300 MB per MPI rank
>>>>>>>> on the communicator with 18432 ranks.
>>>>>>>> >>>>> The coarse grid should consume ~ 0.5 MB per MPI rank on the
>>>>>>>> communicator with 18432 ranks.
>>>>>>>> >>>>>
>>>>>>>> >>>>> * You use a reduction factor of 64, making the new
>>>>>>>> communicator with 288 MPI ranks.
>>>>>>>> >>>>> PCTelescope will first gather a temporary matrix associated
>>>>>>>> with your coarse level operator assuming a comm size of 288 living on 
>>>>>>>> the
>>>>>>>> comm with size 18432.
>>>>>>>> >>>>> This matrix will require approximately 0.5 * 64 = 32 MB per
>>>>>>>> core on the 288 ranks.
>>>>>>>> >>>>> This matrix is then used to form a new MPIAIJ matrix on the
>>>>>>>> subcomm, thus require another 32 MB per rank.
>>>>>>>> >>>>> The temporary matrix is now destroyed.
>>>>>>>> >>>>>
>>>>>>>> >>>>> * Because a DMDA is detected, a permutation matrix is
>>>>>>>> assembled.
>>>>>>>> >>>>> This requires 2 doubles per point in the DMDA.
>>>>>>>> >>>>> Your coarse DMDA contains 192 x 16 x 48 points.
>>>>>>>> >>>>> Thus the permutation matrix will require < 1 MB per MPI rank
>>>>>>>> on the sub-comm.
>>>>>>>> >>>>>
>>>>>>>> >>>>> * Lastly, the matrix is permuted. This uses MatPtAP(), but
>>>>>>>> the resulting operator will have the same memory footprint as the
>>>>>>>> unpermuted matrix (32 MB). At any stage in PCTelescope, only 2 
>>>>>>>> operators of
>>>>>>>> size 32 MB are held in memory when the DMDA is provided.
>>>>>>>> >>>>>
>>>>>>>> >>>>> From my rough estimates, the worst case memory footprint for
>>>>>>>> any given core, given your options is approximately
>>>>>>>> >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB  = 2465 MB
>>>>>>>> >>>>> This is way below 8 GB.
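The arithmetic in this estimate can be reproduced as follows. The sketch only re-adds the numbers quoted above (variable names are illustrative). Note that the follow-up message in this thread corrects the fine-grid figure to roughly 1.8 MB per rank rather than 1.8 GB, so the total below reproduces the estimate as written, not a measured footprint.

```python
# Re-adding the rough per-rank estimate quoted above (all values in MB).
FINE_GRID_MB = 2100.0   # fine-grid operator values + indices, as quoted
RANKS = 18432           # ranks on the full communicator
REDUCTION = 64          # telescope reduction factor

# Coarser operators: each 3D coarsening divides the size by 8.
coarser_mb = sum(FINE_GRID_MB / 8**k for k in range(1, 5))

subcomm_ranks = RANKS // REDUCTION   # ranks in the telescope sub-communicator
gathered_mb = 0.5 * REDUCTION        # coarse matrix gathered onto fewer ranks

# Fine grid + coarser levels + temporary matrix + sub-comm matrix + permutation.
total_mb = FINE_GRID_MB + 300 + gathered_mb + gathered_mb + 1

if __name__ == "__main__":
    print(f"coarser levels: {coarser_mb:.0f} MB")
    print(f"sub-communicator size: {subcomm_ranks}")
    print(f"total: {total_mb:.0f} MB")
```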
>>>>>>>> >>>>>
>>>>>>>> >>>>> Note this estimate completely ignores:
>>>>>>>> >>>>> (1) the memory required for the restriction operator,
>>>>>>>> >>>>> (2) the potential growth in the number of non-zeros per row
>>>>>>>> due to Galerkin coarsening (I wish -ksp_view_pre reported the output
>>>>>>>> from MatView so we could see the number of non-zeros required by the
>>>>>>>> coarse level operators)
>>>>>>>> >>>>> (3) all temporary vectors required by the CG solver, and
>>>>>>>> those required by the smoothers.
>>>>>>>> >>>>> (4) internal memory allocated by MatPtAP
>>>>>>>> >>>>> (5) memory associated with IS's used within PCTelescope
>>>>>>>> >>>>>
>>>>>>>> >>>>> So either I am completely off in my estimates, or you have
>>>>>>>> not carefully estimated the memory usage of your application code.
>>>>>>>> Hopefully others might examine/correct my rough estimates
>>>>>>>> >>>>>
>>>>>>>> >>>>> Since I don't have your code I cannot assess the latter.
>>>>>>>> >>>>> Since I don't have access to the same machine you are running
>>>>>>>> on, I think we need to take a step back.
>>>>>>>> >>>>>
>>>>>>>> >>>>> [1] What machine are you running on? Send me a URL if it's
>>>>>>>> available
>>>>>>>> >>>>>
>>>>>>>> >>>>> [2] What discretization are you using? (I am guessing a
>>>>>>>> scalar 7 point FD stencil)
>>>>>>>> >>>>> If it's a 7 point FD stencil, we should be able to examine
>>>>>>>> the memory usage of your solver configuration using a standard, light
>>>>>>>> weight existing PETSc example, run on your machine at the same scale.
>>>>>>>> >>>>> This would hopefully enable us to correctly evaluate the
>>>>>>>> actual memory usage required by the solver configuration you are using.
>>>>>>>> >>>>>
>>>>>>>> >>>>> Thanks,
>>>>>>>> >>>>>   Dave
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>> Frank
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>>
>>>>>>>> >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> On Saturday, 9 July 2016, frank <[email protected]> wrote:
>>>>>>>> >>>>>> Hi Barry and Dave,
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Thank both of you for the advice.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> @Barry
>>>>>>>> >>>>>> I made a mistake in the file names in last email. I attached
>>>>>>>> the correct files this time.
>>>>>>>> >>>>>> For all the three tests, 'Telescope' is used as the coarse
>>>>>>>> preconditioner.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> == Test1:   Grid: 1536*128*384,   Process Mesh: 48*4*12
>>>>>>>> >>>>>> Part of the memory usage:  Vector   125   124   3971904   0.
>>>>>>>> >>>>>>                            Matrix   101   101   9462372   0
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> == Test2: Grid: 1536*128*384,   Process Mesh: 96*8*24
>>>>>>>> >>>>>> Part of the memory usage:  Vector   125   124   681672    0.
>>>>>>>> >>>>>>                            Matrix   101   101   1462180   0.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> In theory, the memory usage in Test1 should be 8 times that of
>>>>>>>> Test2. In my case, it is about 6 times.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> == Test3: Grid: 3072*256*768,   Process Mesh: 96*8*24.
>>>>>>>> Sub-domain per process: 32*32*32
>>>>>>>> >>>>>> Here I get the out of memory error.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> I tried to use -mg_coarse jacobi. In this way, I don't need
>>>>>>>> to set -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly, right?
>>>>>>>> >>>>>> The linear solver didn't work in this case. Petsc output
>>>>>>>> some errors.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> @Dave
>>>>>>>> >>>>>> In test3, I use only one instance of 'Telescope'. On the
>>>>>>>> coarse mesh of 'Telescope', I used LU as the preconditioner instead of
>>>>>>>> SVD.
>>>>>>>> >>>>>> If I set the levels correctly, then on the last coarse mesh
>>>>>>>> of MG where it calls 'Telescope', the sub-domain per process is 2*2*2.
>>>>>>>> >>>>>> On the last coarse mesh of 'Telescope', there is only one
>>>>>>>> grid point per process.
>>>>>>>> >>>>>> I still got the OOM error. The detailed petsc option file is
>>>>>>>> attached.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Do you understand the expected memory usage for the
>>>>>>>> particular parallel LU implementation you are using? I don't 
>>>>>>>> (seriously).
>>>>>>>> Replace LU with bjacobi and re-run this test. My point about solver
>>>>>>>> debugging is still valid.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> And please send the result of KSPView so we can see what is
>>>>>>>> actually used in the computations
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Thanks
>>>>>>>> >>>>>>   Dave
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Thank you so much.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Frank
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>>>>>>> >>>>>> On Jul 6, 2016, at 4:19 PM, frank <[email protected]> wrote:
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Hi Barry,
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Thank you for your advice.
>>>>>>>> >>>>>> I tried three tests. In the 1st test, the grid is
>>>>>>>> 3072*256*768 and the process mesh is 96*8*24.
>>>>>>>> >>>>>> The linear solver is 'cg', the preconditioner is 'mg' and
>>>>>>>> 'telescope' is used as the preconditioner at the coarse mesh.
>>>>>>>> >>>>>> The system gives me the "Out of Memory" error before the
>>>>>>>> linear system is completely solved.
>>>>>>>> >>>>>> The info from '-ksp_view_pre' is attached. It seems to me
>>>>>>>> that the error occurs when it reaches the coarse mesh.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> The 2nd test uses a grid of 1536*128*384 and process mesh is
>>>>>>>> 96*8*24. The 3rd test uses the same grid but a different process mesh
>>>>>>>> 48*4*12.
>>>>>>>> >>>>>>     Are you sure this is right? The total matrix and vector
>>>>>>>> memory usage goes from 2nd test
>>>>>>>> >>>>>>                Vector   384            383      8,193,712     0.
>>>>>>>> >>>>>>                Matrix   103            103     11,508,688     0.
>>>>>>>> >>>>>> to 3rd test
>>>>>>>> >>>>>>                Vector   384            383      1,590,520     0.
>>>>>>>> >>>>>>                Matrix   103            103      3,508,664     0.
>>>>>>>> >>>>>> that is the memory usage got smaller but if you have only
>>>>>>>> 1/8th the processes and the same grid it should have gotten about 8 
>>>>>>>> times
>>>>>>>> bigger. Did you maybe cut the grid by a factor of 8 also? If so that 
>>>>>>>> still
>>>>>>>> doesn't explain it because the memory usage changed by a factor of 5
>>>>>>>> something for the vectors and 3 something for the matrices.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> The linear solver and petsc options in the 2nd and 3rd tests are
>>>>>>>> the same as in the 1st test. The linear solver works fine in both tests.
>>>>>>>> >>>>>> I attached the memory usage of the 2nd and 3rd tests. The
>>>>>>>> memory info is from the option '-log_summary'. I tried to use
>>>>>>>> '-memory_info' as you suggested, but in my case petsc treated it as an
>>>>>>>> unused option. It output nothing about the memory. Do I need to add
>>>>>>>> something to my code so I can use '-memory_info'?
>>>>>>>> >>>>>>     Sorry, my mistake the option is -memory_view
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>    Can you run the one case with -memory_view and -mg_coarse
>>>>>>>> jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to
>>>>>>>
>>>>>>>
