Re: [petsc-users] Question about memory usage in Multigrid preconditioner

frank Fri, 09 Sep 2016 13:16:23 -0700

Hi Barry,

I think the first KSP view output is from -ksp_view_pre. Before I submitted the test, I was not sure whether there would be an OOM error or not, so I added both -ksp_view_pre and -ksp_view.


Frank


On 09/09/2016 12:38 PM, Barry Smith wrote:

   Why does ksp_view2.txt have two KSP views in it while ksp_view1.txt has only 
one KSPView in it? Did you run two different solves in case 2 but not in 
case 1?

   Barry

On Sep 9, 2016, at 10:56 AM, frank <hengj...@uci.edu> wrote:

Hi,

I want to continue digging into the memory problem here.
I did find a workaround in the past, which is to use fewer cores per node so 
that each core has 8G memory. However, this is inefficient and expensive. I hope 
to locate the place that uses the most memory.

Here is a brief summary of the tests I did in the past:

Test1:   Mesh 1536*128*384  |  Process Mesh 48*4*12

Maximum (over computational time) process memory:         total 7.0727e+08
Current process memory:                                   total 7.0727e+08
Maximum (over computational time) space PetscMalloc()ed:  total 6.3908e+11
Current space PetscMalloc()ed:                            total 1.8275e+09

Test2:    Mesh 1536*128*384  |  Process Mesh 96*8*24

Maximum (over computational time) process memory:         total 5.9431e+09
Current process memory:                                   total 5.9431e+09
Maximum (over computational time) space PetscMalloc()ed:  total 5.3202e+12
Current space PetscMalloc()ed:                            total 5.4844e+09

Test3:    Mesh 3072*256*768  |  Process Mesh 96*8*24

     The OOM (Out Of Memory) killer of the supercomputer terminated the job 
during "KSPSolve".

I attached the output of ksp_view (the third test's output is from 
ksp_view_pre), memory_view, and also the petsc options.

In all the tests, each core can access about 2G memory. In test3, there are 
4223139840 non-zeros in the matrix. This will consume about 1.74M per process, 
using double precision. Even allowing for the extra memory used to store the 
integer indices, 2G memory should still be more than enough.
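
To compare the -memory_view totals above with the roughly 2G available per 
core, it can help to turn them into per-process averages. The Python sketch 
below assumes that each reported "total" is a sum over all MPI processes, so 
dividing by the process count gives an average and not the per-process 
maximum. The last step scales Test2 by the 8x increase in grid points and 
assumes that memory grows linearly with problem size, which is only a rough 
guess and not a measurement.

```python
# Per-process averages from the -memory_view totals quoted above.
# Assumption: "total" is summed over all MPI processes.

def ranks(px, py, pz):
    """Number of MPI processes in a px*py*pz process mesh."""
    return px * py * pz

tests = {
    # name: (process count, max PetscMalloc()ed total in bytes)
    "Test1": (ranks(48, 4, 12), 6.3908e+11),
    "Test2": (ranks(96, 8, 24), 5.3202e+12),
}

MB = 1e6
per_rank_malloc_mb = {}
for name, (nproc, malloc_total) in tests.items():
    per_rank_malloc_mb[name] = malloc_total / nproc / MB
    print(f"{name}: {nproc} processes, "
          f"average peak PetscMalloc()ed {per_rank_malloc_mb[name]:.1f} MB")

# Test3 has 8x the grid points of Test2 on the same process mesh.
# Assumption (not measured): peak PetscMalloc()ed memory scales linearly.
test3_guess_mb = 8 * per_rank_malloc_mb["Test2"]
print(f"Test3 linear extrapolation: about {test3_guess_mb:.0f} MB per process")
```

Under those assumptions the extrapolated Test3 peak is on the order of 2 GB 
per process, which would make the peak (not the final) allocation the number 
to watch.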

Is there a way to find out which part of KSPSolve uses the most memory?
Thank you so much.

BTW, there are 4 options that remain unused and I don't understand why they 
are omitted:
-mg_coarse_telescope_mg_coarse_ksp_type value: preonly
-mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
-mg_coarse_telescope_mg_levels_ksp_max_it value: 1
-mg_coarse_telescope_mg_levels_ksp_type value: richardson


Regards,
Frank

On 07/13/2016 05:47 PM, Dave May wrote:


On 14 July 2016 at 01:07, frank <hengj...@uci.edu> wrote:
Hi Dave,

Sorry for the late reply.
Thank you so much for your detailed reply.

I have a question about the estimation of the memory usage. There are 
4223139840 allocated non-zeros and 18432 MPI processes. Double precision is 
used. So the memory per process is:
   4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74M ?
Did I do something wrong here? Because this seems too small.

No - I totally f***ed it up. You are correct. That'll teach me for fumbling 
around with my iphone calculator and not using my brain. (Note that to convert 
to MB just divide by 1e6, not 1024^2 - although I apparently cannot convert 
between units correctly....)
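
The per-process figure discussed above can be reproduced with a short Python 
sketch. It counts only the 8-byte values of the non-zeros, as in the 
calculation quoted above, and prints the result both in MiB (dividing by 
1024^2) and in MB (dividing by 1e6). Index storage is left out.

```python
# Memory per MPI process for the matrix values alone (no index arrays).
nnz = 4223139840          # allocated non-zeros, from the thread
nproc = 18432             # MPI processes (96*8*24)
bytes_per_value = 8       # double precision

bytes_per_proc = nnz * bytes_per_value / nproc
mib_per_proc = bytes_per_proc / 1024**2   # binary megabytes
mb_per_proc = bytes_per_proc / 1e6        # decimal megabytes

print(f"{bytes_per_proc:.0f} bytes = {mib_per_proc:.2f} MiB = {mb_per_proc:.2f} MB")
```

Either way the values come to well under 2 MB per process, so the choice of 
unit does not change the conclusion.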

From the PETSc objects associated with the solver, it looks like it _should_ 
run with 2GB per MPI rank. Sorry for my mistake. Possibilities are: somewhere 
in your usage of PETSc you've introduced a memory leak; PETSc is doing a huge 
over-allocation (e.g. as per our discussion of MatPtAP); or in your application 
code there are other objects you have forgotten to log the memory for.



I am running this job on Blue Waters.
I am using the 7-point FD stencil in 3D.

I thought so on both counts.

I apologize that I made a stupid mistake in computing the memory per core. With 
my settings, each core can access only 2G of memory on average, instead of the 
8G I mentioned in the previous email. I re-ran the job with 8G of memory per 
core on average and there was no "Out Of Memory" error. I will do more tests to 
see if there is still some memory issue.

Ok. I'd still like to know where the memory was being used since my estimates 
were off.


Thanks,
   Dave

Regards,
Frank



On 07/11/2016 01:18 PM, Dave May wrote:

Hi Frank,


On 11 July 2016 at 19:14, frank <hengj...@uci.edu> wrote:
Hi Dave,

I re-ran the test using bjacobi as the preconditioner on the coarse mesh of 
telescope. The grid is 3072*256*768 and the process mesh is 96*8*24. The petsc 
option file is attached.
I still got the "Out Of Memory" error. The error occurred before the linear 
solver finished one step, so I don't have the full info from ksp_view. The info 
from ksp_view_pre is attached.

Okay - that is essentially useless (sorry)

It seems to me that the error occurred when the decomposition was going to be 
changed.

Based on what information?
Running with -info would give us more clues, but will create a ton of output.
Please try running the case which failed with -info

I had another test with a grid of 1536*128*384 and the same process mesh as above. There was no error. The ksp_view info is attached for comparison.

Thank you.


[3] Here is my crude estimate of your memory usage.
I'll target the biggest memory hogs only to get an order of magnitude estimate

* The Fine grid operator contains 4223139840 non-zeros --> 1.8 GB per MPI rank 
assuming double precision.
The indices for the AIJ could amount to another 0.3 GB (assuming 32 bit 
integers)

* You use 5 levels of coarsening, so the other operators should represent 
(collectively)
2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4  ~ 300 MB per MPI rank on the 
communicator with 18432 ranks.
The coarse grid should consume ~ 0.5 MB per MPI rank on the communicator with 
18432 ranks.

* You use a reduction factor of 64, making the new communicator with 288 MPI 
ranks.
PCTelescope will first gather a temporary matrix associated with your coarse 
level operator assuming a comm size of 288 living on the comm with size 18432.
This matrix will require approximately 0.5 * 64 = 32 MB per core on the 288 
ranks.
This matrix is then used to form a new MPIAIJ matrix on the subcomm, thus 
require another 32 MB per rank.
The temporary matrix is now destroyed.

* Because a DMDA is detected, a permutation matrix is assembled.
This requires 2 doubles per point in the DMDA.
Your coarse DMDA contains 192 x 16 x 48 points.
Thus the permutation matrix will require < 1 MB per MPI rank on the sub-comm.

* Lastly, the matrix is permuted. This uses MatPtAP(), but the resulting 
operator will have the same memory footprint as the unpermuted matrix (32 MB). 
At any stage in PCTelescope, only 2 operators of size 32 MB are held in memory 
when the DMDA is provided.

From my rough estimates, the worst-case memory footprint for any given core, 
given your options, is approximately
2100 MB + 300 MB + 32 MB + 32 MB + 1 MB  = 2465 MB
This is way below 8 GB.
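
The sum in this estimate can be re-derived with a small Python sketch. It takes 
the per-rank figures exactly as quoted above as inputs (they are not recomputed 
from the non-zero count, and the fine-grid figure in particular is revisited in 
the follow-up messages) and only reproduces the arithmetic: a geometric series 
for the four coarser levels, and the factor-of-64 gather for PCTelescope.

```python
# Reproduces the arithmetic of the rough estimate above, in MB per MPI rank.
# All inputs are the figures quoted in the estimate, taken as given.
fine_operator_mb = 2100.0      # 1.8 GB values + 0.3 GB indices, as quoted
coarsening_ratio = 8           # 2x coarsening in each of 3 directions
n_coarser_levels = 4           # 5 levels in total

# 2.1/8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4, expressed in MB
coarser_levels_mb = sum(fine_operator_mb / coarsening_ratio**k
                        for k in range(1, n_coarser_levels + 1))

coarse_grid_mb = 0.5           # coarse operator per rank on 18432 ranks
reduction_factor = 64
total_ranks = 18432
sub_ranks = total_ranks // reduction_factor        # ranks on the sub-communicator
gathered_mb = coarse_grid_mb * reduction_factor    # temporary gathered matrix
subcomm_mb = gathered_mb                           # new MPIAIJ matrix on the subcomm
permutation_mb = 1.0                               # upper bound quoted above

total_mb = (fine_operator_mb + round(coarser_levels_mb)
            + gathered_mb + subcomm_mb + permutation_mb)

print(f"sub-communicator ranks: {sub_ranks}")
print(f"coarser levels: {coarser_levels_mb:.1f} MB")
print(f"worst-case total: {total_mb:.0f} MB")
```

Changing `fine_operator_mb` shows how strongly the total depends on the 
fine-grid figure compared with the PCTelescope terms.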

Note this estimate completely ignores:
(1) the memory required for the restriction operator,
(2) the potential growth in the number of non-zeros per row due to Galerkin 
coarsening (I wish -ksp_view_pre reported the output from MatView so we could 
see the number of non-zeros required by the coarse level operators)
(3) all temporary vectors required by the CG solver, and those required by the 
smoothers.
(4) internal memory allocated by MatPtAP
(5) memory associated with IS's used within PCTelescope

So either I am completely off in my estimates, or you have not carefully 
estimated the memory usage of your application code. Hopefully others might 
examine/correct my rough estimates

Since I don't have your code I cannot assess the latter.
Since I don't have access to the same machine you are running on, I think we 
need to take a step back.

[1] What machine are you running on? Send me a URL if it's available

[2] What discretization are you using? (I am guessing a scalar 7 point FD 
stencil)
If it's a 7 point FD stencil, we should be able to examine the memory usage of 
your solver configuration using a standard, lightweight existing PETSc 
example, run on your machine at the same scale.
This would hopefully enable us to correctly evaluate the actual memory usage 
required by the solver configuration you are using.

Thanks,
   Dave


Frank




On 07/08/2016 10:38 PM, Dave May wrote:


On Saturday, 9 July 2016, frank <hengj...@uci.edu> wrote:
Hi Barry and Dave,

Thank both of you for the advice.

@Barry
I made a mistake in the file names in the last email. I attached the correct 
files this time.
For all the three tests, 'Telescope' is used as the coarse preconditioner.

== Test1:   Grid: 1536*128*384,   Process Mesh: 48*4*12
Part of the memory usage:  Vector   125   124   3971904   0.
                           Matrix   101   101   9462372   0.

== Test2: Grid: 1536*128*384,   Process Mesh: 96*8*24
Part of the memory usage:  Vector   125   124    681672   0.
                           Matrix   101   101   1462180   0.

In theory, the memory usage in Test1 should be 8 times that of Test2. In my 
case, it is about 6 times.
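
The "about 6 times" can be checked directly from the two summaries above. This 
Python sketch just divides the Test1 byte counts by the Test2 byte counts and 
compares the result with the ratio of process counts; it does not try to 
explain the gap.

```python
# Ratio of Vector/Matrix memory between Test1 and Test2 (numbers from above).
test1 = {"procs": 48 * 4 * 12, "Vector": 3971904, "Matrix": 9462372}
test2 = {"procs": 96 * 8 * 24, "Vector": 681672, "Matrix": 1462180}

# Same grid on 8x the processes, so 8x is the naive expectation.
expected_ratio = test2["procs"] / test1["procs"]
ratios = {kind: test1[kind] / test2[kind] for kind in ("Vector", "Matrix")}

print(f"expected ratio: {expected_ratio:.0f}")
for kind, ratio in ratios.items():
    print(f"{kind}: {ratio:.2f}")
```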

== Test3: Grid: 3072*256*768,   Process Mesh: 96*8*24. Sub-domain per process: 
32*32*32
Here I get the out of memory error.

I tried to use -mg_coarse jacobi. In this way, I don't need to set 
-mg_coarse_ksp_type and -mg_coarse_pc_type explicitly, right?
The linear solver didn't work in this case. Petsc output some errors.

@Dave
In test3, I use only one instance of 'Telescope'. On the coarse mesh of 
'Telescope', I used LU as the preconditioner instead of SVD.
If I set the levels correctly, then on the last coarse mesh of MG where it 
calls 'Telescope', the sub-domain per process is 2*2*2.
On the last coarse mesh of 'Telescope', there is only one grid point per 
process.
I still got the OOM error. The detailed petsc option file is attached.
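
The level sizes described here can be tabulated with a short Python sketch. It 
assumes a coarsening factor of 2 in each direction per level and 5 multigrid 
levels on the full communicator; both are inferred from the 32*32*32 and 2*2*2 
sub-domain sizes rather than read from the options file. The reduction factor 
of 64 is the one quoted elsewhere in this thread.

```python
# Sub-domain size per process on each MG level, then after PCTelescope.
# Assumptions: 2x coarsening per direction per level, 5 levels on the full comm.
grid = (3072, 256, 768)
proc_mesh = (96, 8, 24)
n_levels = 5
reduction_factor = 64      # telescope reduction factor quoted in this thread

nproc = proc_mesh[0] * proc_mesh[1] * proc_mesh[2]
fine_local = tuple(g // p for g, p in zip(grid, proc_mesh))

# Local sub-domain on each level, finest first
local_sizes = [tuple(n // 2**level for n in fine_local)
               for level in range(n_levels)]

# Global coarsest grid on the full communicator, and its share per sub-comm rank
coarse_grid = tuple(g // 2**(n_levels - 1) for g in grid)
sub_ranks = nproc // reduction_factor
coarse_points = coarse_grid[0] * coarse_grid[1] * coarse_grid[2]
points_per_sub_rank = coarse_points // sub_ranks

for level, size in enumerate(local_sizes):
    print(f"level {level}: {size[0]}*{size[1]}*{size[2]} per process")
print(f"coarse grid {coarse_grid} on {sub_ranks} ranks: "
      f"{points_per_sub_rank} points per process")
```

Under these assumptions each process on the sub-communicator starts with 512 
points (8*8*8), so three further 2x coarsenings inside 'Telescope' would reach 
one grid point per process, consistent with the description above.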

Do you understand the expected memory usage for the particular parallel LU 
implementation you are using? I don't (seriously). Replace LU with bjacobi and 
re-run this test. My point about solver debugging is still valid.

And please send the result of KSPView so we can see what is actually used in 
the computations

Thanks
   Dave


Thank you so much.

Frank



On 07/06/2016 02:51 PM, Barry Smith wrote:
On Jul 6, 2016, at 4:19 PM, frank <hengj...@uci.edu> wrote:

Hi Barry,

Thank you for your advice.
I tried three tests. In the 1st test, the grid is 3072*256*768 and the process 
mesh is 96*8*24.
The linear solver is 'cg', the preconditioner is 'mg', and 'telescope' is used 
as the preconditioner at the coarse mesh.
The system gives me the "Out of Memory" error before the linear system is 
completely solved.
The info from '-ksp_view_pre' is attached. It seems to me that the error occurs 
when it reaches the coarse mesh.

The 2nd test uses a grid of 1536*128*384 and process mesh is 96*8*24. The 3rd 
test uses the same grid but a different process mesh 48*4*12.
     Are you sure this is right? The total matrix and vector memory usage goes 
from 2nd test
                Vector   384            383      8,193,712     0.
                Matrix   103            103     11,508,688     0.
to 3rd test
                Vector   384            383      1,590,520     0.
                Matrix   103            103      3,508,664     0.
that is the memory usage got smaller but if you have only 1/8th the processes 
and the same grid it should have gotten about 8 times bigger. Did you maybe cut 
the grid by a factor of 8 also? If so that still doesn't explain it because the 
memory usage changed by a factor of 5 something for the vectors and 3 something 
for the matrices.


The linear solver and petsc options in the 2nd and 3rd tests are the same as in 
the 1st test. The linear solver works fine in both tests.
I attached the memory usage of the 2nd and 3rd tests. The memory info is from 
the option '-log_summary'. I tried to use '-memory_info' as you suggested, but 
in my case petsc treated it as an unused option. It output nothing about the 
memory. Do I need to add something to my code so I can use '-memory_info'?
     Sorry, my mistake; the option is -memory_view

    Can you run the one case with -memory_view and -mg_coarse jacobi 
-ksp_max_it 1 (just so it doesn't iterate forever) to see how much memory is 
used without the telescope? Also run case 2 the same way.

    Barry



In both tests the memory usage is not large.

It seems to me that it might be the 'telescope'  preconditioner that allocated 
a lot of memory and caused the error in the 1st test.
Is there a way to show how much memory it allocated?

Frank

On 07/05/2016 03:37 PM, Barry Smith wrote:
    Frank,

      You can run with -ksp_view_pre to have it "view" the KSP before the solve 
so hopefully it gets that far.

       Please run the problem that does fit with -memory_info; when the problem 
completes, it will show the "high water mark" for PETSc allocated memory and 
total memory used. We first want to look at these numbers to see if it is using 
more memory than you expect. You could also run with, say, half the grid 
spacing to see how the memory usage scales with the increase in grid points. 
Make the runs also with -log_view and send all the output from these options.

     Barry

On Jul 5, 2016, at 5:23 PM, frank <hengj...@uci.edu> wrote:

Hi,

I am using the CG ksp solver and Multigrid preconditioner to solve a linear 
system in parallel.
I chose to use 'Telescope' as the preconditioner on the coarse mesh for its 
good performance.
The petsc options file is attached.

The domain is a 3d box.
It works well when the grid is 1536*128*384 and the process mesh is 96*8*24. 
When I double the size of the grid and keep the same process mesh and petsc 
options, I get an "out of memory" error from the super-cluster I am using.
Each process has access to at least 8G memory, which should be more than enough 
for my application. I am sure that all the other parts of my code (except the 
linear solver) do not use much memory. So I suspect there is something wrong 
with the linear solver.
The error occurs before the linear system is completely solved, so I don't have 
the info from ksp view. I am not able to reproduce the error with a smaller 
problem either.
In addition, I tried to use block jacobi as the preconditioner with the same 
grid and same decomposition. The linear solver runs extremely slowly but there 
is no memory error.

How can I diagnose what exactly causes the error?
Thank you so much.

Frank
<petsc_options.txt>
<ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>

<ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
