Hi Dave,

I added both options and tested them by solving the Poisson equation on a 1024^3 grid with 32^3 cores. This test used to give the OOM error. Now it runs well.
I have attached the output of ksp_view and log_view in case you want to look at it.
I also tested my original code with those PETSc options by simulating decaying turbulence in a 1024^3 box. It also works. I am going to test the code at a larger scale. If any problem comes up then, I will let you know.
This really helps a lot. Thank you so much.

Regards,
Frank


On 9/15/2016 3:35 AM, Dave May wrote:
Hi all,

The only unexpected memory usage I can see is associated with the call to MatPtAP().
Here is something you can try immediately.
Run your code with the additional options
  -matrap 0 -matptap_scalable

I didn't realize this before, but the default behaviour of MatPtAP in parallel is actually to explicitly form the transpose of P (e.g. assemble R = P^T) and then compute R.A.P.
You don't want to do this. The option -matrap 0 resolves this issue.

The implementation of P^T.A.P has two variants.
The scalable implementation (with respect to memory usage) is selected via the second option -matptap_scalable.

Try it out - I see a significant memory reduction using these options for particular mesh sizes / partitions.

I've attached a cleaned up version of the code you sent me.
There were a number of memory leaks and other issues.
The main points being
  * You should call DMDAVecGetArrayF90() before VecAssembly{Begin,End}
  * You should call PetscFinalize(), otherwise the option -log_summary (-log_view) will not display anything once the program has completed.


Thanks,
  Dave


On 15 September 2016 at 08:03, Hengjie Wang <hengj...@uci.edu <mailto:hengj...@uci.edu>> wrote:

    Hi Dave,

    Sorry, I should have added more comments to explain the code.
    The number of processes in each dimension is the same: Px = Py = Pz = P.
    The same holds for the domain size.
    So if you want to run the code on a 512^3 grid with
    16^3 cores, you need to set "-N 512 -P 16" on the command line.
    I have added more comments and also fixed an error in the attached code.
    (The error only affects the accuracy of the solution, not the memory
    usage.)
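As a quick sanity check of the -N / -P convention described above, here is a minimal Python sketch. It assumes N is divisible by P and that grid points are split evenly among processes; the helper name `decomposition` is hypothetical and is not part of the attached code.

```python
def decomposition(n, p):
    """Return (total ranks, local points per dimension, local points per rank).

    Assumes a cubic grid of n^3 points on a cubic process mesh of p^3 ranks,
    with n divisible by p (an assumption made for this illustration).
    """
    if n % p != 0:
        raise ValueError("N must be divisible by P for an even split")
    local = n // p
    return p ** 3, local, local ** 3


# The two configurations mentioned in this thread.
for n, p in [(512, 16), (1024, 32)]:
    ranks, local, points = decomposition(n, p)
    print(f"-N {n} -P {p}: {ranks} ranks, {local}^3 = {points} points per rank")
```

Both configurations give the same 32^3 sub-domain per rank, so under these assumptions the per-rank load is unchanged while the rank count grows from 4096 to 32768.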

    Thank you.
    Frank


    On 9/14/2016 9:05 PM, Dave May wrote:


    On Thursday, 15 September 2016, Dave May <dave.mayhe...@gmail.com
    <mailto:dave.mayhe...@gmail.com>> wrote:



        On Thursday, 15 September 2016, frank <hengj...@uci.edu> wrote:

            Hi,

            I wrote a simple code to reproduce the error. I hope
            this can help to diagnose the problem.
            The code just solves a 3D Poisson equation.


        Why is the stencil width a runtime parameter?? And why is the
        default value 2? For 7-pnt FD Laplace, you only need
        a stencil width of 1.

        Was this choice made to mimic something in the
        real application code?


    Please ignore - I misunderstood your usage of the param set by -P


            I ran the code on a 1024^3 mesh. The process partition is
            32 * 32 * 32. That's when I reproduced the OOM error.
            Each core has about 2G of memory.
            I also ran the code on a 512^3 mesh with 16 * 16 * 16
            processes. The KSP solver works fine.
            I attached the code, the output of ksp_view_pre and my petsc
            option file.

            Thank you.
            Frank

            On 09/09/2016 06:38 PM, Hengjie Wang wrote:
            Hi Barry,

            I checked. On the supercomputer, I had the option
            "-ksp_view_pre" but it is not in the file I sent you. I am
            sorry for the confusion.

            Regards,
            Frank

            On Friday, September 9, 2016, Barry Smith
            <bsm...@mcs.anl.gov> wrote:


                > On Sep 9, 2016, at 3:11 PM, frank
                <hengj...@uci.edu> wrote:
                >
                > Hi Barry,
                >
                > I think the first KSP view output is from
                -ksp_view_pre. Before I submitted the test, I was
                not sure whether there would be OOM error or not. So
                I added both -ksp_view_pre and -ksp_view.

                  But the options file you sent specifically does
                NOT list the -ksp_view_pre so how could it be from that?

                   Sorry to be pedantic but I've spent too much time
                in the past trying to debug from incorrect
                information and want to make sure that the
                information I have is correct before thinking.
                Please recheck exactly what happened. Rerun with the
                exact input file you emailed if that is needed.

                   Barry

                >
                > Frank
                >
                >
                > On 09/09/2016 12:38 PM, Barry Smith wrote:
                >>   Why does ksp_view2.txt have two KSP views in it
                while ksp_view1.txt has only one KSPView in it? Did
                you run two different solves in the 2 case but not
                the one?
                >>
                >>   Barry
                >>
                >>
                >>
                >>> On Sep 9, 2016, at 10:56 AM, frank
                <hengj...@uci.edu> wrote:
                >>>
                >>> Hi,
                >>>
                >>> I want to continue digging into the memory
                problem here.
                >>> I did find a workaround in the past, which is
                to use fewer cores per node so that each core has 8G
                memory. However, this is inefficient and expensive. I
                hope to locate the place that uses the most memory.
                >>>
                >>> Here is a brief summary of the tests I did in past:
                >>>> Test1:   Mesh 1536*128*384  |  Process Mesh 48*4*12
                >>> Maximum (over computational time) process memory:        total 7.0727e+08
                >>> Current process memory:                                  total 7.0727e+08
                >>> Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11
                >>> Current space PetscMalloc()ed:                           total 1.8275e+09
                >>>
                >>>> Test2:    Mesh 1536*128*384  |  Process Mesh
                96*8*24
                >>> Maximum (over computational time) process memory:        total 5.9431e+09
                >>> Current process memory:                                  total 5.9431e+09
                >>> Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12
                >>> Current space PetscMalloc()ed:                           total 5.4844e+09
                >>>
                >>>> Test3:    Mesh 3072*256*768  |  Process Mesh
                96*8*24
                >>>     OOM( Out Of Memory ) killer of the
                supercomputer terminated the job during "KSPSolve".
                >>>
                >>> I attached the output of ksp_view( the third
                test's output is from ksp_view_pre ), memory_view
                and also the petsc options.
                >>>
                >>> In all the tests, each core can access about 2G
                memory. In test3, there are 4223139840 non-zeros in
                the matrix. This will consume about 1.74M per process,
                using double precision. Even allowing for the extra
                memory used to store integer indices, 2G of memory
                should still be more than enough.
                >>>
                >>> Is there a way to find out which part of
                KSPSolve uses the most memory?
                >>> Thank you so much.
                >>>
                >>> BTW, there are 4 options that remain unused and I
                don't understand why they are omitted:
                >>> -mg_coarse_telescope_mg_coarse_ksp_type value:
                preonly
                >>> -mg_coarse_telescope_mg_coarse_pc_type value:
                bjacobi
                >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
                >>> -mg_coarse_telescope_mg_levels_ksp_type value:
                richardson
                >>>
                >>>
                >>> Regards,
                >>> Frank
                >>>
                >>> On 07/13/2016 05:47 PM, Dave May wrote:
                >>>>
                >>>> On 14 July 2016 at 01:07, frank
                <hengj...@uci.edu> wrote:
                >>>> Hi Dave,
                >>>>
                >>>> Sorry for the late reply.
                >>>> Thank you so much for your detailed reply.
                >>>>
                >>>> I have a question about the estimation of the
                memory usage. There are 4223139840 allocated
                non-zeros and 18432 MPI processes. Double precision
                is used. So the memory per process is:
                >>>>   4223139840 * 8bytes / 18432 / 1024 / 1024 =
                1.74M ?
                >>>> Did I do sth wrong here? Because this seems too
                small.
                >>>>
                >>>> No - I totally f***ed it up. You are correct.
                That'll teach me for fumbling around with my iphone
                calculator and not using my brain. (Note that to
                convert to MB just divide by 1e6, not 1024^2 -
                although I apparently cannot convert between units
                correctly....)
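The per-process figure discussed above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic from the thread: it counts the matrix values alone (no integer indices), and the variable names are illustrative.

```python
# Figures quoted in the thread.
NNZ = 4223139840          # allocated non-zeros in the fine-grid matrix
RANKS = 18432             # MPI processes (96 * 8 * 24)
BYTES_PER_DOUBLE = 8      # double precision

bytes_per_rank = NNZ * BYTES_PER_DOUBLE // RANKS

# Two common unit conventions: binary (MiB) and decimal (MB).
mib_per_rank = bytes_per_rank / 1024 ** 2
mb_per_rank = bytes_per_rank / 1e6

print(f"{bytes_per_rank} bytes per rank")
print(f"{mib_per_rank:.3f} MiB per rank (dividing by 1024^2)")
print(f"{mb_per_rank:.3f} MB per rank (dividing by 1e6)")
```

Either way the matrix values come to under 2 MB per rank, which matches the "1.74M" estimate and is roughly a thousand times smaller than the 2G available per core.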
                >>>>
                >>>> From the PETSc objects associated with the
                solver, It looks like it _should_ run with 2GB per
                MPI rank. Sorry for my mistake. Possibilities are:
                somewhere in your usage of PETSc you've introduced a
                memory leak; PETSc is doing a huge over allocation
                (e.g. as per our discussion of MatPtAP); or in your
                application code there are other objects you have
                forgotten to log the memory for.
                >>>>
                >>>>
                >>>>
                >>>> I am running this job on Bluewater
                >>>> I am using the 7 points FD stencil in 3D.
                >>>>
                >>>> I thought so on both counts.
                >>>>
                >>>> I apologize that I made a stupid mistake in
                computing the memory per core. With my settings, each
                core can access only 2G of memory on average
                instead of the 8G which I mentioned in my previous email. I
                re-ran the job with 8G of memory per core on average
                and there was no "Out Of Memory" error. I will do
                more tests to see if there is still some memory issue.
                >>>>
                >>>> Ok. I'd still like to know where the memory was
                being used since my estimates were off.
                >>>>
                >>>>
                >>>> Thanks,
                >>>>   Dave
                >>>>
                >>>> Regards,
                >>>> Frank
                >>>>
                >>>>
                >>>>
                >>>> On 07/11/2016 01:18 PM, Dave May wrote:
                >>>>> Hi Frank,
                >>>>>
                >>>>>
                >>>>> On 11 July 2016 at 19:14, frank
                <hengj...@uci.edu> wrote:
                >>>>> Hi Dave,
                >>>>>
                >>>>> I re-ran the test using bjacobi as the
                preconditioner on the coarse mesh of telescope. The
                grid is 3072*256*768 and the process mesh is 96*8*24.
                The petsc option file is attached.
                >>>>> I still got the "Out Of Memory" error. The
                error occurred before the linear solver finished one
                step. So I don't have the full info from ksp_view.
                The info from ksp_view_pre is attached.
                >>>>>
                >>>>> Okay - that is essentially useless (sorry)
                >>>>>
                >>>>> It seems to me that the error occurred when
                the decomposition was going to be changed.
                >>>>>
                >>>>> Based on what information?
                >>>>> Running with -info would give us more clues,
                but will create a ton of output.
                >>>>> Please try running the case which failed with
                -info
                >>>>>  I had another test with a grid of
                1536*128*384 and the same process mesh as above.
                There was no error. The ksp_view info is attached
                for comparison.
                >>>>> Thank you.
                >>>>>
                >>>>>
                >>>>> [3] Here is my crude estimate of your memory
                usage.
                >>>>> I'll target the biggest memory hogs only to
                get an order of magnitude estimate
                >>>>>
                >>>>> * The Fine grid operator contains 4223139840
                non-zeros --> 1.8 GB per MPI rank assuming double
                precision.
                >>>>> The indices for the AIJ could amount to
                another 0.3 GB (assuming 32 bit integers)
                >>>>>
                >>>>> * You use 5 levels of coarsening, so the other
                operators should represent (collectively)
                >>>>> 2.1 / 8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4  ~ 300
                MB per MPI rank on the communicator with 18432 ranks.
                >>>>> The coarse grid should consume ~ 0.5 MB per
                MPI rank on the communicator with 18432 ranks.
                >>>>>
                >>>>> * You use a reduction factor of 64, making the
                new communicator with 288 MPI ranks.
                >>>>> PCTelescope will first gather a temporary
                matrix associated with your coarse level operator
                assuming a comm size of 288 living on the comm with
                size 18432.
                >>>>> This matrix will require approximately 0.5 *
                64 = 32 MB per core on the 288 ranks.
                >>>>> This matrix is then used to form a new MPIAIJ
                matrix on the subcomm, thus require another 32 MB
                per rank.
                >>>>> The temporary matrix is now destroyed.
                >>>>>
                >>>>> * Because a DMDA is detected, a permutation
                matrix is assembled.
                >>>>> This requires 2 doubles per point in the DMDA.
                >>>>> Your coarse DMDA contains 92 x 16 x 48 points.
                >>>>> Thus the permutation matrix will require < 1
                MB per MPI rank on the sub-comm.
                >>>>>
                >>>>> * Lastly, the matrix is permuted. This uses
                MatPtAP(), but the resulting operator will have the
                same memory footprint as the unpermuted matrix (32
                MB). At any stage in PCTelescope, only 2 operators
                of size 32 MB are held in memory when the DMDA is
                provided.
                >>>>>
                >>>>> From my rough estimates, the worst case memory
                foot print for any given core, given your options is
                approximately
                >>>>> 2100 MB + 300 MB + 32 MB + 32 MB + 1 MB  = 2465 MB
                >>>>> This is way below 8 GB.
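The rough sum above can be replayed in Python as follows. This is a sketch that reproduces the estimate exactly as posted; note that the follow-up reply in this thread revises the fine-grid figure down to roughly 1.8 MB per rank, so the 2100 MB input here is the original (overestimated) value. All variable names are illustrative.

```python
# Inputs as posted in the estimate (MB per MPI rank).
fine_level_mb = 2100.0        # fine-grid values (1.8 GB) + AIJ indices (0.3 GB), as posted
telescope_gather_mb = 32.0    # temporary gathered coarse matrix (0.5 MB * 64)
telescope_subcomm_mb = 32.0   # new MPIAIJ matrix on the sub-communicator
permutation_mb = 1.0          # upper bound quoted for the permutation matrix

# Each coarser level is assumed to hold 1/8 of the previous level's memory
# (factor-2 coarsening in 3D), for the four terms written in the estimate.
coarser_levels_mb = sum(fine_level_mb / 8 ** k for k in range(1, 5))

# The post rounds the coarser-level contribution to 300 MB before summing.
total_mb = (fine_level_mb + round(coarser_levels_mb)
            + telescope_gather_mb + telescope_subcomm_mb + permutation_mb)

print(f"coarser levels: {coarser_levels_mb:.1f} MB")
print(f"worst-case total: {total_mb:.0f} MB")
```

The geometric series contributes only about one seventh of the fine-level cost, so in this kind of estimate the fine-grid operator dominates whatever its true size turns out to be.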
                >>>>>
                >>>>> Note this estimate completely ignores:
                >>>>> (1) the memory required for the restriction
                operator,
                >>>>> (2) the potential growth in the number of
                non-zeros per row due to Galerkin coarsening (I
                wished -ksp_view_pre reported the output from
                MatView so we could see the number of non-zeros
                required by the coarse level operators)
                >>>>> (3) all temporary vectors required by the CG
                solver, and those required by the smoothers.
                >>>>> (4) internal memory allocated by MatPtAP
                >>>>> (5) memory associated with IS's used within
                PCTelescope
                >>>>>
                >>>>> So either I am completely off in my estimates,
                or you have not carefully estimated the memory usage
                of your application code. Hopefully others might
                examine/correct my rough estimates
                >>>>>
                >>>>> Since I don't have your code I cannot assess
                the latter.
                >>>>> Since I don't have access to the same machine
                you are running on, I think we need to take a step back.
                >>>>>
                >>>>> [1] What machine are you running on? Send me a
                URL if it's available
                >>>>>
                >>>>> [2] What discretization are you using? (I am
                guessing a scalar 7 point FD stencil)
                >>>>> If it's a 7 point FD stencil, we should be
                able to examine the memory usage of your solver
                configuration using a standard, light weight
                existing PETSc example, run on your machine at the
                same scale.
                >>>>> This would hopefully enable us to correctly
                evaluate the actual memory usage required by the
                solver configuration you are using.
                >>>>>
                >>>>> Thanks,
                >>>>>   Dave
                >>>>>
                >>>>>
                >>>>> Frank
                >>>>>
                >>>>>
                >>>>>
                >>>>>
                >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
                >>>>>>
                >>>>>> On Saturday, 9 July 2016, frank
                <hengj...@uci.edu> wrote:
                >>>>>> Hi Barry and Dave,
                >>>>>>
                >>>>>> Thank both of you for the advice.
                >>>>>>
                >>>>>> @Barry
                >>>>>> I made a mistake in the file names in last
                email. I attached the correct files this time.
                >>>>>> For all the three tests, 'Telescope' is used
                as the coarse preconditioner.
                >>>>>>
                >>>>>> == Test1:  Grid: 1536*128*384,   Process
                Mesh: 48*4*12
                >>>>>> Part of the memory usage:  Vector   125 124
                3971904     0.
                >>>>>> Matrix   101 101      9462372     0
                >>>>>>
                >>>>>> == Test2: Grid: 1536*128*384,   Process Mesh:
                96*8*24
                >>>>>> Part of the memory usage:  Vector   125 124
                681672     0.
                >>>>>> Matrix   101 101      1462180     0.
                >>>>>>
                >>>>>> In theory, the memory usage in Test1 should
                be 8 times that of Test2. In my case, it is about 6 times.
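The comparison above can be checked directly from the reported numbers. The following Python sketch uses only the Vector and Matrix figures and the process meshes listed for Test1 and Test2; the variable names are illustrative.

```python
# Memory figures reported for the two tests (same grid, different process mesh).
test1 = {"ranks": 48 * 4 * 12, "vector": 3971904, "matrix": 9462372}
test2 = {"ranks": 96 * 8 * 24, "vector": 681672, "matrix": 1462180}

# With the same grid, per-process memory should scale inversely with rank count.
expected_ratio = test2["ranks"] / test1["ranks"]

vector_ratio = test1["vector"] / test2["vector"]
matrix_ratio = test1["matrix"] / test2["matrix"]

print(f"expected ratio: {expected_ratio:.0f}")
print(f"vector ratio:   {vector_ratio:.2f}")
print(f"matrix ratio:   {matrix_ratio:.2f}")
```

The observed ratios are roughly 5.8 for vectors and 6.5 for matrices, against an ideal of 8.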
                >>>>>>
                >>>>>> == Test3: Grid: 3072*256*768,   Process Mesh:
                96*8*24. Sub-domain per process: 32*32*32
                >>>>>> Here I get the out of memory error.
                >>>>>>
                >>>>>> I tried to use -mg_coarse jacobi. In this
                way, I don't need to set -mg_coarse_ksp_type and
                -mg_coarse_pc_type explicitly, right?
                >>>>>> The linear solver didn't work in this case.
                Petsc output some errors.
                >>>>>>
                >>>>>> @Dave
                >>>>>> In test3, I use only one instance of
                'Telescope'. On the coarse mesh of 'Telescope', I
                used LU as the preconditioner instead of SVD.
                >>>>>> If I set the levels correctly, then on the
                last coarse mesh of MG where it calls 'Telescope',
                the sub-domain per process is 2*2*2.
                >>>>>> On the last coarse mesh of 'Telescope', there
                is only one grid point per process.
                >>>>>> I still got the OOM error. The detailed petsc
                option file is attached.
                >>>>>>
                >>>>>> Do you understand the expected memory usage
                for the particular parallel LU implementation you
                are using? I don't (seriously). Replace LU with
                bjacobi and re-run this test. My point about solver
                debugging is still valid.
                >>>>>>
                >>>>>> And please send the result of KSPView so we
                can see what is actually used in the computations
                >>>>>>
                >>>>>> Thanks
                >>>>>>   Dave
                >>>>>>
                >>>>>>
                >>>>>> Thank you so much.
                >>>>>>
                >>>>>> Frank
                >>>>>>
                >>>>>>
                >>>>>>
                >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
                >>>>>> On Jul 6, 2016, at 4:19 PM, frank
                <hengj...@uci.edu> wrote:
                >>>>>>
                >>>>>> Hi Barry,
                >>>>>>
                >>>>>> Thank you for your advice.
                >>>>>> I tried three tests. In the 1st test, the grid
                is 3072*256*768 and the process mesh is 96*8*24.
                >>>>>> The linear solver is 'cg', the preconditioner
                is 'mg', and 'telescope' is used as the
                preconditioner on the coarse mesh.
                >>>>>> The system gives me the "Out of Memory" error
                before the linear system is completely solved.
                >>>>>> The info from '-ksp_view_pre' is attached. It
                seems to me that the error occurs when it reaches
                the coarse mesh.
                >>>>>>
                >>>>>> The 2nd test uses a grid of 1536*128*384 and
                process mesh is 96*8*24. The 3rd  test uses the same
                grid but a different process mesh 48*4*12.
                >>>>>>     Are you sure this is right? The total
                matrix and vector memory usage goes from 2nd test
                >>>>>>   Vector   384            383 8,193,712     0.
                >>>>>>   Matrix   103            103  11,508,688     0.
                >>>>>> to 3rd test
                >>>>>>  Vector   384            383 1,590,520     0.
                >>>>>>   Matrix   103            103 3,508,664     0.
                >>>>>> that is the memory usage got smaller but if
                you have only 1/8th the processes and the same grid
                it should have gotten about 8 times bigger. Did you
                maybe cut the grid by a factor of 8 also? If so that
                still doesn't explain it because the memory usage
                changed by a factor of 5 something for the vectors
                and 3 something for the matrices.
                >>>>>>
                >>>>>>
                >>>>>> The linear solver and petsc options in the 2nd
                and 3rd tests are the same as in the 1st test. The linear
                solver works fine in both tests.
                >>>>>> I attached the memory usage of the 2nd and
                3rd tests. The memory info is from the option
                '-log_summary'. I tried to use '-memory_info' as you
                suggested, but in my case petsc treated it as an
                unused option. It output nothing about the memory.
                Do I need to add something to my code so I can use
                '-memory_info'?
                >>>>>>     Sorry, my mistake the option is -memory_view
                >>>>>>
                >>>>>>    Can you run the one case with -memory_view
                and -mg_coarse jacobi -ksp_max_it 1 (just so it
                doesn't iterate forever) to see how much memory is
                used without the telescope? Also run case 2 the same
                way.
                >>>>>>
                >>>>>>    Barry
                >>>>>>
                >>>>>>
                >>>>>>
                >>>>>> In both tests the memory usage is not large.
                >>>>>>
                >>>>>> It seems to me that it might be the
                'telescope' preconditioner that allocated a lot of
                memory and caused the error in the 1st test.
                >>>>>> Is there a way to show how much memory it
                allocated?
                >>>>>>
                >>>>>> Frank
                >>>>>>
                >>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
                >>>>>>    Frank,
                >>>>>>
                >>>>>>      You can run with -ksp_view_pre to have
                it "view" the KSP before the solve so hopefully it
                gets that far.
                >>>>>>
                >>>>>>       Please run the problem that does fit
                with -memory_info when the problem completes it will
                show the "high water mark" for PETSc allocated
                memory and total memory used. We first want to look
                at these numbers to see if it is using more memory
                than you expect. You could also run with say half
                the grid spacing to see how the memory usage scaled
                with the increase in grid points. Make the runs also
                with -log_view and send all the output from these
                options.
                >>>>>>
                >>>>>>     Barry
                >>>>>>
                >>>>>> On Jul 5, 2016, at 5:23 PM, frank
                <hengj...@uci.edu> wrote:
                >>>>>>
                >>>>>> Hi,
                >>>>>>
                >>>>>> I am using the CG ksp solver and Multigrid
                preconditioner  to solve a linear system in parallel.
                >>>>>> I chose to use the 'Telescope' as the
                preconditioner on the coarse mesh for its good
                performance.
                >>>>>> The petsc options file is attached.
                >>>>>>
                >>>>>> The domain is a 3d box.
                >>>>>> It works well when the grid is 1536*128*384
                and the process mesh is 96*8*24. When I double the
                size of the grid and keep
                the same process mesh and petsc options, I get an
                "out of memory" error from the super-cluster I am using.
                >>>>>> Each process has access to at least 8G of
                memory, which should be more than enough for my
                application. I am sure that all the other parts of
                my code (except the linear solver) do not use much
                memory. So I suspect that something is wrong with
                the linear solver.
                >>>>>> The error occurs before the linear system is
                completely solved so I don't have the info from ksp
                view. I am not able to re-produce the error with a
                smaller problem either.
                >>>>>> In addition, I tried to use the block jacobi
                as the preconditioner with the same grid and same
                decomposition. The linear solver runs extremely slow
                but there is no memory error.
                >>>>>>
                >>>>>> How can I diagnose what exactly causes the error?
                >>>>>> Thank you so much.
                >>>>>>
                >>>>>> Frank
                >>>>>> <petsc_options.txt>
                >>>>>>
                
<ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
                >>>>>>
                >>>>>
                >>>>
                >>>
                
<ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
                >





KSP Object: 32768 MPI processes
  type: cg
  maximum iterations=10000
  tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
  left preconditioning
  using nonzero initial guess
  using UNPRECONDITIONED norm type for convergence test
PC Object: 32768 MPI processes
  type: mg
    MG: type is MULTIPLICATIVE, levels=5 cycles=v
      Cycles per PCApply=1
      Using Galerkin computed coarse grid matrices
  Coarse grid solver -- level -------------------------------
    KSP Object: (mg_coarse_) 32768 MPI processes
      type: preonly
      maximum iterations=10000, initial guess is zero
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
      left preconditioning
      using NONE norm type for convergence test
    PC Object: (mg_coarse_) 32768 MPI processes
      type: telescope
        Telescope: parent comm size reduction factor = 64
        Telescope: comm_size = 32768 , subcomm_size = 512
        Telescope: subcomm type: interlaced
          Telescope: DMDA detected
        DMDA Object:    (mg_coarse_telescope_repart_)    512 MPI processes
          M 64 N 64 P 64 m 8 n 8 p 8 dof 1 overlap 1
        KSP Object: (mg_coarse_telescope_) 512 MPI processes
          type: preonly
          maximum iterations=10000, initial guess is zero
          tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
          left preconditioning
          using NONE norm type for convergence test
        PC Object: (mg_coarse_telescope_) 512 MPI processes
          type: mg
            MG: type is MULTIPLICATIVE, levels=3 cycles=v
              Cycles per PCApply=1
              Using Galerkin computed coarse grid matrices
          Coarse grid solver -- level -------------------------------
            KSP Object: (mg_coarse_telescope_mg_coarse_) 512 MPI processes
              type: preonly
              maximum iterations=10000, initial guess is zero
              tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
              left preconditioning
              using NONE norm type for convergence test
            PC Object: (mg_coarse_telescope_mg_coarse_) 512 MPI processes
              type: redundant
                Redundant preconditioner: First (color=0) of 512 PCs follows
              linear system matrix = precond matrix:
              Mat Object: 512 MPI processes
                type: mpiaij
                rows=4096, cols=4096
                total: nonzeros=110592, allocated nonzeros=110592
                total number of mallocs used during MatSetValues calls =0
                  using I-node (on process 0) routines: found 2 nodes, limit used is 5
          Down solver (pre-smoother) on level 1 -------------------------------
            KSP Object: (mg_coarse_telescope_mg_levels_1_) 512 MPI processes
              type: richardson
                Richardson: damping factor=1.
              maximum iterations=1
              tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
              left preconditioning
              using nonzero initial guess
              using NONE norm type for convergence test
            PC Object: (mg_coarse_telescope_mg_levels_1_) 512 MPI processes
              type: sor
                SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
              linear system matrix = precond matrix:
              Mat Object: 512 MPI processes
                type: mpiaij
                rows=32768, cols=32768
                total: nonzeros=884736, allocated nonzeros=884736
                total number of mallocs used during MatSetValues calls =0
                  not using I-node (on process 0) routines
          Up solver (post-smoother) same as down solver (pre-smoother)
          Down solver (pre-smoother) on level 2 -------------------------------
            KSP Object: (mg_coarse_telescope_mg_levels_2_) 512 MPI processes
              type: richardson
                Richardson: damping factor=1.
              maximum iterations=1
              tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
              left preconditioning
              using nonzero initial guess
              using NONE norm type for convergence test
            PC Object: (mg_coarse_telescope_mg_levels_2_) 512 MPI processes
              type: sor
                SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
              linear system matrix = precond matrix:
              Mat Object: 512 MPI processes
                type: mpiaij
                rows=262144, cols=262144
                total: nonzeros=7077888, allocated nonzeros=7077888
                total number of mallocs used during MatSetValues calls =0
                  not using I-node (on process 0) routines
          Up solver (post-smoother) same as down solver (pre-smoother)
          linear system matrix = precond matrix:
          Mat Object: 512 MPI processes
            type: mpiaij
            rows=262144, cols=262144
            total: nonzeros=7077888, allocated nonzeros=7077888
            total number of mallocs used during MatSetValues calls =0
              not using I-node (on process 0) routines
                      KSP Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
                        type: preonly
                        maximum iterations=10000, initial guess is zero
                        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
                        left preconditioning
                        using NONE norm type for convergence test
                      PC Object: (mg_coarse_telescope_mg_coarse_redundant_) 1 MPI processes
                        type: bjacobi
                          block Jacobi: number of blocks = 1
                          Local solve is same for all blocks, in the following KSP and PC objects:
                          KSP Object: (mg_coarse_telescope_mg_coarse_redundant_sub_) 1 MPI processes
                            type: preonly
                            maximum iterations=10000, initial guess is zero
                            tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
                            left preconditioning
                            using NONE norm type for convergence test
                          PC Object: (mg_coarse_telescope_mg_coarse_redundant_sub_) 1 MPI processes
                            type: ilu
                              ILU: out-of-place factorization
                              0 levels of fill
                              tolerance for zero pivot 2.22045e-14
                              matrix ordering: natural
                              factor fill ratio given 1., needed 1.
                                Factored matrix follows:
                                  Mat Object: 1 MPI processes
                                    type: seqaij
                                    rows=4096, cols=4096
                                    package used to perform factorization: petsc
                                    total: nonzeros=110592, allocated nonzeros=110592
                                    total number of mallocs used during MatSetValues calls =0
                                      not using I-node routines
                            linear system matrix = precond matrix:
                            Mat Object: 1 MPI processes
                              type: seqaij
                              rows=4096, cols=4096
                              total: nonzeros=110592, allocated nonzeros=110592
                              total number of mallocs used during MatSetValues calls =0
                                not using I-node routines
                        linear system matrix = precond matrix:
                        Mat Object: 1 MPI processes
                          type: seqaij
                          rows=4096, cols=4096
                          total: nonzeros=110592, allocated nonzeros=110592
                          total number of mallocs used during MatSetValues calls =0
                            not using I-node routines
      linear system matrix = precond matrix:
      Mat Object: 32768 MPI processes
        type: mpiaij
        rows=262144, cols=262144
        total: nonzeros=7077888, allocated nonzeros=7077888
        total number of mallocs used during MatSetValues calls =0
          using I-node (on process 0) routines: found 2 nodes, limit used is 5
  Down solver (pre-smoother) on level 1 -------------------------------
    KSP Object: (mg_levels_1_) 32768 MPI processes
      type: richardson
        Richardson: damping factor=1.
      maximum iterations=1
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
      left preconditioning
      using nonzero initial guess
      using NONE norm type for convergence test
    PC Object: (mg_levels_1_) 32768 MPI processes
      type: sor
        SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
      linear system matrix = precond matrix:
      Mat Object: 32768 MPI processes
        type: mpiaij
        rows=2097152, cols=2097152
        total: nonzeros=56623104, allocated nonzeros=56623104
        total number of mallocs used during MatSetValues calls =0
          not using I-node (on process 0) routines
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 2 -------------------------------
    KSP Object: (mg_levels_2_) 32768 MPI processes
      type: richardson
        Richardson: damping factor=1.
      maximum iterations=1
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
      left preconditioning
      using nonzero initial guess
      using NONE norm type for convergence test
    PC Object: (mg_levels_2_) 32768 MPI processes
      type: sor
        SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
      linear system matrix = precond matrix:
      Mat Object: 32768 MPI processes
        type: mpiaij
        rows=16777216, cols=16777216
        total: nonzeros=452984832, allocated nonzeros=452984832
        total number of mallocs used during MatSetValues calls =0
          not using I-node (on process 0) routines
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 3 -------------------------------
    KSP Object: (mg_levels_3_) 32768 MPI processes
      type: richardson
        Richardson: damping factor=1.
      maximum iterations=1
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
      left preconditioning
      using nonzero initial guess
      using NONE norm type for convergence test
    PC Object: (mg_levels_3_) 32768 MPI processes
      type: sor
        SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
      linear system matrix = precond matrix:
      Mat Object: 32768 MPI processes
        type: mpiaij
        rows=134217728, cols=134217728
        total: nonzeros=3623878656, allocated nonzeros=3623878656
        total number of mallocs used during MatSetValues calls =0
          not using I-node (on process 0) routines
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 4 -------------------------------
    KSP Object: (mg_levels_4_) 32768 MPI processes
      type: richardson
        Richardson: damping factor=1.
      maximum iterations=1
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
      left preconditioning
      using nonzero initial guess
      using NONE norm type for convergence test
    PC Object: (mg_levels_4_) 32768 MPI processes
      type: sor
        SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
      linear system matrix = precond matrix:
      Mat Object: 32768 MPI processes
        type: mpiaij
        rows=1073741824, cols=1073741824
        total: nonzeros=7516192768, allocated nonzeros=7516192768
        total number of mallocs used during MatSetValues calls =0
          has attached null space
  Up solver (post-smoother) same as down solver (pre-smoother)
  linear system matrix = precond matrix:
  Mat Object: 32768 MPI processes
    type: mpiaij
    rows=1073741824, cols=1073741824
    total: nonzeros=7516192768, allocated nonzeros=7516192768
    total number of mallocs used during MatSetValues calls =0
      has attached null space
32768 processors, by hengjie Fri Sep 16 04:29:10 2016
Using Petsc Development GIT revision: v3.7.3-1056-geeb1ceb  GIT Date: 2016-08-02 10:00:58 -0500

                         Max       Max/Min        Avg      Total 
Time (sec):           3.595e+01      1.00092   3.595e+01
Objects:              4.240e+02      1.61217   2.655e+02
Flops:                7.348e+07      1.09866   6.699e+07  2.195e+12
Flops/sec:            2.044e+06      1.09875   1.863e+06  6.106e+10
Memory:               1.110e+09      1.00000              3.636e+13
MPI Messages:         5.004e+04     11.27696   4.668e+03  1.530e+08
MPI Message Lengths:  4.805e+06      1.27794   8.088e+02  1.237e+11
MPI Reductions:       2.296e+03      1.48994

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 3.5947e+01 100.0%  2.1951e+12 100.0%  1.530e+08 100.0%  8.088e+02      100.0%  1.551e+03  67.5%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------


      ##########################################################
      #                                                        #
      #                          WARNING!!!                    #
      #                                                        #
      #   This code was compiled with a debugging option,      #
      #   To get timing results run ./configure                #
      #   using --with-debugging=no, the performance will      #
      #   be generally two or three times faster.              #
      #                                                        #
      ##########################################################


Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecTDot               30 1.0 1.9905e-01 1.4 1.97e+06 1.0 0.0e+00 0.0e+00 6.0e+01  1  3  0  0  3   1  3  0  0  4 323650
VecNorm               16 1.0 3.9425e-01 3.5 1.05e+06 1.0 0.0e+00 0.0e+00 3.2e+01  1  2  0  0  1   1  2  0  0  2 87152
VecScale              75 1.7 2.3286e-02 2.0 4.52e+04 1.3 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 50363
VecCopy               17 1.0 3.8621e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               442 1.7 9.8095e-03 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY               60 1.0 3.5868e-02 1.3 3.93e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  6  0  0  0   0  6  0  0  0 3592294
VecAYPX              119 1.3 1.7319e-02 1.3 1.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0 3728684
VecAssemblyBegin       1 1.0 1.0757e-01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd         1 1.0 2.7490e-04 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin      471 1.5 5.8588e-02 3.4 0.00e+00 0.0 1.2e+08 8.1e+02 0.0e+00  0  0 81 81  0   0  0 81 81  0     0
VecScatterEnd        471 1.5 1.2934e+00 6.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
MatMult              135 1.3 2.8880e-01 1.4 2.33e+07 1.0 5.0e+07 1.7e+03 0.0e+00  1 34 32 66  0   1 34 32 66  0 2597254
MatMultAdd            90 1.5 1.1149e-01 2.9 3.85e+06 1.0 1.4e+07 3.2e+02 0.0e+00  0  6  9  4  0   0  6  9  4  0 1114404
MatMultTranspose     111 1.4 3.0435e-01 1.3 4.11e+06 1.0 1.7e+07 2.8e+02 8.0e+01  1  6 11  4  3   1  6 11  4  5 435479
MatSolve              15 0.0 2.0206e-02 0.0 3.26e+06 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 82513
MatSOR               180 1.5 9.9816e-01 1.3 2.32e+07 1.0 3.9e+07 2.4e+02 1.2e+00  2 33 25  8  0   2 33 25  8  0 727846
MatLUFactorNum         1 0.0 2.4225e-02 0.0 1.60e+06 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 33762
MatILUFactorSym        1 0.0 2.5048e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatConvert             1 0.0 7.5793e-04 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatResidual           90 1.5 3.7126e-01 1.2 1.11e+07 1.0 4.2e+07 8.0e+02 6.0e+01  1 16 27 27  3   1 16 27 27  4 942007
MatAssemblyBegin      33 1.4 7.2762e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 4.8e+01  2  0  0  0  2   2  0  0  0  3     0
MatAssemblyEnd        33 1.4 1.4643e+00 1.1 0.00e+00 0.0 1.1e+07 1.2e+02 2.5e+02  4  0  7  1 11   4  0  7  1 16     0
MatGetRowIJ            1 0.0 1.5974e-05 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetSubMatrice       2 2.0 3.4627e-01 3.7 0.00e+00 0.0 1.6e+05 5.4e+02 6.1e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 0.0 1.9929e-03 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView               13 2.2 1.0639e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01  0  0  0  0  1   0  0  0  0  1     0
MatPtAP                7 1.4 5.2281e+00 1.0 5.25e+06 1.0 2.4e+07 8.8e+02 2.1e+02 14  8 15 17  9  14  8 15 17 14 31939
MatPtAPSymbolic        7 1.4 4.0818e+00 1.0 0.00e+00 0.0 1.4e+07 1.1e+03 7.5e+01 11  0  9 12  3  11  0  9 12  5     0
MatPtAPNumeric         7 1.4 1.1755e+00 1.0 5.25e+06 1.0 9.6e+06 5.7e+02 1.4e+02  3  8  6  4  6   3  8  6  4  9 142046
MatRedundantMat        1 0.0 1.3647e-02 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 7.8e-02  0  0  0  0  0   0  0  0  0  0     0
MatMPIConcateSeq       1 0.0 2.7197e-01 0.0 0.00e+00 0.0 2.7e+04 4.0e+01 6.1e-01  0  0  0  0  0   0  0  0  0  0     0
MatGetLocalMat         7 1.4 1.3259e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetBrAoCol          7 1.4 6.9566e-02 2.8 0.00e+00 0.0 1.1e+07 1.1e+03 0.0e+00  0  0  7 10  0   0  0  7 10  0     0
MatGetSymTrans        14 1.4 2.2139e-02 5.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
DMCoarsen              6 1.5 3.3237e-01 1.1 0.00e+00 0.0 1.6e+06 1.7e+02 2.1e+02  1  0  1  0  9   1  0  1  0 13     0
DMCreateInterp         6 1.5 7.6958e-01 1.1 2.57e+05 1.0 2.8e+06 1.6e+02 2.0e+02  2  0  2  0  9   2  0  2  0 13 10763
KSPSetUp              12 2.0 1.1138e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.5e+01  0  0  0  0  2   0  0  0  0  2     0
KSPSolve               1 1.0 1.2628e+01 1.0 7.35e+07 1.1 1.5e+08 8.0e+02 1.4e+03 35100 99 99 59  35100 99 99 87 173826
PCSetUp                3 3.0 9.2140e+00 1.1 7.10e+06 1.3 2.9e+07 7.4e+02 7.9e+02 23  8 19 18 34  23  8 19 18 51 19110
PCSetUpOnBlocks       15 0.0 2.8822e-02 0.0 1.60e+06 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 28377
PCApply               15 1.0 3.5384e+00 1.0 5.58e+07 1.1 1.2e+08 6.3e+02 3.7e+02 10 74 79 62 16  10 74 79 62 24 457052
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Vector   197            197      4396000     0.
      Vector Scatter    27             27       333392     0.
              Matrix    66             66     14132608     0.
   Matrix Null Space     1              1          592     0.
    Distributed Mesh     8              8        40832     0.
Star Forest Bipartite Graph    16             16        13568     0.
     Discrete System     8              8         7008     0.
           Index Set    60             60       341672     0.
   IS L to G Mapping     8              8       195776     0.
       Krylov Solver    12             12        14760     0.
     DMKSP interface     6              6         3888     0.
      Preconditioner    12             12        11928     0.
              Viewer     3              2         1664     0.
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
Average time for MPI_Barrier(): 0.000146198
Average time for zero size MPI_Send(): 3.66852e-06
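
A few of the headline figures in the output above can be cross-checked by hand. The short Python sketch below is not part of the attached log; it only copies numbers printed there and assumes that the aggregate "Flops/sec" figure is total flops divided by the maximum wall time, as the "Total Mflop/s" legend suggests. The variable names are made up for illustration.

```python
# Values copied from the -ksp_view / -log_view output above.
total_flops = 2.195e12        # "Flops" row, Total column
max_time_sec = 3.595e01       # "Time (sec)" row, Max column
rows = 1073741824             # finest-level matrix: rows
nonzeros = 7516192768         # finest-level matrix: nonzeros
max_mem_bytes = 1.110e09      # "Memory" row, Max column (per process)

# Aggregate flop rate, assuming (total flops) / (max time over all processes).
flops_per_sec = total_flops / max_time_sec

# Average nonzeros per row on the finest level; 7 would correspond to a
# 7-point stencil on a 3D structured grid.
nnz_per_row = nonzeros / rows

# Peak memory per process in GiB.
mem_gib = max_mem_bytes / 2**30

print(f"aggregate flop rate : {flops_per_sec:.3e} flop/s")
print(f"nonzeros per row    : {nnz_per_row:.1f}")
print(f"peak memory per rank: {mem_gib:.2f} GiB")
```

The computed flop rate agrees with the 6.106e+10 printed in the "Flops/sec" row, and the finest grid has exactly 1024^3 rows, matching the 1024 cube used in the test.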
