Hi Barry,
Thanks for your comment.
I took max time as 2.868e+02 sec from the below table, as the total time taken
to solve the entire problem. Was I correct in my assumption?
Using this max time, I manually tried to calculate the individual event
percentage to see if it matched up to T%.
Max Max/Min Avg Total
Time (sec): 2.868e+02 1.000 2.868e+02
Objects: 3.800e+01 1.000 3.800e+01
Flop: 8.659e+11 1.004 8.642e+11 1.728e+12
Flop/sec: 3.019e+09 1.004 3.013e+09 6.026e+09
Memory: 1.764e+10 1.004 1.760e+10 3.521e+10
MPI Messages: 3.430e+02 1.000 3.430e+02 6.860e+02
MPI Message Lengths: 7.134e+08 1.000 2.080e+06 1.427e+09
MPI Reductions: 4.637e+03 1.000
Best,
Karthik.
From: Barry Smith <[email protected]>
Date: Friday, 1 October 2021 at 16:03
To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
<[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: [petsc-users] (percent time in this phase)
What is "max time"? It is best to gather timings with a stage
PetscLogStagePush() to get a separate subtable for exactly the part of the code
you want timing for. For example if you are studying linear solver time you
want only the solver part of the code in the stage, not the time to build the
matrix and right hand side.
It is very difficult to get really correct publishable reliable performance
numbers when comparing solvers with similar timings on parallel machines and
especially GPUs. Values can be very dependent on particular compilers used, the
specific hardware used, generation of memory used etc.
Barry
On Oct 1, 2021, at 8:51 AM, Karthikeyan Chockalingam - STFC UKRI
<[email protected]<mailto:[email protected]>>
wrote:
Hi Barry,
Yes, looks like it is computationally faster using GPUs. I used block jacobi as
the preconditioner.
I have attached the output data for cpu and gpu using -ksp_view. I am not sure;
what information I should be looking at using -ksp_view?
I have an outstanding question,
event time
T%
cal = (event/max)*100
max time
2.87E+02
KSPSolve
1.58E+02
53
55.2
MatMult
1.08E+01
4
3.76
PCApply
1.31E+02
37
45.6
VecNorm
6.23E+01
11
21.7
Matt couple of days back helped breakdown KSPSolve (53 %) ~ PCApply (37%) +
VecNorm (11%) + MatMul (4%)
However, when I calculate T% manually using max time, the numbers for PCApply
and VecNorm are way off as you can see from the above table.
As a result, the cumulative sum of event time don’t match up to KSPSolve. Can
you please let me know what I might be doing wrong?
I will be performing extensive benchmarking of various preconditioners and
comparing their performance on cpus and gpus, so this information is critical.
Many thanks!
Karthik.
From: Barry Smith <[email protected]<mailto:[email protected]>>
Date: Thursday, 30 September 2021 at 15:47
To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
<[email protected]<mailto:[email protected]>>
Cc: "[email protected]<mailto:[email protected]>"
<[email protected]<mailto:[email protected]>>
Subject: Re: [petsc-users] (percent time in this phase)
The MatSolve is no better on the GPUs then on the CPU; while other parts of
the computation seem to speed up nicely. What is the result of -ksp_view ? Are
you using ILU(0) as the preconditioner, this will not solve well on the GPU,
its solve is essentially sequential. You won't want to use ILU(0) in this way
on GPUs.
Barry
On Sep 30, 2021, at 9:41 AM, Karthikeyan Chockalingam - STFC UKRI
<[email protected]<mailto:[email protected]>>
wrote:
Based on your feedback from yesterday. I was trying to breakdown KSPSolve.
Please find the attached bar plot. The numbers are not adding up at least for
GPUs.
Your feedback from yesterday were based on T%.
I plotted the time spend on each event, hoping that the cumulative sum would
add up to KSPSolve time.
Kind regards,
Karthik.
From: Matthew Knepley <[email protected]<mailto:[email protected]>>
Date: Thursday, 30 September 2021 at 13:52
To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
<[email protected]<mailto:[email protected]>>
Cc: Barry Smith <[email protected]<mailto:[email protected]>>,
"[email protected]<mailto:[email protected]>"
<[email protected]<mailto:[email protected]>>
Subject: Re: [petsc-users] (percent time in this phase)
On Thu, Sep 30, 2021 at 8:50 AM Karthikeyan Chockalingam - STFC UKRI
<[email protected]<mailto:[email protected]>>
wrote:
When comparing the MatSolve data for
GPU
MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00
0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0
0.00e+00 100
MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0
0.00e+00 0
and CPU
MatSolve 352 1.0 1.3553e+02 1.0 1.02e+11 1.0 0.0e+00 0.0e+00
0.0e+00 35 34 0 0 0 35 34 0 0 0 4489
the time spent is almost the same for this preconditioner. Look like
MatCUSPARSSolAnl is called only twice (since I am running on two cores)
mpirun -n 2 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511
-dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi
-ksp_monitor
So would it be fair to assume MatCUSPARSSolAnl is not accounted for in
MatSolve and it is an exclusive event?
Looks like that.
Thanks
Matt
KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~
100 %
Best,
Karthik.
From: Matthew Knepley <[email protected]<mailto:[email protected]>>
Date: Wednesday, 29 September 2021 at 16:29
To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
<[email protected]<mailto:[email protected]>>
Cc: Barry Smith <[email protected]<mailto:[email protected]>>,
"[email protected]<mailto:[email protected]>"
<[email protected]<mailto:[email protected]>>
Subject: Re: [petsc-users] %T (percent time in this phase)
On Wed, Sep 29, 2021 at 10:18 AM Karthikeyan Chockalingam - STFC UKRI
<[email protected]<mailto:[email protected]>>
wrote:
Thank you!
Just to summarize
KSPSolve (53%) + PCSetup (16%) + DMCreateMat (23%) + MatCUSPARSSolAnl (9%) ~
100 %
You didn’t happen to mention how MatCUSPARSSolAnl is accounted for? Am I right
in accounting for it as above?
I am not sure.I thought it might be the GPU part of MatSolve(). I will have to
look in the code. I am not as familiar with the GPU part.
MatCUSPARSSolAnl 2 1.0 3.2338e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 9 0 0 0 0 9 0 0 0 0 0 0 0 0.00e+00 0
0.00e+00 0
Finally, I believe the vector events, VecNorn, VecTDot, VecAXPY, and VecAYPX
are mutually exclusive?
Yes.
Thanks,
Matt
Best,
Karthik.
From: Matthew Knepley <[email protected]<mailto:[email protected]>>
Date: Wednesday, 29 September 2021 at 11:58
To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
<[email protected]<mailto:[email protected]>>
Cc: Barry Smith <[email protected]<mailto:[email protected]>>,
"[email protected]<mailto:[email protected]>"
<[email protected]<mailto:[email protected]>>
Subject: Re: [petsc-users] %T (percent time in this phase)
On Wed, Sep 29, 2021 at 6:24 AM Karthikeyan Chockalingam - STFC UKRI
<[email protected]<mailto:[email protected]>>
wrote:
Thank you Mathew. Now, it is all making sense to me.
From data file ksp_ex45_N511_gpu_2.txt
KSPSolve (53%) + KSPSetup (0%) = PCSetup (16%) + PCApply (37%).
However, you said “So an iteration would mostly consist of MatMult + PCApply,
with some vector work”
1) You do one solve, but 2 KSPSetUp()s. You must be running on more than one
process and using Block-Jacobi . Half the time is spent in the solve (53%)
KSPSetUp 2 1.0 5.3149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
2.4e+01 0 0 0 0 1 0 0 0 0 1 0 0 0 0.00e+00 0
0.00e+00 0
KSPSolve 1 1.0 1.5837e+02 1.1 8.63e+11 1.0 6.8e+02 2.1e+06
4.4e+03 53100100100 95 53100100100 96 10881 11730 1022 6.40e+03 1021
8.17e-03 100
2) The preconditioner look like BJacobi-ILU. The setup time is 16%, which is
all setup of the individual blocks, and this is all used by the numerical ILU
factorization.
PCSetUp 2 1.0 4.9623e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 0 0 0 0 16
0 0 0 0 58 0 2 6.93e+03 0 0.00e+00 0 PCSetUpOnBlocks 1 1.0 4.9274e+01 1.3
1.45e+09 1.0 0.0e+00 0.0e+00 0.0e+00 15 0 0 0 0 15 0 0 0 0 59 0 2 6.93e+03 0
0.00e+00 0
MatLUFactorNum 1 1.0 4.6126e+01 1.3 1.45e+09 1.0 0.0e+00 0.0e+00
0.0e+00 14 0 0 0 0 14 0 0 0 0 63 0 2 6.93e+03 0
0.00e+00 0
MatILUFactorSym 1 1.0 2.5110e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 0
0.00e+00 0
3) The preconditioner application takes 37% of the time, which is all solving
the factors and recorded in MatSolve(). Matrix multiplication takes 4%.
PCApply 341 1.0 1.3068e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00 0.0e+00 37 34 0 0 0
37 34 0 0 0 4516 4523 1 5.34e+02 0 0.00e+00 100
MatSolve 341 1.0 1.3009e+02 1.6 2.96e+11 1.0 0.0e+00 0.0e+00
0.0e+00 36 34 0 0 0 36 34 0 0 0 4536 4538 1 5.34e+02 0
0.00e+00 100
MatMult 341 1.0 1.0774e+01 1.1 2.96e+11 1.0 6.9e+02 2.1e+06
2.0e+00 4 34100100 0 4 34100100 0 54801 66441 2 5.86e+03 0
0.00e+00 100
4) The significant vector time is all in norms (11%) since they are really slow
on the GPU.
VecNorm 342 1.0 6.2261e+01129.9 4.57e+10 1.0 0.0e+00 0.0e+00
6.8e+02 11 5 0 0 15 11 5 0 0 15 1466 196884 0 0.00e+00 342
2.74e-03 100
VecTDot 680 1.0 1.7107e+00 1.3 9.09e+10 1.0 0.0e+00 0.0e+00
1.4e+03 1 10 0 0 29 1 10 0 0 29 106079 133922 0 0.00e+00 680
5.44e-03 100
VecAXPY 681 1.0 3.2036e+00 1.7 9.10e+10 1.0 0.0e+00 0.0e+00
0.0e+00 1 11 0 0 0 1 11 0 0 0 56728 58367 682 5.34e+02 0
0.00e+00 100
VecAYPX 339 1.0 2.6502e+00 1.8 4.53e+10 1.0 0.0e+00 0.0e+00
0.0e+00 1 5 0 0 0 1 5 0 0 0 34136 34153 339 2.71e-03 0
0.00e+00 100
So the solve time is:
53% ~ 37% + 4% + 11%
and the setup time is about 16%. I was wrong about the SetUp time being
included, as it is outside the event:
https://gitlab.com/petsc/petsc/-/blob/main/src/ksp/ksp/interface/itfunc.c#L852
It looks like the remainder of the time (23%) is spent preallocating the matrix.
Thanks,
Matt
The MalMult event is 4 %. How does this event figure into the above equation;
if preconditioning (MatMult + PCApply) is included in KSPSolve?
Best,
Karthik.
From: Matthew Knepley <[email protected]<mailto:[email protected]>>
Date: Wednesday, 29 September 2021 at 10:58
To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
<[email protected]<mailto:[email protected]>>
Cc: Barry Smith <[email protected]<mailto:[email protected]>>,
"[email protected]<mailto:[email protected]>"
<[email protected]<mailto:[email protected]>>
Subject: Re: [petsc-users] %T (percent time in this phase)
On Wed, Sep 29, 2021 at 5:52 AM Karthikeyan Chockalingam - STFC UKRI
<[email protected]<mailto:[email protected]>>
wrote:
That was helpful. I would like to provide some additional details of my run on
cpus and gpus. Please find the following attachments:
1. graph.pdf a plot showing overall time and various petsc events.
2. ksp_ex45_N511_cpu_6.txt data file of the log_summary
3. ksp_ex45_N511_gpu_2.txt data file of the log_summary
I used the following petsc options for cpu
mpirun -n 6 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511
-dm_mat_type mpiaij -dm_vec_type mpi -ksp_type cg -pc_type bjacobi -ksp_monitor
and for gpus
mpirun -n 1 ./ex45 -log_summary -da_grid_x 511 -da_grid_y 511 -da_grid_z 511
-dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type cg -pc_type bjacobi
-ksp_monitor
to run the following problem
https://petsc.org/release/src/ksp/ksp/tutorials/ex45.c.html
From the above code, I see is there no individual function called KSPSetUp(),
so I gather KSPSetDM, KSPSetComputeInitialGuess, KSPSetComputeRHS,
kSPSetComputeOperators all are timed together as KSPSetUp. For this example, is
KSPSetUp time and KSPSolve time mutually exclusive?
No, KSPSetUp() will be contained in KSPSolve() if it is called automatically.
In your response you said that
“PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on
how much of the preconditioner construction can take place early, so depends
exactly on the preconditioner used.”
I don’t see a explicit call to PCSetUp() or PCApply() in ex45; so for this
particular preconditioner (bjacobi) how can I tell how they are timed?
They are all inside KSPSolve(). If you have a preconditioned linear solve, the
oreconditioning happens during the iteration. So an iteration would mostly
consist of MatMult + PCApply, with some vector work.
I am hoping to time KSP solving and preconditioning mutually exclusively.
I am not sure that concept makes sense here. See above.
Thanks,
Matt
Kind regards,
Karthik.
From: Barry Smith <[email protected]<mailto:[email protected]>>
Date: Tuesday, 28 September 2021 at 19:19
To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
<[email protected]<mailto:[email protected]>>
Cc: "[email protected]<mailto:[email protected]>"
<[email protected]<mailto:[email protected]>>
Subject: Re: [petsc-users] %T (percent time in this phase)
On Sep 28, 2021, at 12:11 PM, Karthikeyan Chockalingam - STFC UKRI
<[email protected]<mailto:[email protected]>>
wrote:
Thanks for Barry for your response.
I was just benchmarking the problem with various preconditioner on cpu and gpu.
I understand, it is not possible to get mutually exclusive timing.
However, can you tell if KSPSolve time includes both PCSetup and PCApply? And
if KSPSolve and KSPSetup are mutually exclusive? Likewise for PCSetUp and
PCApply.
If you do not call KSPSetUp() separately from KSPSolve() then its time is
included with KSPSolve().
PCSetUp() time may be in KSPSetUp() or it maybe in PCApply() it depends on
how much of the preconditioner construction can take place early, so depends
exactly on the preconditioner used.
So yes the answer is not totally satisfying. The one thing I would recommend
is to not call KSPSetUp() directly and then KSPSolve() will always include the
total time of the solve plus all setup time. PCApply will contain all the time
to apply the preconditioner but may also include some setup time.
Barry
Best,
Karthik.
From: Barry Smith <[email protected]<mailto:[email protected]>>
Date: Tuesday, 28 September 2021 at 16:56
To: "Chockalingam, Karthikeyan (STFC,DL,HC)"
<[email protected]<mailto:[email protected]>>
Cc: "[email protected]<mailto:[email protected]>"
<[email protected]<mailto:[email protected]>>
Subject: Re: [petsc-users] %T (percent time in this phase)
On Sep 28, 2021, at 10:55 AM, Karthikeyan Chockalingam - STFC UKRI
<[email protected]<mailto:[email protected]>>
wrote:
Hello,
I ran ex45 in the KPS tutorial, which is a 3D finite-difference Poisson
problem. I noticed from the output from using the flag -log_summary that for
various events their respective %T (percent time in this phase) do not add up
to 100 but rather exceeds 100. So, I gather there is some overlap among these
events. I am primarily looking at the events KSPSetUp, KSPSolve, PCSetUp and
PCSolve. Is it possible to get a mutually exclusive %T or Time for these
individual events? I have attached the log_summary output file from my run for
your reference.
For nested solvers it is tricky to get the times to be mutually exclusive
because some parts of the building of the preconditioner is for some
preconditioners delayed until the solve has started.
It looks like you are using the default preconditioner options which for this
example are taking more or less no time since so many iterations are needed. It
is best to use -pc_type mg to use geometric multigrid on this problem.
Barry
Thanks!
Karthik.
This email and any attachments are intended solely for the use of the named
recipients. If you are not the intended recipient you must not use, disclose,
copy or distribute this email or any of its attachments and should notify the
sender immediately and delete this email from your system. UK Research and
Innovation (UKRI) has taken every reasonable precaution to minimise risk of
this email or any attachments containing viruses or malware but the recipient
should carry out its own virus and malware checks before opening the
attachments. UKRI does not accept any liability for any losses or damages which
the recipient may sustain due to presence of any viruses.
<ksp_ex45_N511_cpu_6.txt>
--
What most experimenters take for granted before they begin their experiments is
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>
--
What most experimenters take for granted before they begin their experiments is
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>
--
What most experimenters take for granted before they begin their experiments is
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>
--
What most experimenters take for granted before they begin their experiments is
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>
<KSPSolve.pdf>
<ksp_ex45_N511_cpu_6.txt><ksp_ex45_N511_gpu_2.txt>