Thank you both for the very quick answer!
So far I have compiled PETSc with debugging turned on, but I think it should
still be faster than standard scipy in both cases. Actually, Stefano's
answer has got me very far already; I now only define the RHS of the ODE
and no Jacobian (I wonder why the documentation suggests otherwise,
though). I had the following four tries at implementing the RHS:
1. def rhsfunc1(ts, t, u, F):
       scale = 0.5 * (5 < t < 10)
       (l + scale * pump).mult(u, F)

2. def rhsfunc2(ts, t, u, F):
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       (scale * pump).multAdd(u, F, F)

3. def rhsfunc3(ts, t, u, F):
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       if scale != 0:
           pump.scale(scale)
           pump.multAdd(u, F, F)
           pump.scale(1/scale)

4. def rhsfunc4(ts, t, u, F):
       tmp_pump.zeroEntries()  # tmp_pump is pump.duplicate()
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       tmp_pump.axpy(scale, pump, structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
       tmp_pump.multAdd(u, F, F)
They all yield the same results, but at 50it/s, 800it/s, 2300it/s and
1900it/s, respectively, which is a huge performance boost (almost 7
times as fast as scipy, with PETSc debugging still turned on). As the
scale function will most likely be a Gaussian in the future, I think
that option 3 will become numerically unstable and I'll have to go
with option 4, which is already faster than I expected (see also the
work-vector sketch below). If you think it is possible to speed up the
RHS calculation even more, I'd be happy to hear your suggestions; the
-log_view output is attached to this message.
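For completeness, here is a minimal, untested sketch of a fifth variant that
avoids the temporary matrix entirely by keeping a work vector instead;
tmp_vec, t0 and sigma are hypothetical names (the vector would be allocated
once outside the RHS, and the Gaussian parameters are placeholders):

tmp_vec = pump.createVecRight()  # work vector, created once outside the RHS

def rhsfunc5(ts, t, u, F):
    l.mult(u, F)                                    # F = l @ u
    scale = 0.5 * np.exp(-((t - t0) / sigma) ** 2)  # placeholder Gaussian pulse
    pump.mult(u, tmp_vec)                           # tmp_vec = pump @ u
    F.axpy(scale, tmp_vec)                          # F += scale * tmp_vec

This trades the MatAXPY/MatZeroEntries work of option 4 for a second MatMult
plus a cheap VecAXPY, and it stays well defined even when scale is zero.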
One last point: if I didn't misunderstand the documentation at
https://petsc.org/release/manual/ts/#special-cases, should it maybe be
changed, given that no Jacobian is needed for an explicit solver?
Best regards
Niclas
On 09.08.23 17:51, Stefano Zampini wrote:
TSRK is an explicit solver. Unless you are changing the ts type from the
command line, the explicit Jacobian should not be needed. On top of
Barry's suggestion, I would suggest you write the explicit RHS
instead of assembling a throw-away matrix every time that function needs
to be sampled.
On Wed, Aug 9, 2023, 17:09 Niclas Götting
<ngoett...@itp.uni-bremen.de> wrote:
Hi all,
I'm currently trying to convert a quantum simulation from scipy to
PETSc. The problem itself is extremely simple and of the form
\dot{u}(t) = (A_const + f(t)*B_const)*u(t), where f(t) in this simple test
case is a square pulse (0.5 for 5 < t < 10 and 0 otherwise). The matrices
A_const and B_const are extremely sparse, and therefore I thought the
problem would be well suited for PETSc. Currently, I solve the ODE with
the following procedure in scipy (I can provide the necessary data files
if needed, but they are just some trace-preserving, very sparse matrices):
import numpy as np
import scipy.sparse
import scipy.integrate

from tqdm import tqdm

l = np.load("../liouvillian.npy")
pump = np.load("../pump_operator.npy")
state = np.load("../initial_state.npy")

l = scipy.sparse.csr_array(l)
pump = scipy.sparse.csr_array(pump)

def f(t, y, *args):
    return (l + 0.5 * (5 < t < 10) * pump) @ y
    #return l @ y  # Uncomment for f(t) = 0

dt = 0.1
NUM_STEPS = 200
res = np.empty((NUM_STEPS, 4096), dtype=np.complex128)
solver = scipy.integrate.ode(f).set_integrator("zvode").set_initial_value(state)
times = []
for i in tqdm(range(NUM_STEPS)):
    res[i, :] = solver.integrate(solver.t + dt)
    times.append(solver.t)
Here, A_const = l, B_const = pump and f(t) = 0.5 * (5 < t < 10). tqdm reports
about 330it/s on my machine. When converting the code to PETSc, I came
to the following result (following the chapter
https://petsc.org/main/manual/ts/#special-cases):
import sys
import petsc4py
petsc4py.init(args=sys.argv)

import numpy as np
import scipy.sparse

from tqdm import tqdm
from petsc4py import PETSc

comm = PETSc.COMM_WORLD

def mat_to_real(arr):
    return np.block([[arr.real, -arr.imag], [arr.imag, arr.real]]).astype(np.float64)
def mat_to_petsc_aij(arr):
    arr_sc_sp = scipy.sparse.csr_array(arr)
    mat = PETSc.Mat().createAIJ(arr.shape[0], comm=comm)
    rstart, rend = mat.getOwnershipRange()
    print(rstart, rend)
    print(arr.shape[0])
    print(mat.sizes)
    # Extract the local CSR block for rows [rstart, rend); the row pointer
    # array must be shifted so that the local indptr starts at zero.
    I = arr_sc_sp.indptr[rstart : rend + 1] - arr_sc_sp.indptr[rstart]
    J = arr_sc_sp.indices[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]
    V = arr_sc_sp.data[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]
    print(I.shape, J.shape, V.shape)
    mat.setValuesCSR(I, J, V)
    mat.assemble()
    return mat
l = np.load("../liouvillian.npy")
l = mat_to_real(l)
pump = np.load("../pump_operator.npy")
pump = mat_to_real(pump)
state = np.load("../initial_state.npy")
state = np.hstack([state.real, state.imag]).astype(np.float64)

l = mat_to_petsc_aij(l)
pump = mat_to_petsc_aij(pump)

jac = l.duplicate()
for i in range(8192):
    jac.setValue(i, i, 0)
jac.assemble()
jac += l

vec = l.createVecRight()
vec.setValues(np.arange(state.shape[0], dtype=np.int32), state)
vec.assemble()

dt = 0.1

ts = PETSc.TS().create(comm=comm)
ts.setFromOptions()
ts.setProblemType(ts.ProblemType.LINEAR)
ts.setEquationType(ts.EquationType.ODE_EXPLICIT)
ts.setType(ts.Type.RK)
ts.setRKType(ts.RKType.RK3BS)
ts.setTime(0)
print("KSP:", ts.getKSP().getType())
print("KSP PC:", ts.getKSP().getPC().getType())
print("SNES :", ts.getSNES().getType())

def jacobian(ts, t, u, Amat, Pmat):
    Amat.zeroEntries()
    Amat.aypx(1, l, structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)
    Amat.axpy(0.5 * (5 < t < 10), pump, structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)

ts.setRHSFunction(PETSc.TS.computeRHSFunctionLinear)
#ts.setRHSJacobian(PETSc.TS.computeRHSJacobianConstant, l, l)  # Uncomment for f(t) = 0
ts.setRHSJacobian(jacobian, jac)
NUM_STEPS = 200
res = np.empty((NUM_STEPS, 8192), dtype=np.float64)
times = []
rstart, rend = vec.getOwnershipRange()
for i in tqdm(range(NUM_STEPS)):
    ts.setMaxTime(ts.getTime() + dt)
    ts.solve(vec)
    res[i, rstart:rend] = vec.getArray()[:]
    times.append(ts.getTime())  # record the time after the step, as in the scipy loop
I decomposed the complex ODE into a larger real ODE, so that I can
easily switch to GPU computation later on. The solutions of both scripts
are practically identical, but PETSc runs about 3 times slower at
120it/s on my machine. I don't use MPI for PETSc yet.
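Concretely, the real decomposition performed by mat_to_real is: writing
u = x + i*y and M = M_r + i*M_i, the complex system \dot{u} = M*u becomes
the real system

\dot{x} = M_r*x - M_i*y
\dot{y} = M_i*x + M_r*y,

i.e. d/dt [x; y] = [[M_r, -M_i], [M_i, M_r]] [x; y], which doubles the
dimension from 4096 to 8192.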
I strongly suspect that the problem lies within the Jacobian definition,
as PETSc is about 3 times *faster* than scipy with f(t) = 0 and
therefore a constant Jacobian.
Thank you in advance.
All the best,
Niclas
------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option. #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
petsc_tests.py on a named faust6-arch with 1 processor, by niclas Thu Aug 10 11:31:26 2023
Using Petsc Release Version 3.19.3, Jun 29, 2023
Max Max/Min Avg Total
Time (sec): 5,205e+00 1,000 5,205e+00
Objects: 3,700e+01 1,000 3,700e+01
Flops: 2,108e+08 1,000 2,108e+08 2,108e+08
Flops/sec: 4,051e+07 1,000 4,051e+07 4,051e+07
Memory (bytes): 3,768e+06 1,000 3,768e+06 3,768e+06
MPI Msg Count: 0,000e+00 0,000 0,000e+00 0,000e+00
MPI Msg Len (bytes): 0,000e+00 0,000 0,000e+00 0,000e+00
MPI Reductions: 0,000e+00 0,000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total Count %Total Avg %Total Count %Total
0: Main Stage: 5,2048e+00 100,0% 2,1082e+08 100,0% 0,000e+00 0,0% 0,000e+00 0,0% 0,000e+00 0,0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
AvgLen: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flop --- Global --- --- Stage ---- Total
Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecCopy 1399 1,0 4,4269e-03 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
VecMAXPY 800 1,0 4,7529e-03 1,0 3,28e+07 1,0 0,0e+00 0,0e+00 0,0e+00 0 16 0 0 0 0 16 0 0 0 6894
VecAssemblyBegin 1 1,0 4,4800e-07 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 1 1,0 2,3400e-07 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMult 601 1,0 6,1720e-02 1,0 1,41e+08 1,0 0,0e+00 0,0e+00 0,0e+00 1 67 0 0 0 1 67 0 0 0 2287
MatMultAdd 601 1,0 1,5415e-02 1,0 2,95e+07 1,0 0,0e+00 0,0e+00 0,0e+00 0 14 0 0 0 0 14 0 0 0 1916
MatConvert 1 1,0 9,1752e-05 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 2 1,0 2,5590e-06 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 2 1,0 9,4703e-03 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 601 1,0 3,6496e-03 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 150 1,0 2,4584e-03 1,0 7,37e+06 1,0 0,0e+00 0,0e+00 0,0e+00 0 3 0 0 0 0 3 0 0 0 2999
TSStep 200 1,0 9,8012e-02 1,0 2,11e+08 1,0 0,0e+00 0,0e+00 0,0e+00 2 100 0 0 0 2 100 0 0 0 2151
TSFunctionEval 601 1,0 8,7199e-02 1,0 1,78e+08 1,0 0,0e+00 0,0e+00 0,0e+00 2 84 0 0 0 2 84 0 0 0 2042
--- Event Stage 1: Unknown
------------------------------------------------------------------------------------------------------------------------
Object Type Creations Destructions. Reports information only for process 0.
--- Event Stage 0: Main Stage
Container 3 3
Viewer 1 0
Star Forest Graph 4 4
Vector 10 10
Matrix 3 3
Preconditioner 1 1
Krylov Solver 1 1
SNES 1 1
DMSNES 3 3
SNESLineSearch 1 1
TSAdapt 1 1
TS 1 1
DMTS 1 1
Distributed Mesh 2 2
Discrete System 2 2
Weak Form 2 2
--- Event Stage 1: Unknown
========================================================================================================================
Average time to get PetscTime(): 3,46e-08
#PETSc Option Table entries:
-log_view # (source: command line)
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --prefix=/opt/petsc/linux-c-opt --with-shared-libraries=1 --with-petsc4py=1 --with-mpi-f90module-visibility=0 --with-cc=/usr/bin/mpicc --with-cxx=/usr/bin/mpicxx --with-fc=/usr/bin/mpifort --with-fftw=1 --with-hdf5=1 --with-hdf5-fortran-bindings=1 --with-metis=1 --with-suitesparse=1 --with-superlu-lib=-lsuperlu --with-superlu-include=/usr/include/superlu --COPTFLAGS=-O3 -march=native --CXXOPTFLAGS=-O3 -march=native --FOPTFLAGS=-O3 -march=native
-----------------------------------------
Libraries compiled on 2023-08-03 09:45:14 on reproducible
Machine characteristics: Linux-6.4.7-arch1-2-x86_64-with-glibc2.37
Using PETSc directory: /opt/petsc/linux-c-opt
Using PETSc arch:
-----------------------------------------
Using C compiler: /usr/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O3
Using Fortran compiler: /usr/bin/mpifort -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O3
-----------------------------------------
Using include paths: -I/opt/petsc/linux-c-opt/include -I/usr/include/superlu
-----------------------------------------
Using C linker: /usr/bin/mpicc
Using Fortran linker: /usr/bin/mpifort
Using libraries: -Wl,-rpath,/opt/petsc/linux-c-opt/lib -L/opt/petsc/linux-c-opt/lib -lpetsc -Wl,-rpath,/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1 -L/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1 -lspqr -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig -lsuperlu -lfftw3_mpi -lfftw3 -llapack -lblas -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lmetis -lm -lX11 -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath
-----------------------------------------