Thank you both for the very quick answer!
So far I have compiled PETSc with debugging turned on, but I think it should
still be faster than standard scipy in both cases. Actually, Stefano's
answer has got me very far already; I now only define the RHS of the ODE
and no Jacobian (I wonder why the documentation suggests otherwise,
though). I had the following four tries at implementing the RHS:
1. def rhsfunc1(ts, t, u, F):
       scale = 0.5 * (5 < t < 10)
       (l + scale * pump).mult(u, F)

2. def rhsfunc2(ts, t, u, F):
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       (scale * pump).multAdd(u, F, F)

3. def rhsfunc3(ts, t, u, F):
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       if scale != 0:
           pump.scale(scale)
           pump.multAdd(u, F, F)
           pump.scale(1/scale)

4. def rhsfunc4(ts, t, u, F):
       tmp_pump.zeroEntries()  # tmp_pump is pump.duplicate()
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       tmp_pump.axpy(scale, pump, structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
       tmp_pump.multAdd(u, F, F)
They all yield the same results, but at 50it/s, 800it/s, 2300it/s and
1900it/s, respectively, which is a huge performance boost (almost 7
times as fast as scipy, with PETSc debugging still turned on). As the
scale function will most likely be a Gaussian in the future, I think
that option 3 will become numerically unstable and I'll have to go
with option 4, which is already faster than I expected (see also the
work-vector sketch below). If you think it is possible to speed up the
RHS calculation even more, I'd be happy to hear your suggestions; the
-log_view output is attached to this message.
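For completeness, here is a minimal, untested sketch of a fifth variant that
avoids the temporary matrix entirely by keeping a work vector instead;
tmp_vec, t0 and sigma are hypothetical names (the vector would be allocated
once outside the RHS, and the Gaussian parameters are placeholders):

tmp_vec = pump.createVecRight()  # work vector, created once outside the RHS

def rhsfunc5(ts, t, u, F):
    l.mult(u, F)                                    # F = l @ u
    scale = 0.5 * np.exp(-((t - t0) / sigma) ** 2)  # placeholder Gaussian pulse
    pump.mult(u, tmp_vec)                           # tmp_vec = pump @ u
    F.axpy(scale, tmp_vec)                          # F += scale * tmp_vec

This trades the MatAXPY/MatZeroEntries work of option 4 for a second MatMult
plus a cheap VecAXPY, and it stays well defined even when scale is zero.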
One last point: if I didn't misunderstand the documentation at
https://petsc.org/release/manual/ts/#special-cases, should it maybe be
changed, given that no Jacobian is needed for an explicit solver?
Best regards
Niclas
On 09.08.23 17:51, Stefano Zampini wrote:
TSRK is an explicit solver. Unless you are changing the ts type from the
command line, the explicit Jacobian should not be needed. On top of
Barry's suggestion, I would suggest you write the explicit RHS
instead of assembling a throw-away matrix every time that function needs
to be sampled.
On Wed, Aug 9, 2023, 17:09 Niclas Götting
<ngoett...@itp.uni-bremen.de> wrote:
Hi all,
I'm currently trying to convert a quantum simulation from scipy to
PETSc. The problem itself is extremely simple and of the form
\dot{u}(t) = (A_const + f(t)*B_const)*u(t), where f(t) in this simple test
case is a square pulse (0.5 for 5 < t < 10 and 0 otherwise). The matrices
A_const and B_const are extremely sparse, and therefore I thought the
problem would be well suited for PETSc. Currently, I solve the ODE with
the following procedure in scipy (I can provide the necessary data files
if needed, but they are just some trace-preserving, very sparse matrices):
import numpy as np
import scipy.sparse
import scipy.integrate

from tqdm import tqdm

l = np.load("../liouvillian.npy")
pump = np.load("../pump_operator.npy")
state = np.load("../initial_state.npy")

l = scipy.sparse.csr_array(l)
pump = scipy.sparse.csr_array(pump)

def f(t, y, *args):
    return (l + 0.5 * (5 < t < 10) * pump) @ y
    #return l @ y  # Uncomment for f(t) = 0

dt = 0.1
NUM_STEPS = 200
res = np.empty((NUM_STEPS, 4096), dtype=np.complex128)
solver = scipy.integrate.ode(f).set_integrator("zvode").set_initial_value(state)
times = []
for i in tqdm(range(NUM_STEPS)):
    res[i, :] = solver.integrate(solver.t + dt)
    times.append(solver.t)
Here, A_const = l, B_const = pump and f(t) = 0.5 * (5 < t < 10). tqdm reports
about 330it/s on my machine. When converting the code to PETSc, I came
to the following result (following the chapter
https://petsc.org/main/manual/ts/#special-cases):
import sys
import petsc4py
petsc4py.init(args=sys.argv)

import numpy as np
import scipy.sparse

from tqdm import tqdm
from petsc4py import PETSc

comm = PETSc.COMM_WORLD

def mat_to_real(arr):
    return np.block([[arr.real, -arr.imag], [arr.imag, arr.real]]).astype(np.float64)
def mat_to_petsc_aij(arr):
    arr_sc_sp = scipy.sparse.csr_array(arr)
    mat = PETSc.Mat().createAIJ(arr.shape[0], comm=comm)
    rstart, rend = mat.getOwnershipRange()
    print(rstart, rend)
    print(arr.shape[0])
    print(mat.sizes)
    # Extract the local CSR block for rows [rstart, rend); the row pointer
    # array must be shifted so that the local indptr starts at zero.
    I = arr_sc_sp.indptr[rstart : rend + 1] - arr_sc_sp.indptr[rstart]
    J = arr_sc_sp.indices[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]
    V = arr_sc_sp.data[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]
    print(I.shape, J.shape, V.shape)
    mat.setValuesCSR(I, J, V)
    mat.assemble()
    return mat
l = np.load("../liouvillian.npy")
l = mat_to_real(l)
pump = np.load("../pump_operator.npy")
pump = mat_to_real(pump)
state = np.load("../initial_state.npy")
state = np.hstack([state.real, state.imag]).astype(np.float64)

l = mat_to_petsc_aij(l)
pump = mat_to_petsc_aij(pump)

jac = l.duplicate()
for i in range(8192):
    jac.setValue(i, i, 0)
jac.assemble()
jac += l

vec = l.createVecRight()
vec.setValues(np.arange(state.shape[0], dtype=np.int32), state)
vec.assemble()

dt = 0.1

ts = PETSc.TS().create(comm=comm)
ts.setFromOptions()
ts.setProblemType(ts.ProblemType.LINEAR)
ts.setEquationType(ts.EquationType.ODE_EXPLICIT)
ts.setType(ts.Type.RK)
ts.setRKType(ts.RKType.RK3BS)
ts.setTime(0)
print("KSP:", ts.getKSP().getType())
print("KSP PC:", ts.getKSP().getPC().getType())
print("SNES :", ts.getSNES().getType())

def jacobian(ts, t, u, Amat, Pmat):
    Amat.zeroEntries()
    Amat.aypx(1, l, structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)
    Amat.axpy(0.5 * (5 < t < 10), pump, structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)

ts.setRHSFunction(PETSc.TS.computeRHSFunctionLinear)
#ts.setRHSJacobian(PETSc.TS.computeRHSJacobianConstant, l, l)  # Uncomment for f(t) = 0
ts.setRHSJacobian(jacobian, jac)
NUM_STEPS = 200
res = np.empty((NUM_STEPS, 8192), dtype=np.float64)
times = []
rstart, rend = vec.getOwnershipRange()
for i in tqdm(range(NUM_STEPS)):
    ts.setMaxTime(ts.getTime() + dt)
    ts.solve(vec)
    res[i, rstart:rend] = vec.getArray()[:]
    times.append(ts.getTime())  # record the time after the step, as in the scipy loop
I decomposed the complex ODE into a larger real ODE, so that I can
easily switch to GPU computation later on. The solutions of both scripts
are practically identical, but PETSc runs about 3 times slower at
120it/s on my machine. I don't use MPI for PETSc yet.
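Concretely, the real decomposition performed by mat_to_real is: writing
u = x + i*y and M = M_r + i*M_i, the complex system \dot{u} = M*u becomes
the real system

\dot{x} = M_r*x - M_i*y
\dot{y} = M_i*x + M_r*y,

i.e. d/dt [x; y] = [[M_r, -M_i], [M_i, M_r]] [x; y], which doubles the
dimension from 4096 to 8192.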
I strongly suspect that the problem lies within the Jacobian definition,
as PETSc is about 3 times *faster* than scipy with f(t) = 0 and
therefore a constant Jacobian.
Thank you in advance.
All the best,
Niclas
------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option. #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
petsc_tests.py on a named faust6-arch with 1 processor, by niclas Thu Aug 10 11:31:26 2023
Using Petsc Release Version 3.19.3, Jun 29, 2023
Max Max/Min Avg Total
Time (sec): 5,205e+00 1,000 5,205e+00
Objects: 3,700e+01 1,000 3,700e+01
Flops: 2,108e+08 1,000 2,108e+08 2,108e+08
Flops/sec: 4,051e+07 1,000 4,051e+07 4,051e+07
Memory (bytes): 3,768e+06 1,000 3,768e+06 3,768e+06
MPI Msg Count: 0,000e+00 0,000 0,000e+00 0,000e+00
MPI Msg Len (bytes): 0,000e+00 0,000 0,000e+00 0,000e+00
MPI Reductions: 0,000e+00 0,000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops
and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total Count %Total Avg %Total Count %Total
0: Main Stage: 5,2048e+00 100,0% 2,1082e+08 100,0% 0,000e+00 0,0% 0,000e+00 0,0% 0,000e+00 0,0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
AvgLen: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flop --- Global --- --- Stage ---- Total
Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecCopy 1399 1,0 4,4269e-03 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
VecMAXPY 800 1,0 4,7529e-03 1,0 3,28e+07 1,0 0,0e+00 0,0e+00 0,0e+00 0 16 0 0 0 0 16 0 0 0 6894
VecAssemblyBegin 1 1,0 4,4800e-07 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 1 1,0 2,3400e-07 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMult 601 1,0 6,1720e-02 1,0 1,41e+08 1,0 0,0e+00 0,0e+00 0,0e+00 1 67 0 0 0 1 67 0 0 0 2287
MatMultAdd 601 1,0 1,5415e-02 1,0 2,95e+07 1,0 0,0e+00 0,0e+00 0,0e+00 0 14 0 0 0 0 14 0 0 0 1916
MatConvert 1 1,0 9,1752e-05 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 2 1,0 2,5590e-06 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 2 1,0 9,4703e-03 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 601 1,0 3,6496e-03 1,0 0,00e+00 0,0 0,0e+00 0,0e+00 0,0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAXPY 150 1,0 2,4584e-03 1,0 7,37e+06 1,0 0,0e+00 0,0e+00 0,0e+00 0 3 0 0 0 0 3 0 0 0 2999
TSStep 200 1,0 9,8012e-02 1,0 2,11e+08 1,0 0,0e+00 0,0e+00 0,0e+00 2 100 0 0 0 2 100 0 0 0 2151
TSFunctionEval 601 1,0 8,7199e-02 1,0 1,78e+08 1,0 0,0e+00 0,0e+00 0,0e+00 2 84 0 0 0 2 84 0 0 0 2042
--- Event Stage 1: Unknown
------------------------------------------------------------------------------------------------------------------------
Object Type Creations Destructions. Reports information only for process 0.
--- Event Stage 0: Main Stage
Container 3 3
Viewer 1 0
Star Forest Graph 4 4
Vector 10 10
Matrix 3 3
Preconditioner 1 1
Krylov Solver 1 1
SNES 1 1
DMSNES 3 3
SNESLineSearch 1 1
TSAdapt 1 1
TS 1 1
DMTS 1 1
Distributed Mesh 2 2
Discrete System 2 2
Weak Form 2 2
--- Event Stage 1: Unknown
========================================================================================================================
Average time to get PetscTime(): 3,46e-08
#PETSc Option Table entries:
-log_view # (source: command line)
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --prefix=/opt/petsc/linux-c-opt --with-shared-libraries=1 --with-petsc4py=1 --with-mpi-f90module-visibility=0 --with-cc=/usr/bin/mpicc --with-cxx=/usr/bin/mpicxx --with-fc=/usr/bin/mpifort --with-fftw=1 --with-hdf5=1 --with-hdf5-fortran-bindings=1 --with-metis=1 --with-suitesparse=1 --with-superlu-lib=-lsuperlu --with-superlu-include=/usr/include/superlu --COPTFLAGS=-O3 -march=native --CXXOPTFLAGS=-O3 -march=native --FOPTFLAGS=-O3 -march=native
-----------------------------------------
Libraries compiled on 2023-08-03 09:45:14 on reproducible
Machine characteristics: Linux-6.4.7-arch1-2-x86_64-with-glibc2.37
Using PETSc directory: /opt/petsc/linux-c-opt
Using PETSc arch:
-----------------------------------------
Using C compiler: /usr/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O3
Using Fortran compiler: /usr/bin/mpifort -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O3
-----------------------------------------
Using include paths: -I/opt/petsc/linux-c-opt/include -I/usr/include/superlu
-----------------------------------------
Using C linker: /usr/bin/mpicc
Using Fortran linker: /usr/bin/mpifort
Using libraries: -Wl,-rpath,/opt/petsc/linux-c-opt/lib -L/opt/petsc/linux-c-opt/lib -lpetsc -Wl,-rpath,/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1 -L/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1 -lspqr -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig -lsuperlu -lfftw3_mpi -lfftw3 -llapack -lblas -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lmetis -lm -lX11 -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath
-----------------------------------------