Rohan,

     The flop rates for the sparse matrix-vector product are very low for an 
IBM Power 9. This is probably, at least partially, because the code is 
configured without any optimization flags. You should run ./configure with 
additional options such as COPTFLAGS="-O3" CXXOPTFLAGS="-O3" 
FOPTFLAGS="-O3", but please consult the IBM documentation to determine exactly 
what optimization flags to use for mpixlc and mpixlf.

    When running in parallel I would expect the "sweet spot" of optimal 
performance to be roughly around 20 MPI ranks since the memory bandwidth of the 
CPU will be saturated long before you reach 40 ranks. I would recommend running 
with 1, 2, 3, 4, ... ranks to determine the optimal number of ranks. Also 
please consult the documentation on the placement of the ranks into the cores 
of the CPU; it is crucial to get this right and likely the default is far from 
correct. Essentially you want each core used to be as far away from the other 
cores being used as possible to maximize the achievable memory bandwidth. So 
the first core should be on the first socket, the second core on the second 
socket, the third core back on the first socket far from the first core (that 
is, it should not share L1 or L2 cache with the first core), etc.
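
    As a rough illustration of the placement order described above, here is a 
minimal Python sketch. It assumes a hypothetical node whose cores are numbered 
contiguously within each socket and where nearby core numbers are the ones 
likely to share cache; the function names are made up for this example. The 
real core numbering and the launcher options needed to request such a binding 
are system specific and must come from the machine's documentation.

```python
def farthest_first(n):
    """Greedy order of 0..n-1: each new index is as far as possible
    from every index already chosen (ties go to the lowest index)."""
    if n <= 0:
        return []
    order = [0]
    remaining = list(range(1, n))
    while remaining:
        best = max(remaining,
                   key=lambda c: (min(abs(c - p) for p in order), -c))
        order.append(best)
        remaining.remove(best)
    return order


def placement(num_sockets, cores_per_socket, num_ranks):
    """Core id for each rank: alternate sockets round-robin and, within a
    socket, pick cores in farthest-first order.
    Assumes core id = socket * cores_per_socket + local index."""
    if num_ranks > num_sockets * cores_per_socket:
        raise ValueError("more ranks than cores")
    local = farthest_first(cores_per_socket)
    cores = []
    for rank in range(num_ranks):
        socket = rank % num_sockets
        slot = rank // num_sockets
        cores.append(socket * cores_per_socket + local[slot])
    return cores


if __name__ == "__main__":
    # Hypothetical 2-socket, 20-cores-per-socket node, first 8 ranks.
    print(placement(2, 20, 8))
```

With this ordering the first two ranks land on different sockets, and the 
third rank returns to the first socket on the core farthest from rank 0.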

   The arabic-2005 matrix is not at all representative of the types of 
matrices PETSc is designed to solve. It does not come from a PDE and does not 
have the stencil structure of a matrix that comes from a PDE. PETSc's 
performance on such a matrix will be much lower than its performance for PDE 
matrices since PETSc is not designed for this type of matrix. Depending on the 
goals of your work you may want to use different matrices that come from PDEs.
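
   To make "stencil structure" concrete, here is a small Python sketch that 
builds the CSR arrays of the standard 5-point Laplacian on a structured grid. 
It is only an illustrative example of a PDE matrix, not something taken from 
PETSc, and the function name is made up. The point is that every row has at 
most 5 nonzeros and every column index lies within nx of the diagonal, which 
is the regular, banded pattern that a web-graph matrix such as arabic-2005 
does not have.

```python
def laplacian_5pt_csr(nx, ny):
    """CSR arrays (row_ptr, col_idx, values) for the 5-point Laplacian
    on an nx-by-ny grid with natural (row-by-row) ordering."""
    row_ptr, col_idx, values = [0], [], []
    for j in range(ny):
        for i in range(nx):
            row = j * nx + i
            # Neighbors are added in increasing column order.
            if j > 0:
                col_idx.append(row - nx)
                values.append(-1.0)
            if i > 0:
                col_idx.append(row - 1)
                values.append(-1.0)
            col_idx.append(row)
            values.append(4.0)
            if i < nx - 1:
                col_idx.append(row + 1)
                values.append(-1.0)
            if j < ny - 1:
                col_idx.append(row + nx)
                values.append(-1.0)
            row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values


def row_lengths(row_ptr):
    """Number of stored nonzeros in each row."""
    return [row_ptr[r + 1] - row_ptr[r] for r in range(len(row_ptr) - 1)]


if __name__ == "__main__":
    row_ptr, col_idx, _ = laplacian_5pt_csr(4, 4)
    print("nonzeros per row:", row_lengths(row_ptr))
```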

  Regarding loading the matrix: yes, it is expected that one uses a custom 
stand-alone utility to read SuiteSparse-formatted matrices and convert them 
to the PETSc binary format; we do have a couple of examples of how such code 
can be written in src/mat/tutorials or tests.
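
  For orientation, here is a minimal stand-alone sketch in Python of what such 
a converter does. It is not the code from the PETSc source tree, and the 
function names are made up. It assumes a default PETSc build (32-bit PetscInt, 
real double-precision PetscScalar) and the binary Mat layout read by MatLoad: 
big-endian integers for the class id, rows, columns, and nonzero count, then 
the per-row nonzero counts, the column indices, and finally the values. It 
handles only "general" coordinate files with real or pattern entries, does not 
sum duplicates, and in pure Python would be far too slow for a matrix the size 
of arabic-2005; it is meant to show the layout only.

```python
import io
import struct

MAT_FILE_CLASSID = 1211216  # class id at the start of a PETSc binary Mat


def read_matrix_market(lines):
    """Parse 'coordinate general' Matrix Market text into 0-based triples.
    Pattern entries (no value column) are given the value 1.0."""
    body = [ln for ln in lines if ln.strip() and not ln.startswith("%")]
    nrows, ncols, nnz = (int(t) for t in body[0].split())
    entries = []
    for ln in body[1:1 + nnz]:
        parts = ln.split()
        value = float(parts[2]) if len(parts) > 2 else 1.0
        entries.append((int(parts[0]) - 1, int(parts[1]) - 1, value))
    return nrows, ncols, entries


def to_csr(nrows, entries):
    """Return per-row nonzero counts, column indices, and values,
    with entries sorted by row and then by column."""
    row_len = [0] * nrows
    for i, _, _ in entries:
        row_len[i] += 1
    ordered = sorted(entries)
    cols = [j for _, j, _ in ordered]
    vals = [v for _, _, v in ordered]
    return row_len, cols, vals


def write_petsc_binary(f, nrows, ncols, row_len, cols, vals):
    """Write the matrix to the binary file object f, big-endian."""
    nnz = len(cols)
    f.write(struct.pack(">4i", MAT_FILE_CLASSID, nrows, ncols, nnz))
    f.write(struct.pack(">%di" % nrows, *row_len))
    f.write(struct.pack(">%di" % nnz, *cols))
    f.write(struct.pack(">%dd" % nnz, *vals))


if __name__ == "__main__":
    text = ["%%MatrixMarket matrix coordinate real general",
            "2 2 3", "1 1 2.0", "2 2 4.0", "2 1 3.0"]
    n, m, entries = read_matrix_market(text)
    buf = io.BytesIO()
    write_petsc_binary(buf, n, m, *to_csr(n, entries))
    print(len(buf.getvalue()), "bytes written")
```

Writing to a real file instead of the in-memory buffer only requires opening 
it in binary mode, for example open(path, "wb"), where path is whatever output 
name you choose.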


 Barry


> On Dec 10, 2021, at 6:54 PM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
> 
> Hi, I’m Rohan, a student working on compilation techniques for distributed 
> tensor computations. I’m looking at using PETSc as a baseline for experiments 
> I’m running, and want to understand if I’m using PETSc as it was intended to 
> achieve high performance, and if the performance I’m seeing is expected. 
> Currently, I’m just looking at SpMV operations.
> 
> My experiments are run on the Lassen Supercomputer 
> (https://hpc.llnl.gov/hardware/platforms/lassen). The system has 40 CPUs, 4 
> V100s and an Infiniband interconnect. A visualization of the architecture is 
> here: 
> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png.
> 
> As of now, I’m trying to understand the single-node performance of PETSc, as 
> the scaling performance onto multiple nodes appears to be as I expect. I’m 
> using the arabic-2005 sparse matrix from the SuiteSparse matrix collection, 
> detailed here: https://sparse.tamu.edu/LAW/arabic-2005. As a trusted 
> baseline, I am comparing against SpMV code generated by the TACO compiler 
> (http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races).
> 
> My experiments find that PETSc is roughly 4 times slower on a single thread 
> and node than the kernel generated by TACO:
> 
> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
> 
> My code using PETSc is here: 
> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38.
> 
> Runs from 1 thread and 1 node with -log_view are attached to the email. The 
> command lines for each were as follows:
> 
> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20 -warmup 10 
> -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20 
> -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
> 
> 
> In addition to these benchmarking concerns, I wanted to share my experiences 
> trying to load data from Matrix Market files into PETSc, which ended up 
> being much more difficult than I anticipated. Essentially, trying to iterate 
> through the Matrix Market files and using `write` to insert entries into a 
> `Mat` was extremely slow. In order to get reasonable performance, I had to 
> use an external utility to basically construct a CSR matrix, and then pass 
> the arrays from the CSR Matrix into `MatCreateSeqAIJWithArrays`. I couldn’t 
> find any more guidance on PETSc forums or Google, so I wanted to know if this 
> was the right way to go.
> 
> Thanks,
> 
> Rohan Yadav
> <petsc-1-node-1-thread.txt><petsc-1-node-40-threads.txt>
