On Fri, Dec 10, 2021 at 8:05 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
> Hi, I’m Rohan, a student working on compilation techniques for distributed
> tensor computations. I’m looking at using PETSc as a baseline for
> experiments I’m running, and want to understand if I’m using PETSc as it
> was intended to achieve high performance, and if the performance I’m seeing
> is expected. Currently, I’m just looking at SpMV operations.
>
> My experiments are run on the Lassen Supercomputer (
> https://hpc.llnl.gov/hardware/platforms/lassen). The system has 40 CPUs,
> 4 V100s and an Infiniband interconnect. A visualization of the architecture
> is here:
> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png.
>
> As of now, I’m trying to understand the single-node performance of PETSc,
> as the scaling performance onto multiple nodes appears to be as I expect.
> I’m using the arabic-2005 sparse matrix from the SuiteSparse matrix
> collection, detailed here: https://sparse.tamu.edu/LAW/arabic-2005. As a
> trusted baseline, I am comparing against SpMV code generated by the TACO
> compiler (
> http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races)
> .
>

I don't know what "No Races" means, but you should also verify the result of the SpMV.

> My experiments find that PETSc is roughly 4 times slower on a single
> thread and node than the kernel generated by TACO:
>
> PETSc: 1 Thread: 5694.72 ms, 1 Node 40 threads: 262.6 ms.
>
> TACO: 1 Thread: 1341 ms, 1 Node 40 threads: 86 ms.
>

You can think of PETSc's default CSR SpMV as the baseline; it is done in ~10 lines of code.

> My code using PETSc is here:
> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38
> .
>
> Runs from 1 thread and 1 node with -log_view are attached to the email.
> The command lines for each were as follows:
>
> 1 node 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>
> 1 node 40 threads: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>
> In addition to these benchmarking concerns, I wanted to share my
> experiences trying to load data from Matrix Market files into PETSc, which
> ended up being much more difficult than I anticipated. Essentially, trying
> to iterate through the Matrix Market files and using `write` to insert
> entries into a `Mat` was extremely slow. In order to get reasonable
> performance, I had to use an external utility to basically construct a CSR
> matrix, and then pass the arrays from the CSR matrix into
> `MatCreateSeqAIJWithArrays`. I couldn’t find any more guidance on PETSc
> forums or Google, so I wanted to know if this was the right way to go.
>
> Thanks,
>
> Rohan Yadav
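
For readers following the thread, here is a minimal C sketch of the two steps discussed above: building CSR arrays from coordinate (row, column, value) triples such as those read from a Matrix Market file, and a plain CSR SpMV loop of the kind the "~10 lines" remark refers to. This is an illustration under stated assumptions, not PETSc's actual implementation. The function names `coo_to_csr` and `csr_spmv` are hypothetical, indices are assumed to be 0-based (Matrix Market files are 1-based, so subtract 1 when reading), and plain `int`/`double` stand in for `PetscInt`/`PetscScalar`.

```c
#include <assert.h>
#include <stdlib.h>

/* Convert nnz coordinate triples (0-based indices) into CSR arrays.
 * row_ptr must hold nrows + 1 entries; col_idx and vals must hold nnz entries.
 * Entries within a row keep their input order: this sketch does not sort
 * column indices and does not merge duplicate entries. */
void coo_to_csr(int nrows, int nnz,
                const int *rows, const int *cols, const double *coo_vals,
                int *row_ptr, int *col_idx, double *vals)
{
    for (int i = 0; i <= nrows; i++) row_ptr[i] = 0;

    /* Count the entries in each row, stored one slot to the right. */
    for (int k = 0; k < nnz; k++) {
        assert(rows[k] >= 0 && rows[k] < nrows);
        row_ptr[rows[k] + 1]++;
    }

    /* A prefix sum turns the counts into row start offsets. */
    for (int i = 0; i < nrows; i++) row_ptr[i + 1] += row_ptr[i];

    /* Scatter each triple into its row using a per-row write cursor. */
    int *next = malloc(((size_t)nrows + 1) * sizeof *next);
    assert(next != NULL);
    for (int i = 0; i < nrows; i++) next[i] = row_ptr[i];
    for (int k = 0; k < nnz; k++) {
        int dst = next[rows[k]]++;
        col_idx[dst] = cols[k];
        vals[dst] = coo_vals[k];
    }
    free(next);
}

/* Compute y = A * x for a matrix A stored in CSR form. */
void csr_spmv(int nrows, const int *row_ptr, const int *col_idx,
              const double *vals, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)
            sum += vals[p] * x[col_idx[p]];
        y[i] = sum;
    }
}
```

The resulting `row_ptr`, `col_idx`, and `vals` arrays have the layout that `MatCreateSeqAIJWithArrays` takes as its `i`, `j`, and `a` arguments. Check the PETSc manual page for any further requirements (for example, on the ordering of column indices within a row) before passing them in. Running `csr_spmv` on the same arrays also gives a simple reference result against which the PETSc and TACO outputs can be verified.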