Hi Junchao,

I want to evaluate MatMult on GPU.  I took a 2M x 2M matrix and ran it with 6 MPI ranks and 6 GPUs.  It took about 0.9 seconds.

How many nonzeros per row? With 0.9 seconds you should either have many runs of MatMult, a fairly dense matrix, or a really slow MatMult kernel ;-)
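
For example, here is a minimal timing sketch (the 1D Laplacian and the repetition count are just placeholders for whatever matrix you actually care about; the GPU back end would be selected at run time, e.g. with -mat_type aijcusparse -vec_type cuda, and the timings inspected with -log_view):

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat           A;
  Vec           x, y;
  PetscInt      i, rstart, rend, N = 1000000, nreps = 100;
  PetscLogStage stage;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscOptionsGetInt(NULL, NULL, "-n", &N, NULL));

  /* Placeholder operator: a 1D Laplacian, assembled row by row. */
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N));
  PetscCall(MatSetFromOptions(A));
  PetscCall(MatSetUp(A));
  PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
  for (i = rstart; i < rend; i++) {
    if (i > 0)     PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
    if (i < N - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
    PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  PetscCall(MatCreateVecs(A, &x, &y));
  PetscCall(VecSet(x, 1.0));
  PetscCall(MatMult(A, x, y));  /* warm-up: the first call pays one-time setup costs */

  /* Time many repetitions in a dedicated log stage; look at the
     "MatMult benchmark" stage in the -log_view output. */
  PetscCall(PetscLogStageRegister("MatMult benchmark", &stage));
  PetscCall(PetscLogStagePush(stage));
  for (i = 0; i < nreps; i++) PetscCall(MatMult(A, x, y));
  PetscCall(PetscLogStagePop());

  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&y));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}

That way a single launch latency is amortized over many calls and -log_view reports a meaningful per-call time.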

A 2M-by-2M matrix for a 5-point stencil is probably still on the small side (I'm assuming that you run 2M-by-2M for *each* GPU), but it should suffice. Expect communication costs to be significant (i.e. the bookkeeping and data exchange between GPUs are on the order of the cost of running the MatMult kernel for the respective diagonal block).
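
As a rough, hedged estimate (every number below is an assumption, not a measurement: ~5 nonzeros per row, ~12 bytes of matrix data per nonzero, ~600 GB/s effective GPU memory bandwidth, ~10 GB/s effective link bandwidth, ~10 us per launch/sync/message on the halo path):

/* Rough cost split for one 2M-unknown 5-point-stencil block per GPU (assumed numbers). */
#include <math.h>
#include <stdio.h>

int main(void)
{
  double n_local       = 2.0e6;                  /* unknowns per GPU (assumed) */
  double nnz           = 5.0 * n_local;          /* 5-point stencil */
  double kernel_time   = nnz * 12.0 / 600.0e9;   /* bandwidth-bound diagonal block */
  double halo_values   = 4.0 * sqrt(n_local);    /* 2D decomposition, four edges */
  double halo_transfer = 8.0 * halo_values / 10.0e9;  /* ghost data over the link */
  double halo_latency  = 6.0 * 10.0e-6;          /* pack, D2H, MPI, H2D, off-diag kernel, sync */

  printf("diagonal-block kernel: %6.1f us\n", kernel_time * 1e6);
  printf("halo data transfer   : %6.1f us\n", halo_transfer * 1e6);
  printf("halo latency terms   : %6.1f us\n", halo_latency * 1e6);
  return 0;
}

With these assumed numbers the diagonal-block kernel is a few hundred microseconds, and the halo path is a non-negligible fraction of that, dominated by the fixed per-step latencies rather than the data volume.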


A kernel launch or a stream synchronization took about 10 us.  Compared with MatMult, they are tiny.  Does that mean we can ignore them?  What is a proper problem size for evaluating MatMult?  I heard it is a few thousand rows per MPI rank.  Why?

That would be a typical strong scaling limit for a CPU-based run on a well-tuned BlueGene-type system. With GPUs you will probably need at least 100k unknowns (or ~1M nonzeros) per rank in the strong scaling limit. Add a factor of ~10 to make latency costs small in comparison.
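
Back-of-envelope version of that rule of thumb (the byte count and bandwidth are assumptions; the 10 us is your measured launch/sync latency):

#include <stdio.h>

int main(void)
{
  double nnz_per_rank   = 1.0e6;     /* ~1M nonzeros per rank */
  double bytes_per_nnz  = 12.0;      /* 8-byte value + 4-byte column index (assumed CSR) */
  double bandwidth      = 600.0e9;   /* assumed effective GPU bandwidth in bytes/s */
  double launch_latency = 10.0e-6;   /* measured launch/sync latency in seconds */

  double kernel_time = nnz_per_rank * bytes_per_nnz / bandwidth;
  printf("bandwidth-bound MatMult time : %5.1f us\n", kernel_time * 1e6);
  printf("latency / kernel time        : %5.0f %%\n", 100.0 * launch_latency / kernel_time);
  printf("with ~10x more nonzeros      : %5.0f %%\n", 100.0 * launch_latency / (10.0 * kernel_time));
  return 0;
}

At ~1M nonzeros per rank the kernel time is only a couple of launch latencies, so the ~10 us overheads are still visible; ten times more work per rank pushes them down to a few percent.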

Best regards,
Karli
