Jed: 



I just implemented the basic framework of BSTRM and SBSTRM in PETSc. It works 
fairly well on IBM chips, since the IBM Power chip has a hardware component 
called the prefetch engine that handles multiple data prefetching streams. The 
following data shows some initial tests of SpMV on an IBM Power7 machine with 
one memory controller. You can get the "cfd.2.10" matrix from the PETSc group. 



The efficiency of the format depends on having enough cache, memory 
bandwidth, power bus rate, etc. We haven't tested it on many Intel and AMD 
chips yet, although we would like to if we can find more machines. I will add 
more functions when I have time. If you like, you can add more functions 
to it yourself and make it better. 



Thanks, 



Dahai 



MATRIX: cfd.2.10 with bs = 5 (10 runs with a warmed-up cache) 



MPI = 1 


--- dt1_BAIJ, dt2_BSTRM = 48726, 28774, R = 1.69 
--- dt1_SBAIJ, dt2_SBSTRM = 48726, 21365, R = 2.28 



MPI = 2 
--- dt1_BAIJ, dt2_BSTRM = 26877, 16321, R = 1.65 
--- dt1_SBAIJ, dt2_SBSTRM = 26877, 15032, R = 1.79 



MPI = 4 
--- dt1_BAIJ, dt2_BSTRM = 14978, 10631, R = 1.41 
--- dt1_SBAIJ, dt2_SBSTRM = 14978, 9109, R = 1.64 



MPI = 8 
--- dt1_BAIJ, dt2_BSTRM = 9071, 9738, R = 0.93 (not sure why; maybe it is 
because this P7 chip has only one memory controller) 
--- dt1_SBAIJ, dt2_SBSTRM = 9174, 6329, R = 1.45 
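
For anyone who wants to recheck the ratios, here is a minimal sketch in Python. It assumes R = dt1 / dt2 (existing format time divided by streamed format time), which matches every row above; the helper names speedup and report are made up for this example and are not part of PETSc. The timings are copied from the table as-is, in whatever unit the test reported.

```python
# Timings copied from the table above: (MPI ranks, format pair, dt1, dt2).
# The unit is whatever the original test reported; only the ratio matters here.
TIMINGS = [
    (1, "BAIJ/BSTRM", 48726, 28774),
    (1, "SBAIJ/SBSTRM", 48726, 21365),
    (2, "BAIJ/BSTRM", 26877, 16321),
    (2, "SBAIJ/SBSTRM", 26877, 15032),
    (4, "BAIJ/BSTRM", 14978, 10631),
    (4, "SBAIJ/SBSTRM", 14978, 9109),
    (8, "BAIJ/BSTRM", 9071, 9738),
    (8, "SBAIJ/SBSTRM", 9174, 6329),
]


def speedup(dt1, dt2):
    """Return R = dt1 / dt2 (assumed definition); R > 1 means dt2 was faster."""
    return dt1 / dt2


def report(rows):
    """Format one line per row with the ratio rounded to two decimals."""
    return [
        f"MPI = {ranks}  {pair:<12}  R = {speedup(dt1, dt2):.2f}"
        for ranks, pair, dt1, dt2 in rows
    ]


if __name__ == "__main__":
    print("\n".join(report(TIMINGS)))
```

The only row with R below 1 is BAIJ/BSTRM at MPI = 8, which is the case flagged above.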







----- Original Message -----


From: "Jed Brown" <[email protected]> 
To: "For users of the development version of PETSc" <petsc-dev at mcs.anl.gov> 
Cc: "Dahai Guo" <dhguo at ncsa.uiuc.edu> 
Sent: Monday, May 9, 2011 8:55:47 AM 
Subject: (S)BSTRM implementations for block sizes other than 4 and 5? 

I was curious to try a benchmark, but don't have a problem with these block 
sizes handy. Are other block sizes planned? Does someone have benchmarks 
against current (S)BAIJ implementations (with software prefetch)? I've seen the 
HPCA paper from Guo and Gropp, but I think that work was done before BAIJ had 
software prefetch, but also perhaps with a version of BSTRM that did not 
software prefetch, so I wonder how they compare now. Also, how is the 
performance for multiple processes per socket on Intel and AMD?