Another note:
regarding BLAS optimization, Gotoh Kazushige-san always talks about
"data transfer speed". The current bottleneck for BLAS implementations is the
data transfer rate from memory.

That's why Level 1 and Level 2 BLAS are not accelerated much. Level 3 BLAS
(dgemm etc.) is very fast because each piece of data can be reused many times.
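
To make the reuse argument concrete, here is a rough sketch (my own
illustration, not from Gotoh-san) that counts flops per double moved for one
typical routine at each level. The function name and the simple operation
counts are assumptions; real implementations differ in the details.

```python
def flops_per_double(n):
    """Approximate flops per double read or written, for problem size n.

    Assumed operation counts (textbook estimates, not measurements):
      daxpy: y = a*x + y   -> 2n flops,   read x, read y, write y
      dgemv: y = A*x + y   -> 2n^2 flops, read A, x, y, write y
      dgemm: C = A*B + C   -> 2n^3 flops, read A, B, C, write C
    """
    return {
        "daxpy": (2 * n) / (3 * n),
        "dgemv": (2 * n * n) / (n * n + 3 * n),
        "dgemm": (2 * n ** 3) / (4 * n * n),
    }


if __name__ == "__main__":
    for name, ratio in flops_per_double(1000).items():
        print(f"{name}: {ratio:.2f} flops per double moved")
```

For Level 1 and Level 2 the ratio stays below 2 however large n gets, while
for dgemm it grows linearly with n, so only Level 3 can hide slow memory.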

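Naively, the theoretical peak performance of a Core2 Quad processor running
at 2.4GHz is 2.4*2*4 = 19.2 GFlops (clock in GHz * 2 flops per cycle * 4 cores).
If every flop needed a fresh 8-byte double from memory, that would require
19.2*8 = 153.6 Gbytes/sec,
whereas DDR3 PC3-12800 memory transfers 12.8 Gbytes/sec (just a quick google).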
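
The same back-of-the-envelope arithmetic as a small script, in case someone
wants to plug in their own machine. The variable names are mine, and reading
the factor 2 as flops per cycle per core is an assumption; it only reproduces
the naive estimate above and is not a measurement.

```python
# Naive peak-flops versus memory-bandwidth estimate.
clock_ghz = 2.4         # Core2 Quad clock, from the note above
flops_per_cycle = 2     # assumption: the "2" in 2.4*2*4
cores = 4
bytes_per_double = 8    # one double-precision operand per flop
memory_gbytes_per_sec = 12.8  # DDR3 PC3-12800 peak transfer rate

peak_gflops = clock_ghz * flops_per_cycle * cores
required_gbytes_per_sec = peak_gflops * bytes_per_double
shortfall = required_gbytes_per_sec / memory_gbytes_per_sec

print(f"peak:      {peak_gflops:.1f} GFlops")
print(f"required:  {required_gbytes_per_sec:.1f} Gbytes/sec")
print(f"available: {memory_gbytes_per_sec:.1f} Gbytes/sec")
print(f"shortfall: about {shortfall:.0f}x")
```

So memory is about 12 times too slow to feed one new operand per flop, which
is why data has to be reused from cache to get anywhere near peak.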

Thanks
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ 
   Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
