Another note: regarding BLAS optimization, Gotoh Kazushige-san always talks about "data transfer speed". The current bottleneck for BLAS implementations is the data transfer rate.
That's why Level 1 and Level 2 BLAS are not accelerated much. Level 3 BLAS (dgemm etc.) is very fast because much of the data can be reused. Naively, the theoretical peak performance of a Core2 Quad processor running at 2.4 GHz is 2.4*2*4 = 19.2 GFlops (2.4 GHz x 2 flops per cycle x 4 cores). Feeding one fresh 8-byte double per flop would then require 19.2*8 = 153.6 GBytes/sec, whereas DDR3 12800 memory transfers only 12.8 GBytes/sec, roughly a factor of 12 less. (Figures are from a quick Google search.)

Thanks
--
Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/
Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
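
For anyone who wants to replay the back-of-envelope numbers above, here is a minimal sketch. The hardware figures are the ones quoted in the message. The flops-per-byte estimates for daxpy and dgemm are an added illustration, not part of the original message. They assume idealized traffic: each operand moves between memory and the CPU exactly once, and dgemm is square with a perfect cache. The function names are made up for this example.

```python
# Figures quoted in the message (naive estimate, not measured values).
CLOCK_GHZ = 2.4          # Core2 Quad clock
FLOPS_PER_CYCLE = 2      # naive per-core assumption used in the message
CORES = 4
BYTES_PER_DOUBLE = 8
MEM_BW_GBYTES = 12.8     # DDR3 12800, as quoted

peak_gflops = CLOCK_GHZ * FLOPS_PER_CYCLE * CORES
# Bandwidth needed if every flop had to fetch one fresh double from memory.
required_gbytes = peak_gflops * BYTES_PER_DOUBLE
# How many flops each transferred double must serve to keep the CPU busy.
reuse_needed = required_gbytes / MEM_BW_GBYTES


def daxpy_flops_per_byte(n: int) -> float:
    """Level 1, y = a*x + y: 2n flops; read x, read y, write y (3n doubles)."""
    return (2 * n) / (3 * n * BYTES_PER_DOUBLE)


def dgemm_flops_per_byte(n: int) -> float:
    """Level 3, C = A*B + C (n x n): 2n^3 flops; read A, B, C and write C
    once each (4n^2 doubles), assuming an idealized cache."""
    return (2 * n**3) / (4 * n**2 * BYTES_PER_DOUBLE)


if __name__ == "__main__":
    print(f"peak: {peak_gflops:.1f} GFlops")
    print(f"required bandwidth: {required_gbytes:.1f} GBytes/sec")
    print(f"reuse factor needed: {reuse_needed:.1f}")
    for n in (100, 1000):
        print(f"n={n}: daxpy {daxpy_flops_per_byte(n):.3f} flops/byte, "
              f"dgemm {dgemm_flops_per_byte(n):.1f} flops/byte")
```

Under these assumptions the daxpy ratio stays fixed at 1/12 flop per byte no matter how large n is, while the dgemm ratio grows as n/16. That is one way to see why only Level 3 can hide the memory bottleneck.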
