James Philbin wrote:
> OK, i've written a simple benchmark which implements an elementwise
> multiply (A=B*C) in three different ways (standard C, intrinsics, hand
> coded assembly). On the face of things the results seem to indicate
> that the vectorization works best on medium sized inputs. If people
> could post the results of running the benchmark on their machines
> (takes ~1min) along with the output of gcc --version and their chip
> model, that wd be v useful.
>
> It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench
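For readers without vec_bench.c at hand, here is a minimal sketch of what the first two variants (standard C and intrinsics) might look like. This is not the code from the benchmark: the names mul_simple, mul_intrin and methods_agree are made up for illustration, and the use of unaligned loads is an assumption chosen so the sketch works on any buffer. The inline assembly variant is left out.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_mul_ps, ... */
#endif

/* Plain C version: one multiply per loop iteration. */
static void mul_simple(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}

/* Intrinsics version: four floats per iteration when SSE is available.
 * Unaligned loads/stores are used (an assumption of this sketch), so the
 * buffers need no particular alignment. Without SSE the scalar loop below
 * handles everything. */
static void mul_intrin(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(a + i, _mm_mul_ps(vb, vc));
    }
#endif
    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; i++)
        a[i] = b[i] * c[i];
}

#define MAX_N 4096  /* hypothetical upper bound for the self-check buffers */

/* Returns 1 if both versions produce identical output for n elements,
 * 0 otherwise. The inputs are small integers, so every product is exact. */
static int methods_agree(size_t n)
{
    static float b[MAX_N], c[MAX_N], a1[MAX_N], a2[MAX_N];
    assert(n <= MAX_N);

    for (size_t i = 0; i < n; i++) {
        b[i] = (float)(i % 16);
        c[i] = (float)(i % 7 + 1);
    }
    mul_simple(a1, b, c, n);
    mul_intrin(a2, b, c, n);

    for (size_t i = 0; i < n; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```

Calling methods_agree with a size that is not a multiple of four (for example 1003) exercises both the vector loop and the scalar tail. A timing loop around each function would be needed to turn this into a benchmark.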

CPU: Intel(R) Core(TM)2 CPU T7400 @ 2.16GHz (MacBook, Intel Core 2 Duo)
gcc (GCC) 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)
(Ubuntu 7.10, Gutsy Gibbon)

$ ./vec_bench
Testing methods...
All OK

Problem size   Simple               Intrin               Inline
         100    0.0003ms (100.0%)    0.0002ms ( 68.3%)    0.0002ms ( 75.6%)
        1000    0.0023ms (100.0%)    0.0018ms ( 76.7%)    0.0020ms ( 87.1%)
       10000    0.0361ms (100.0%)    0.0193ms ( 53.4%)    0.0338ms ( 93.7%)
      100000    0.2839ms (100.0%)    0.1351ms ( 47.6%)    0.0937ms ( 33.0%)
     1000000    4.2108ms (100.0%)    4.1234ms ( 97.9%)    4.0886ms ( 97.1%)
    10000000   45.3192ms (100.0%)   45.5359ms (100.5%)   45.3466ms (100.1%)

Note that the results vary somewhat between runs. Here is a second run to
give an idea (look at Inline, size=10000, which goes from 93.7% to 69.6%):

$ ./vec_bench
Testing methods...
All OK

Problem size   Simple               Intrin               Inline
         100    0.0003ms (100.0%)    0.0002ms ( 69.5%)    0.0002ms ( 74.1%)
        1000    0.0024ms (100.0%)    0.0018ms ( 75.9%)    0.0020ms ( 86.4%)
       10000    0.0324ms (100.0%)    0.0186ms ( 57.3%)    0.0226ms ( 69.6%)
      100000    0.2840ms (100.0%)    0.1171ms ( 41.2%)    0.0939ms ( 33.1%)
     1000000    4.4034ms (100.0%)    4.3657ms ( 99.1%)    4.0465ms ( 91.9%)
    10000000   44.4854ms (100.0%)   43.9502ms ( 98.8%)   43.6824ms ( 98.2%)

HTH

Emanuele