On Sat, Mar 22, 2008 at 5:03 PM, James Philbin <[EMAIL PROTECTED]> wrote:
> OK, I've written a simple benchmark which implements an elementwise
> multiply (A = B*C) in three different ways (standard C, intrinsics, hand
> coded assembly). On the face of things the results seem to indicate
> that the vectorization works best on medium sized inputs. If people
> could post the results of running the benchmark on their machines
> (takes ~1 min) along with the output of gcc --version and their chip
> model, that would be very useful.
>
> It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench
>
> Here's two:
>
> CPU: Core Duo T2500 @ 2GHz
> gcc --version: gcc (GCC) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)
> Problem size    Simple              Intrin              Inline
>       100        0.0003ms (100.0%)   0.0002ms ( 67.7%)   0.0002ms ( 50.6%)
>      1000        0.0030ms (100.0%)   0.0021ms ( 69.2%)   0.0015ms ( 50.6%)
>     10000        0.0370ms (100.0%)   0.0267ms ( 72.0%)   0.0279ms ( 75.4%)
>    100000        0.2258ms (100.0%)   0.1469ms ( 65.0%)   0.1273ms ( 56.4%)
>   1000000        4.5690ms (100.0%)   4.4616ms ( 97.6%)   4.4185ms ( 96.7%)
>  10000000       47.0022ms (100.0%)  45.4100ms ( 96.6%)  44.4437ms ( 94.6%)
>
> CPU: Intel Xeon E5345 @ 2.33GHz
> gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
> Problem size    Simple              Intrin              Inline
>       100        0.0001ms (100.0%)   0.0001ms ( 69.2%)   0.0001ms ( 77.4%)
>      1000        0.0010ms (100.0%)   0.0008ms ( 78.1%)   0.0009ms ( 86.6%)
>     10000        0.0108ms (100.0%)   0.0088ms ( 81.2%)   0.0086ms ( 79.6%)
>    100000        0.1131ms (100.0%)   0.0897ms ( 79.3%)   0.0872ms ( 77.1%)
>   1000000        5.2103ms (100.0%)   3.9153ms ( 75.1%)   3.8328ms ( 73.6%)
>  10000000       54.1815ms (100.0%)  51.8286ms ( 95.7%)  51.4366ms ( 94.9%)

gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
CPU: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Problem size    Simple              Intrin              Inline
      100        0.0002ms (100.0%)   0.0001ms ( 68.7%)   0.0001ms ( 74.8%)
     1000        0.0015ms (100.0%)   0.0011ms ( 72.0%)   0.0012ms ( 80.4%)
    10000        0.0154ms (100.0%)   0.0111ms ( 72.1%)   0.0122ms ( 79.1%)
   100000        0.1081ms (100.0%)   0.0759ms ( 70.2%)   0.0811ms ( 75.0%)
  1000000        2.7778ms (100.0%)   2.8172ms (101.4%)   2.7929ms (100.5%)
 10000000       28.1577ms (100.0%)  28.7332ms (102.0%)  28.4669ms (101.1%)

It looks like memory access is the bottleneck; otherwise, running 4 floats
through in parallel should go a lot faster. I need to modify the program a
bit and see how it works for doubles.

Chuck
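vec_bench.c itself is not attached to this message, so the following is only
a rough sketch of what the "Simple" and "Intrin" variants of A = B*C being
compared might look like; the function names, the setup in main, and the
assumption of 16-byte-aligned buffers with a length divisible by 4 are
illustrative, and the real benchmark's timing loop is omitted. It builds with
the same command given in the thread (gcc -msse -O2).

/* Sketch of the two portable variants: plain C vs. SSE intrinsics.
 * Build: gcc -msse -O2 vec_mul_sketch.c -o vec_mul_sketch (hypothetical name). */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE: __m128, _mm_mul_ps, _mm_malloc, ... */

/* "Simple": plain C loop, one float at a time. */
void mul_simple(float *a, const float *b, const float *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}

/* "Intrin": SSE intrinsics, four floats per iteration.
 * Assumes 16-byte-aligned buffers and n divisible by 4. */
void mul_intrin(float *a, const float *b, const float *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 4) {
        __m128 vb = _mm_load_ps(b + i);
        __m128 vc = _mm_load_ps(c + i);
        _mm_store_ps(a + i, _mm_mul_ps(vb, vc));
    }
}

int main(void)
{
    size_t n = 1000000, i;
    /* _mm_malloc gives the 16-byte alignment _mm_load_ps requires. */
    float *a = _mm_malloc(n * sizeof *a, 16);
    float *b = _mm_malloc(n * sizeof *b, 16);
    float *c = _mm_malloc(n * sizeof *c, 16);

    for (i = 0; i < n; i++) { b[i] = (float)i; c[i] = 0.5f; }

    mul_simple(a, b, c, n);   /* the real benchmark times each variant */
    mul_intrin(a, b, c, n);
    printf("a[10] = %f\n", a[10]);

    _mm_free(a); _mm_free(b); _mm_free(c);
    return 0;
}

The "four floats per iteration" structure is why the large-input numbers above
are telling: once the arrays fall out of cache, both loops wait on the same
memory traffic, so the SIMD advantage largely disappears.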
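On the doubles question: double-precision SIMD uses SSE2 (header emmintrin.h,
compiled with -msse2), and each 128-bit register then holds only two doubles
instead of four floats, so the best-case speedup over the scalar loop is
halved before memory bandwidth even enters the picture. A hypothetical
double-precision counterpart of the intrinsics kernel above, with the same
alignment and divisibility assumptions, might look like:

#include <stddef.h>      /* size_t */
#include <emmintrin.h>   /* SSE2: __m128d, _mm_load_pd, _mm_mul_pd */

/* Two doubles per 128-bit register instead of four floats.
 * Assumes 16-byte-aligned buffers and n divisible by 2. */
void mul_intrin_double(double *a, const double *b, const double *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 2) {
        __m128d vb = _mm_load_pd(b + i);
        __m128d vc = _mm_load_pd(c + i);
        _mm_store_pd(a + i, _mm_mul_pd(vb, vc));
    }
}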