Don <> changed:

           What    |Removed                     |Added
                 CC|                            |

--- Comment #4 from Don <> 2011-04-28 21:36:38 PDT ---
Did you test this on Intel, or AMD? Blas1 code is generally limited by memory
access, and AMD has two load ports, so it has different bottlenecks and in
these operations always does better than Intel.

See also a discussion at:
(I've talked to Eric Bealto before, he's happy for Phobos to use any of his
stuff if we see anything we want). It's a bit misleading, though, because above
a certain length, you become dominated by cache effects, so I don't know if
unrolling by 4 is actually worthwhile in practice.

I also have some optimized x87 dot product code (AMD 32 bit CPUs don't have
SSE2, so they still need x87 for doubles).

Configure issuemail:
------- You are receiving this mail because: -------

Reply via email to