Don wrote:
The next D2 runtime will include my cache-size detection code. This
makes it possible to write a cache-aware memcpy, using (for example)
non-temporal writes when the arrays being copied exceed the size of the
largest cache.
In my tests, it gives a speed-up of approximately 2X in such cases.
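
For readers who want to see the idea concretely, here is a rough C sketch. It is not the D runtime code. The function name cache_aware_memcpy and the largest_cache_bytes parameter are made up for illustration, and the cache size is assumed to come from whatever detection code is available. On x86 with SSE2 it uses the _mm_stream_si128 intrinsic for the large case; everywhere else it simply calls memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#endif

/* Hypothetical helper: copies n bytes from src to dst. When n exceeds
   largest_cache_bytes (assumed to be supplied by cache-size detection),
   it writes with non-temporal stores so the destination does not evict
   useful data from the cache. The buffers must not overlap. */
void *cache_aware_memcpy(void *dst, const void *src, size_t n,
                         size_t largest_cache_bytes)
{
#ifdef HAVE_SSE2_STREAM
    if (n > largest_cache_bytes) {
        unsigned char *d = dst;
        const unsigned char *s = src;

        /* Copy leading bytes until the destination is 16-byte aligned,
           because _mm_stream_si128 requires an aligned address. */
        size_t head = (16 - ((uintptr_t)d & 15)) & 15;
        if (head > n)
            head = n;
        memcpy(d, s, head);
        d += head;
        s += head;
        n -= head;

        /* Main loop: unaligned load, non-temporal (streaming) store. */
        for (; n >= 16; n -= 16, d += 16, s += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)s);
            _mm_stream_si128((__m128i *)d, v);
        }

        /* Make the streaming stores visible before returning. */
        _mm_sfence();

        /* Copy whatever is left over (fewer than 16 bytes). */
        memcpy(d, s, n);
        return dst;
    }
#endif
    /* Small copies, or no SSE2: the ordinary memcpy is the right tool. */
    return memcpy(dst, src, n);
}
```

A real implementation would probably also prefetch the source and unroll the loop; this sketch only shows where the cache-size threshold comes in.
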
The downside is, it's a fair bit of work to implement, and it only
affects extremely large arrays, so I fear it's basically irrelevant (it
probably won't help arrays < 32K in size). Do people actually copy
megabyte-sized arrays?
Is it worth spending any more time on it?
BTW: I tested the memcpy() code provided in AMD's 1992 optimisation
manual, and in Intel's 2007 manual. Only one of them actually gave any
benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't
Intel!)
I've noticed that AMD's docs are usually greatly superior to Intel's, but
this time the difference is unbelievable.
What's the alternative? What would you do instead? Is there something
cooler or more important for D to do?
(IMHO, if the other alternatives have any merit, then I'd vote for them.)
But then again, you've already invested in this, and it clearly
interests you. Laborious, yes, but it sounds fun.