On 2009-08-13, Timothy Normand Miller wrote:
> There's something weird we encountered.  It seems that memcpy is
> really evil.  We expected PIO read performance to be bad.  It turned
> out to be far worse than we expected.  Analysis showed that within
> 512-byte regions, the order in which words are fetched is RANDOM,
> completely defeating our caching scheme.  We made the cache 4 times
> larger and dealt with that problem, but now what we find is that
> there's a surprising amount of idle time on the bus.  Just lots of
> dead cycles between transactions.  Interestingly, if we write our own
> loop that just reads 32-bit words one at a time, it doesn't affect
> performance, even though it's a hell of a lot less complicated than
> memcpy itself (which has to be to deal with byte alignment issues).
> There's nothing about the source code to memcpy that would give any
> indication as to why its read ordering is random.  Our best guess is
> that it just comes down to a consequence of using OOO processors,
> although the sequential 32-bit copy IS sequential.  We're thinking
> about writing an SSE-based copy routine.  None of the PC chipsets are
> smart enough to consolidate sequential PCI reads into bursts, but if
> they're not REALLY stupid, then an SSE load instruction might at least
> fetch four at a time.  That would increase the performance from bad to
> mediocre.  Mind you, when we add acceleration, this will only matter
> for getimage, and even then, only as long as we don't have DMA.
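
For reference, the word-at-a-time loop described above might look
roughly like the following.  This is only a sketch: the name
copy_words32 and the volatile qualifier are my assumptions, not taken
from your code.  volatile keeps the compiler from merging, dropping, or
reordering the 32-bit loads; it says nothing about what the CPU or the
chipset does with them afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): copy n_words 32-bit
 * words from a PIO-style source into ordinary memory, issuing one
 * aligned load per word in ascending address order. */
void copy_words32(uint32_t *dst, const volatile uint32_t *src,
                  size_t n_words)
{
    /* The loop assumes a 4-byte-aligned source, so it needs none of
     * memcpy's byte-alignment handling. */
    assert(((uintptr_t)src & 3u) == 0);

    for (size_t i = 0; i < n_words; i++) {
        /* Because src is volatile-qualified, the compiler must emit
         * exactly one 32-bit load here, in program order. */
        dst[i] = src[i];
    }
}
```

Copying one of the 512-byte regions would then be
copy_words32(buf, aperture, 128), where aperture stands for whatever
pointer the driver already has to the mapped device memory.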

You could try the memory-ordered operations from libatomic_ops.  I
think either using AO_load_read/AO_store_write (whichever is on the
hardware side) or calling AO_nop_full() at each block boundary should
do.  On x86(_64), AO_nop_full() seems to use mfence internally.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
