On 2009-08-13, Timothy Normand Miller wrote:
> There's something weird we encountered.  It seems that memcpy is
> really evil.  We expected PIO read performance to be bad.  It turned
> out to be far worse than we expected.  Analysis showed that within
> 512-byte regions, the order in which words are fetched is RANDOM,
> completely defeating our caching scheme.  We made the cache 4 times
> larger and dealt with that problem, but now what we find is that
> there's a surprising amount of idle time on the bus.  Just lots of
> dead cycles between transactions.  Interestingly, if we write our own
> loop that just reads 32-bit words one at a time, it doesn't affect
> performance, even though it's a hell of a lot less complicated than
> memcpy itself (which has to be, in order to deal with byte-alignment
> issues).
> There's nothing about the source code to memcpy that would give any
> indication as to why its read ordering is random.  Our best guess is
> that it's just a consequence of out-of-order (OOO) execution,
> although our sequential 32-bit copy IS sequential.  We're thinking
> about writing an SSE-based copy routine.  None of the PC chipsets are
> smart enough to consolidate sequential PCI reads into bursts, but if
> they're not REALLY stupid, then an SSE load instruction might at least
> fetch four at a time.  That would increase the performance from bad to
> mediocre.  Mind you, when we add acceleration, this will only matter
> for getimage, and even then, only as long as we don't have DMA.
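For what it's worth, the word-at-a-time loop you describe would look roughly like this (a sketch only; on real hardware `src` would point into the mapped BAR, and the names here are mine, not from your driver):

```c
#include <stdint.h>
#include <stddef.h>

/* Word-at-a-time copy that reads the source strictly in ascending
 * order.  The volatile qualifier keeps the compiler from reordering,
 * widening, or combining the 32-bit reads the way memcpy apparently
 * ends up doing over PIO. */
static void copy32_sequential(uint32_t *dst, const volatile uint32_t *src,
                              size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}
```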

You could try the memory ordered operations from libatomic_ops.  I think
either using AO_load_read/AO_store_write (whichever is on the hardware
side) or a call to AO_nop_full() at each block boundary should do.  On
x86(_64) it seems to use mfence internally.
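If pulling in libatomic_ops is too heavy, a C11 fence should behave the same way: atomic_thread_fence(memory_order_seq_cst) compiles to an mfence on x86(_64), just like AO_nop_full().  A sketch of the per-block-boundary idea (the 512-byte block size is taken from the region size mentioned above; everything else is my guess at the shape of the loop):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define BLOCK_WORDS 128  /* 512 bytes / 4 bytes per word */

/* Copy with a full fence at each block boundary, so all reads of one
 * block are globally ordered before any read of the next block.
 * AO_nop_full() from libatomic_ops would serve the same purpose; on
 * x86(_64) both boil down to an mfence. */
static void copy32_fenced(uint32_t *dst, const volatile uint32_t *src,
                          size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        dst[i] = src[i];
        if ((i + 1) % BLOCK_WORDS == 0)
            atomic_thread_fence(memory_order_seq_cst);
    }
}
```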
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
