On 16.07.12 09:22, Skybuck Flying wrote:
> I also wonder how much of an optimization it actually is? Maybe
> 0.000001% more performance?
Cache-related optimizations are VERY hard to measure and depend on the overall context and the architecture in use. But as the L1 cache is one of the most performance-critical parts of today's CPUs, the gains from working with cache-friendly structures should not be underestimated. There are a couple of things that need to be taken into account.

1.) Cacheline utilization: Packing multiple smaller items together into single (machine) words allows for better utilization of precious cache space. As the L1 D-cache is usually only 16-32 KB these days, every byte counts: saving a single byte can make the difference between a structure using one cache line or two, which in turn saves memory bandwidth and can save you from a cache-miss-related stall down the line. (A small sketch follows at the end of this mail.)

2.) Cacheline streaming: Unless your memory bus is as wide as your cache line, it takes multiple cycles to fetch a whole cache line, and the pipeline has to stall until the data in question arrives. If the relevant data sits at the end of a cache line, you have to wait for the whole transaction to complete. So it makes sense to order fields by the order in which they are accessed, so that your first miss will hopefully lead to only the minimal stall time. (See the second sketch at the end.)

While modern CPUs can circumvent/hide some of the cache-miss latency with the help of prefetching, out-of-order execution and Hyper-Threading, a miss will still lead to a performance penalty. For CPUs without these features (like most ARM cores) this penalty can become substantial, leading to 100 or more stall cycles for a cache miss.
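To illustrate 1.), here is a minimal Free Pascal sketch (type and field names are invented for illustration, and the exact sizes printed depend on the target):

program PackingDemo;
{$mode objfpc}

type
  { Naive layout: every field occupies at least one byte. }
  TLooseFlags = record
    Visible:  Boolean;
    Dirty:    Boolean;
    Selected: Boolean;
    Kind:     0..15;     { subrange, still stored in a full byte }
  end;

  { Bit-packed layout: FPC packs the Booleans down to one bit each
    and the 0..15 subrange down to four bits, so the whole record
    fits into a single byte. }
  TPackedFlags = bitpacked record
    Visible:  Boolean;
    Dirty:    Boolean;
    Selected: Boolean;
    Kind:     0..15;
  end;

begin
  { Typically 4 bytes vs. 1 byte per element: an array of 1000 of
    these needs ~63 cache lines in the loose layout but only ~16 in
    the packed one, assuming 64-byte lines. }
  WriteLn('loose:  ', SizeOf(TLooseFlags), ' byte(s)');
  WriteLn('packed: ', SizeOf(TPackedFlags), ' byte(s)');
end.

The trade-off is that reading a bit-packed field costs an extra shift/mask, so this pays off mostly for code that is memory-bound rather than compute-bound.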
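And a sketch for 2.), again with invented names: in a linked-list search the loop only ever touches Next and Key, so putting them at offset 0 means the first words of a freshly fetched cache line already contain everything the loop needs, instead of it having to wait for payload bytes it never reads:

program FieldOrderDemo;
{$mode objfpc}

type
  PNode = ^TNode;
  { Hot fields first: Next and Key are read on every step of the
    traversal, so they sit at the start of the record and thus at
    the start of its cache line.  The rarely used payload comes
    last. }
  TNode = record
    Next:    PNode;
    Key:     LongWord;
    Payload: array[0..55] of Byte;   { only touched on a match }
  end;

function Find(Head: PNode; AKey: LongWord): PNode;
begin
  Result := Head;
  { objfpc mode short-circuits the "and", so Result^.Key is never
    dereferenced once Result is nil. }
  while (Result <> nil) and (Result^.Key <> AKey) do
    Result := Result^.Next;
end;

var
  N: PNode;
begin
  N := Find(nil, 42);   { empty list, just to exercise the code }
  WriteLn(N = nil);     { prints TRUE }
end.

On a simple in-order core that fills the line front to back this ordering keeps the stall after a miss as short as possible; out-of-order cores hide more of it, but the layout still costs nothing.

Nico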