* Denys Vlasenko <[email protected]> wrote:

> I was thinking about Ingo's AMD results:
> 
> linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed
> linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed
> linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed
> 
> AMD is almost perfect. Having no alignment at all still works very 
> well. [...]

Not quite. As I mentioned in my post, the 'time elapsed' numbers 
were very noisy in the AMD case, and you've cut off the stddev column 
that shows this. Here is the full data:

 linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed    ( +-  2.74% )
 linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed    ( +-  1.84% )
 linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed    ( +-  2.15% )

A stddev of 2-3% for a 3.7% speedup is not conclusive.

What you should use instead are the cache-miss counts, which are a good 
proxy and statistically a lot more stable:

 linux-falign-functions=_64-bytes/res-amd.txt:        108,886,550      L1-icache-load-misses    ( +-  0.10% )  (100.00%)
 linux-falign-functions=__8-bytes/res-amd.txt:        123,810,566      L1-icache-load-misses    ( +-  0.18% )  (100.00%)
 linux-falign-functions=__1-bytes/res-amd.txt:        113,623,200      L1-icache-load-misses    ( +-  0.17% )  (100.00%)

which shows that 64-byte alignment still generates a better I$ layout 
than tight packing, resulting in about 4.2% fewer I$ misses.
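
To make the comparison above reproducible, here is a small sketch that recomputes the gap from the AMD L1-icache-load-misses counts and sets it against the reported stddev. Depending on which run is taken as the baseline, the gap comes out at roughly 4.2% or 4.3%. The "gap must exceed the noise" check is only a crude rule of thumb I am assuming here, not a proper significance test, and the helper names are made up for illustration.

```python
# L1-icache-load-misses and relative stddev (%) from the AMD runs above,
# keyed by the -falign-functions value.
amd = {
    64: (108_886_550, 0.10),
    8: (123_810_566, 0.18),
    1: (113_623_200, 0.17),
}


def pct_fewer(aligned, packed):
    """Misses saved by the aligned build, as a percentage of the packed count."""
    return (packed - aligned) / packed * 100.0


def pct_more(aligned, packed):
    """Extra misses in the packed build, as a percentage of the aligned count."""
    return (packed - aligned) / aligned * 100.0


aligned, aligned_noise = amd[64]
packed, packed_noise = amd[1]

fewer = pct_fewer(aligned, packed)
more = pct_more(aligned, packed)

# Assumed rule of thumb: compare the gap with the larger relative stddev.
noise = max(aligned_noise, packed_noise)

print(f"64-byte alignment: {fewer:.2f}% fewer misses than tight packing")
print(f"tight packing:     {more:.2f}% more misses than 64-byte alignment")
print(f"noise: +-{noise:.2f}%, gap exceeds noise: {fewer > noise}")
```

The same two helpers can be applied to any other pair of miss counts from these runs.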

On Intel it's more pronounced:

 linux-falign-functions=_64-bytes/res.txt:        647,853,942      L1-icache-load-misses    ( +-  0.07% )  (100.00%)
 linux-falign-functions=__1-bytes/res.txt:        724,539,055      L1-icache-load-misses    ( +-  0.31% )  (100.00%)

That is a 12% difference. Note that the Intel workload is running on 
SSDs, which makes the cache footprint several times larger, and the 
workload is also more realistic than the AMD test, which was running 
in tmpfs.

I think it's a fair bet that the AMD system would show a similar 
difference if it were to run the same workload.

Allowing smaller functions to be cut in half by cacheline boundaries 
looks like a losing strategy, especially with larger workloads.
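
To illustrate what "cut in half" costs, here is a minimal sketch that counts how many cachelines a function occupies depending on where it starts. It assumes 64-byte lines, and the 40-byte function and its offsets are hypothetical values, not taken from the kernel image.

```python
CACHELINE = 64  # bytes; assumed line size


def lines_touched(start, size, line=CACHELINE):
    """Number of cachelines covered by a function of `size` bytes at `start`."""
    first = start // line
    last = (start + size - 1) // line
    return last - first + 1


def straddles(start, size, line=CACHELINE):
    """True if the function occupies more lines than its size strictly needs."""
    minimum = -(-size // line)  # ceil(size / line)
    return lines_touched(start, size, line) > minimum


# Hypothetical 40-byte function: aligned at offset 0 vs packed at offset 48.
for start in (0, 48):
    print(start, lines_touched(start, 40), straddles(start, 40))
```

With tight packing the same small function can need two I$ lines instead of one, so every cold call to it can cost an extra miss.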

The modified scheme I suggested, 64-byte alignment plus intelligent 
packing, might do even better than dumb 64-byte alignment.
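
As a rough illustration only: one possible reading of "intelligent packing" is to place functions back to back and pad to the next 64-byte boundary only when a function would otherwise cross one. The sketch below compares that with aligning every function to 64 bytes. This is my assumption of how such a scheme could look, not what GCC or the linker does, and the function sizes are hypothetical.

```python
LINE = 64  # bytes; assumed cacheline size


def align_up(offset, alignment=LINE):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout_dumb(sizes):
    """Start every function on a 64-byte boundary."""
    offset, starts = 0, []
    for size in sizes:
        offset = align_up(offset)
        starts.append(offset)
        offset += size
    return starts, offset


def layout_packed(sizes):
    """Pack back to back; pad only if the function would cross a boundary."""
    offset, starts = 0, []
    for size in sizes:
        room = align_up(offset) - offset  # bytes left in the current line
        if room and size > room:
            offset = align_up(offset)  # would straddle: move to next line
        starts.append(offset)
        offset += size
    return starts, offset


sizes = [20, 30, 40, 10, 100, 8]  # hypothetical function sizes in bytes
print("dumb:  ", layout_dumb(sizes))
print("packed:", layout_packed(sizes))
```

In this toy input both layouts keep every function that fits in one line inside a single line, but the packed layout ends at a much smaller total size, which is where a smaller I$ footprint would come from.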

Thanks,

        Ingo