On Thu, 2012-12-20 at 15:23 +0100, Francesc Alted wrote:
> On 12/20/12 9:53 AM, Henry Gomersall wrote:
> > On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
> >> The only scenario that I see that this would create unaligned arrays
> >> is for machines having AVX. But provided that the Intel architecture
> >> is making great strides in fetching unaligned data, I'd be surprised
> >> that the difference in performance would be even noticeable.
> >>
> >> Can you tell us which difference in performance are you seeing for an
> >> AVX-aligned array and other that is not AVX-aligned? Just curious.
> > Further to this point, from an Intel article...
> >
> > http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
> >
> > "Aligning data to vector length is always recommended. When using Intel
> > SSE and Intel SSE2 instructions, loaded data should be aligned to 16
> > bytes. Similarly, to achieve best results use Intel AVX instructions on
> > 32-byte vectors that are 32-byte aligned. The use of Intel AVX
> > instructions on unaligned 32-byte vectors means that every second load
> > will be across a cache-line split, since the cache line is 64 bytes.
> > This doubles the cache line split rate compared to Intel SSE code that
> > uses 16-byte vectors. A high cache-line split rate in memory-intensive
> > code is extremely likely to cause performance degradation. For that
> > reason, it is highly recommended to align the data to 32 bytes for use
> > with Intel AVX."
> >
> > Though it would be nice to put together a little example of this!
>
> Indeed, an example is what I was looking for. So provided that I have
> access to an AVX capable machine (having 6 physical cores), and that MKL
> 10.3 has support for AVX, I have made some comparisons using the
> Anaconda Python distribution (it ships with most packages linked against
> MKL 10.3).
<snip>

> All in all, it is not clear that AVX alignment would have an advantage,
> even for memory-bounded problems. But of course, if Intel people are
> saying that AVX alignment is important is because they have use cases
> for asserting this. It is just that I'm having a difficult time to find
> these cases.

Thanks for those examples, they were very interesting. I managed to
temporarily get my hands on a machine with AVX and I have shown some
speed-up with aligned arrays.

FFT (using my wrappers) gives about a 15% speedup.

Also this convolution code:
https://github.com/hgomersall/SSE-convolution/blob/master/convolve.c

Shows a small but repeatable speed-up (a few %) when using some aligned
loads (as many as I can work out to use!).

Cheers,

Henry
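
For reference, the cache-line arithmetic in the quoted Intel passage can be
checked with a minimal Python sketch. It only models byte addresses; it does
not measure performance, and it says nothing about how any particular CPU
handles a split load. The function name split_rate, the 8-byte misalignment
and the number of loads are illustrative assumptions, not taken from the
Intel article or from the code discussed in this thread.

```python
CACHE_LINE = 64  # bytes, the cache-line size stated in the quoted passage


def split_rate(base, width, n_loads=1024, line=CACHE_LINE):
    """Fraction of n_loads back-to-back loads of `width` bytes, starting at
    byte address `base`, that straddle a cache-line boundary."""
    splits = 0
    for i in range(n_loads):
        first = base + i * width   # address of the first byte loaded
        last = first + width - 1   # address of the last byte loaded
        if first // line != last // line:
            splits += 1            # the load touches two cache lines
    return splits / n_loads


if __name__ == "__main__":
    # Compare 16-byte (SSE-sized) and 32-byte (AVX-sized) loads, once from an
    # aligned base address and once from a base that is off by 8 bytes
    # (a hypothetical misalignment chosen for illustration).
    for name, width in (("SSE", 16), ("AVX", 32)):
        print(name, "aligned:", split_rate(0, width),
              "misaligned:", split_rate(8, width))
```

Under these assumptions the misaligned 16-byte loads split on one load in
four and the misaligned 32-byte loads on one load in two, which matches the
"every second load" and "doubles the cache line split rate" statements in
the quote. Whether that shows up as a measurable slowdown is exactly what
the timings in this thread are trying to establish.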