On 12/21/12 1:35 PM, Dag Sverre Seljebotn wrote:
> On 12/20/2012 03:23 PM, Francesc Alted wrote:
>> On 12/20/12 9:53 AM, Henry Gomersall wrote:
>>> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
>>>> The only scenario that I see that this would create unaligned arrays is
>>>> for machines having AVX. But provided that the Intel architecture is
>>>> making great strides in fetching unaligned data, I'd be surprised that
>>>> the difference in performance would be even noticeable.
>>>>
>>>> Can you tell us which difference in performance are you seeing for an
>>>> AVX-aligned array and other that is not AVX-aligned? Just curious.
>>> Further to this point, from an Intel article...
>>>
>>> http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
>>>
>>> "Aligning data to vector length is always recommended. When using Intel
>>> SSE and Intel SSE2 instructions, loaded data should be aligned to 16
>>> bytes. Similarly, to achieve best results use Intel AVX instructions on
>>> 32-byte vectors that are 32-byte aligned. The use of Intel AVX
>>> instructions on unaligned 32-byte vectors means that every second load
>>> will be across a cache-line split, since the cache line is 64 bytes.
>>> This doubles the cache line split rate compared to Intel SSE code that
>>> uses 16-byte vectors. A high cache-line split rate in memory-intensive
>>> code is extremely likely to cause performance degradation. For that
>>> reason, it is highly recommended to align the data to 32 bytes for use
>>> with Intel AVX."
>>>
>>> Though it would be nice to put together a little example of this!
>> Indeed, an example is what I was looking for. So provided that I have
>> access to an AVX capable machine (having 6 physical cores), and that MKL
>> 10.3 has support for AVX, I have made some comparisons using the
>> Anaconda Python distribution (it ships with most packages linked against
>> MKL 10.3).
>>
>> Here is a first example using a DGEMM operation. First using a NumPy
>> that is not turbo-loaded with MKL:
>>
>> In [34]: a = np.linspace(0,1,1e7)
>>
>> In [35]: b = a.reshape(1000, 10000)
>>
>> In [36]: c = a.reshape(10000, 1000)
>>
>> In [37]: time d = np.dot(b,c)
>> CPU times: user 7.56 s, sys: 0.03 s, total: 7.59 s
>> Wall time: 7.63 s
>>
>> In [38]: time d = np.dot(c,b)
>> CPU times: user 78.52 s, sys: 0.18 s, total: 78.70 s
>> Wall time: 78.89 s
>>
>> This is getting around 2.6 GFlop/s. Now, with a MKL 10.3 NumPy and
>> AVX-unaligned data:
>>
>> In [7]: p = ctypes.create_string_buffer(int(8e7)); hex(ctypes.addressof(p))
>> Out[7]: '0x7fcdef3b4010' # 16 bytes alignment
>>
>> In [8]: a = np.ndarray(1e7, "f8", p)
>>
>> In [9]: a[:] = np.linspace(0,1,1e7)
>>
>> In [10]: b = a.reshape(1000, 10000)
>>
>> In [11]: c = a.reshape(10000, 1000)
>>
>> In [37]: %timeit d = np.dot(b,c)
>> 10 loops, best of 3: 164 ms per loop
>>
>> In [38]: %timeit d = np.dot(c,b)
>> 1 loops, best of 3: 1.65 s per loop
>>
>> That is around 120 GFlop/s (i.e. almost 50x faster than without MKL/AVX).
>>
>> Now, using MKL 10.3 and AVX-aligned data:
>>
>> In [21]: p2 = ctypes.create_string_buffer(int(8e7+16)); hex(ctypes.addressof(p2))
>> Out[21]: '0x7f8cb9598010'
>>
>> In [22]: a2 = np.ndarray(1e7+2, "f8", p2)[2:] # skip the first 16 bytes (now is 32-bytes aligned)
>>
>> In [23]: a2[:] = np.linspace(0,1,1e7)
>>
>> In [24]: b2 = a2.reshape(1000, 10000)
>>
>> In [25]: c2 = a2.reshape(10000, 1000)
>>
>> In [35]: %timeit d2 = np.dot(b2,c2)
>> 10 loops, best of 3: 163 ms per loop
>>
>> In [36]: %timeit d2 = np.dot(c2,b2)
>> 1 loops, best of 3: 1.67 s per loop
>>
>> So, again, around 120 GFlop/s, and the difference wrt unaligned AVX
>> data is negligible.
>>
>> One may argue that DGEMM is CPU-bounded and that memory access plays
>> little role here, and that is certainly true.
>> So, let's go with a more
>> memory-bounded problem, like computing a transcendental function with
>> numexpr. First with a NumPy and numexpr with no MKL support:
>>
>> In [8]: a = np.linspace(0,1,1e8)
>>
>> In [9]: %time b = np.sin(a)
>> CPU times: user 1.20 s, sys: 0.22 s, total: 1.42 s
>> Wall time: 1.42 s
>>
>> In [10]: import numexpr as ne
>>
>> In [12]: %time b = ne.evaluate("sin(a)")
>> CPU times: user 1.42 s, sys: 0.27 s, total: 1.69 s
>> Wall time: 0.37 s
>>
>> This time is around 4x faster than regular 'sin' in libc, and about the
>> same speed as a memcpy():
>>
>> In [13]: %time c = a.copy()
>> CPU times: user 0.19 s, sys: 0.20 s, total: 0.39 s
>> Wall time: 0.39 s
>>
>> Now, with a MKL-aware numexpr and non-AVX alignment:
>>
>> In [8]: p = ctypes.create_string_buffer(int(8e8)); hex(ctypes.addressof(p))
>> Out[8]: '0x7fce435da010' # 16 bytes alignment
>>
>> In [9]: a = np.ndarray(1e8, "f8", p)
>>
>> In [10]: a[:] = np.linspace(0,1,1e8)
>>
>> In [11]: %time b = ne.evaluate("sin(a)")
>> CPU times: user 0.44 s, sys: 0.27 s, total: 0.71 s
>> Wall time: 0.15 s
>>
>> That is, more than 2x faster than a memcpy() in this system, meaning
>> that the problem is truly memory-bounded. So now, with an AVX aligned
>> buffer:
>>
>> In [14]: a2 = a[2:] # skip the first 16 bytes
>>
>> In [15]: %time b = ne.evaluate("sin(a2)")
>> CPU times: user 0.40 s, sys: 0.28 s, total: 0.69 s
>> Wall time: 0.16 s
>>
>> Again, times are very close. Just to make sure, let's use the timeit magic:
>>
>> In [16]: %timeit b = ne.evaluate("sin(a)")
>> 10 loops, best of 3: 159 ms per loop # unaligned
>>
>> In [17]: %timeit b = ne.evaluate("sin(a2)")
>> 10 loops, best of 3: 154 ms per loop # aligned
>>
>> All in all, it is not clear that AVX alignment would have an advantage,
>> even for memory-bounded problems. But of course, if Intel people are
>> saying that AVX alignment is important, it is because they have use cases
>> for asserting this.
>> It is just that I'm having a difficult time finding
>> these cases.
> Hmm, I think it is the opposite, that it is for CPU-bound problems that
> alignment would have an effect? I.e. the MOVUPD would be doing some
> shuffling etc. to get around the non-alignment, which only matters if
> the data is already in cache.
>
> (There are other instructions, like the STREAM instructions and the
> direct writes and so on, which are much more important for the
> non-cached case. At least that's my understanding.)
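
For reference, the over-allocate-and-skip trick used in the quoted session (allocating 16 extra bytes and skipping the first two doubles) can be generalized a little. What follows is only a minimal sketch using ctypes from the standard library; the helper names `aligned_offset` and `alloc_aligned` are hypothetical and are not part of NumPy or ctypes.

```python
import ctypes


def aligned_offset(address, alignment):
    """Bytes to skip from `address` to reach the next multiple of `alignment`."""
    return (-address) % alignment


def alloc_aligned(nbytes, alignment=32):
    """Hypothetical helper: over-allocate a ctypes buffer and return
    (buffer, offset) so that addressof(buffer) + offset is aligned."""
    buf = ctypes.create_string_buffer(nbytes + alignment)
    offset = aligned_offset(ctypes.addressof(buf), alignment)
    return buf, offset


n = 1000  # number of float64 items
buf, off = alloc_aligned(8 * n, alignment=32)

# View the aligned part of the buffer as an array of C doubles.
doubles = (ctypes.c_double * n).from_buffer(buf, off)
start = ctypes.addressof(doubles)
print(hex(start), start % 32)
```

For the address shown in the quoted session (ending in 0x10), `aligned_offset` gives 16 bytes, i.e. the two doubles that were skipped by hand. I would expect the same buffer to be usable from NumPy with something like `np.ndarray(n, "f8", buffer=buf, offset=off)`, but I have not timed that variant here.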
Yes, I think you are right. It is just that I was a bit disappointed
with the DGEMM not being affected by non-AVX alignment and tried a
memory-bound problem, just in case. But as I said before, probably
Intel people have dealt with both aligned and unaligned data.

--
Francesc Alted
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion