On 12/21/12 1:35 PM, Dag Sverre Seljebotn wrote:
> On 12/20/2012 03:23 PM, Francesc Alted wrote:
>> On 12/20/12 9:53 AM, Henry Gomersall wrote:
>>> On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
>>>> The only scenario that I see in which this would create unaligned
>>>> arrays is for machines having AVX.  But given that the Intel
>>>> architecture is making great strides in fetching unaligned data, I'd
>>>> be surprised if the difference in performance were even noticeable.
>>>> Can you tell us what difference in performance you are seeing between
>>>> an AVX-aligned array and one that is not AVX-aligned?  Just curious.
>>> Further to this point, from an Intel article...
>>> http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
>>> "Aligning data to vector length is always recommended. When using Intel
>>> SSE and Intel SSE2 instructions, loaded data should be aligned to 16
>>> bytes. Similarly, to achieve best results use Intel AVX instructions on
>>> 32-byte vectors that are 32-byte aligned. The use of Intel AVX
>>> instructions on unaligned 32-byte vectors means that every second load
>>> will be across a cache-line split, since the cache line is 64 bytes.
>>> This doubles the cache line split rate compared to Intel SSE code that
>>> uses 16-byte vectors. A high cache-line split rate in memory-intensive
>>> code is extremely likely to cause performance degradation. For that
>>> reason, it is highly recommended to align the data to 32 bytes for use
>>> with Intel AVX."
>>> Though it would be nice to put together a little example of this!
>> Indeed, an example is what I was looking for.  Since I have access to
>> an AVX-capable machine (with 6 physical cores), and MKL 10.3 has
>> support for AVX, I have made some comparisons using the Anaconda
>> Python distribution (it ships with most packages linked against
>> MKL 10.3).
>> Here is a first example using a DGEMM operation.  First, using a NumPy
>> that is not turbo-loaded with MKL:
>> In [34]: a = np.linspace(0,1,1e7)
>> In [35]: b = a.reshape(1000, 10000)
>> In [36]: c = a.reshape(10000, 1000)
>> In [37]: time d = np.dot(b,c)
>> CPU times: user 7.56 s, sys: 0.03 s, total: 7.59 s
>> Wall time: 7.63 s
>> In [38]: time d = np.dot(c,b)
>> CPU times: user 78.52 s, sys: 0.18 s, total: 78.70 s
>> Wall time: 78.89 s
>> This is getting around 2.6 GFlop/s (the first product takes about 2e10
>> floating-point operations in 7.63 s).  Now, with an MKL 10.3 NumPy and
>> AVX-unaligned data:
>> In [7]: p = ctypes.create_string_buffer(int(8e7)); hex(ctypes.addressof(p))
>> Out[7]: '0x7fcdef3b4010'  # 16 bytes alignment
>> In [8]: a = np.ndarray(1e7, "f8", p)
>> In [9]: a[:] = np.linspace(0,1,1e7)
>> In [10]: b = a.reshape(1000, 10000)
>> In [11]: c = a.reshape(10000, 1000)
>> In [37]: %timeit d = np.dot(b,c)
>> 10 loops, best of 3: 164 ms per loop
>> In [38]: %timeit d = np.dot(c,b)
>> 1 loops, best of 3: 1.65 s per loop
>> That is around 120 GFlop/s (i.e. almost 50x faster than without MKL/AVX).
>> Now, using MKL 10.3 and AVX-aligned data:
>> In [21]: p2 = ctypes.create_string_buffer(int(8e7+16)); hex(ctypes.addressof(p))
>> Out[21]: '0x7f8cb9598010'
>> In [22]: a2 = np.ndarray(1e7+2, "f8", p2)[2:]  # skip the first 16 bytes (now 32-byte aligned)
>> In [23]: a2[:] = np.linspace(0,1,1e7)
>> In [24]: b2 = a2.reshape(1000, 10000)
>> In [25]: c2 = a2.reshape(10000, 1000)
>> In [35]: %timeit d2 = np.dot(b2,c2)
>> 10 loops, best of 3: 163 ms per loop
>> In [36]: %timeit d2 = np.dot(c2,b2)
>> 1 loops, best of 3: 1.67 s per loop
>> So, again, around 120 GFlop/s, and the difference with respect to
>> unaligned AVX data is negligible.
>> One may argue that DGEMM is CPU-bound and that memory access plays
>> little role here, and that is certainly true.  So, let's go with a more
>> memory-bound problem, like computing a transcendental function with
>> numexpr.  First with a NumPy and numexpr with no MKL support:
>> In [8]: a = np.linspace(0,1,1e8)
>> In [9]: %time b = np.sin(a)
>> CPU times: user 1.20 s, sys: 0.22 s, total: 1.42 s
>> Wall time: 1.42 s
>> In [10]: import numexpr as ne
>> In [12]: %time b = ne.evaluate("sin(a)")
>> CPU times: user 1.42 s, sys: 0.27 s, total: 1.69 s
>> Wall time: 0.37 s
>> This is around 4x faster than the regular 'sin' in libc, and about the
>> same speed as a memcpy():
>> In [13]: %time c = a.copy()
>> CPU times: user 0.19 s, sys: 0.20 s, total: 0.39 s
>> Wall time: 0.39 s
>> Now, with an MKL-aware numexpr and non-AVX alignment:
>> In [8]: p = ctypes.create_string_buffer(int(8e8)); hex(ctypes.addressof(p))
>> Out[8]: '0x7fce435da010'  # 16 bytes alignment
>> In [9]: a = np.ndarray(1e8, "f8", p)
>> In [10]: a[:] = np.linspace(0,1,1e8)
>> In [11]: %time b = ne.evaluate("sin(a)")
>> CPU times: user 0.44 s, sys: 0.27 s, total: 0.71 s
>> Wall time: 0.15 s
>> That is, more than 2x faster than a memcpy() on this system, meaning
>> that the problem is truly memory-bound.  So now, with an AVX-aligned
>> buffer:
>> In [14]: a2 = a[2:]  # skip the first 16 bytes
>> In [15]: %time b = ne.evaluate("sin(a2)")
>> CPU times: user 0.40 s, sys: 0.28 s, total: 0.69 s
>> Wall time: 0.16 s
>> Again, times are very close.  Just to make sure, let's use the timeit magic:
>> In [16]: %timeit b = ne.evaluate("sin(a)")
>> 10 loops, best of 3: 159 ms per loop   # unaligned
>> In [17]: %timeit b = ne.evaluate("sin(a2)")
>> 10 loops, best of 3: 154 ms per loop   # aligned
>> All in all, it is not clear that AVX alignment would have an advantage,
>> even for memory-bound problems.  But of course, if Intel people are
>> saying that AVX alignment is important, it is because they have use
>> cases that support this.  It is just that I'm having a difficult time
>> finding these cases.
> Hmm, I think it is the opposite, that it is for CPU-bound problems that
> alignment would have an effect? I.e. the MOVUPD would be doing some
> shuffling etc. to get around the non-alignment, which only matters if
> the data is already in cache.
> (There are other instructions, like the STREAM instructions and the
> direct writes and so on, which are much more important for the
> non-cached case. At least that's my understanding.)

Yes, I think you are right.  It is just that I was a bit disappointed 
with the DGEMM not being affected by non-AVX alignment and tried a 
memory-bound problem, just in case.  But as I said before, probably 
Intel people have dealt with both aligned and unaligned data.
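
For anyone wanting to repeat the comparison without depending on where
the allocator happens to place the buffer, here is a minimal sketch of
the same ctypes trick used above, generalised to any alignment.  The
helper name aligned_doubles and its parameters are my own, made up for
illustration and not part of NumPy or ctypes; it assumes 8-byte doubles
and a power-of-two (or at least positive) alignment.

```python
import ctypes

def aligned_doubles(n, align=32):
    """Return a ctypes array of n doubles whose address is a multiple of align.

    Hypothetical helper for illustration only.
    """
    itemsize = ctypes.sizeof(ctypes.c_double)
    # Over-allocate by `align` bytes so an aligned start address always exists.
    buf = ctypes.create_string_buffer(n * itemsize + align)
    # Number of bytes to skip to reach the next multiple of `align`.
    offset = (-ctypes.addressof(buf)) % align
    # from_buffer() shares memory with `buf` and keeps it alive.
    return (ctypes.c_double * n).from_buffer(buf, offset)

arr = aligned_doubles(1000)
print(hex(ctypes.addressof(arr)), ctypes.addressof(arr) % 32)
```

The resulting object supports the buffer protocol, so it should be
possible to wrap it without a copy using something like
np.frombuffer(arr, dtype="f8"), and to get a deliberately misaligned
view for comparison by slicing off a couple of elements as in the
sessions above.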

Francesc Alted

