Uh, 15x slower for unaligned access is quite a lot. But Intel (and AMD)
architectures are much more tolerant in this respect (and improving).
For example, with a Xeon(R) CPU E5-2670 (2 years old) I get:
In [1]: import numpy as np
In [2]: shape = (10000, 10000)
In [3]: x_aligned = np.zeros(shape,
   ...:     dtype=[('x', np.float64), ('y', np.int64)])['x']
In [4]: x_unaligned = np.zeros(shape,
   ...:     dtype=[('y1', np.int8), ('x', np.float64), ('y2', np.int8, (7,))])['x']
In [5]: %timeit res = x_aligned ** 2
1 loops, best of 3: 289 ms per loop
In [6]: %timeit res = x_unaligned ** 2
1 loops, best of 3: 664 ms per loop
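By the way, one can verify that the two views really do differ in alignment
without timing anything, since NumPy tracks this in the array flags. A small
illustrative snippet (smaller shape than the benchmark, just for speed):

```python
import numpy as np

shape = (1000, 1000)
x_aligned = np.zeros(shape, dtype=[('x', np.float64), ('y', np.int64)])['x']
x_unaligned = np.zeros(
    shape,
    dtype=[('y1', np.int8), ('x', np.float64), ('y2', np.int8, (7,))])['x']

# In the second dtype, the 'x' field starts at byte offset 1 inside each
# 16-byte record, so its data pointer is off by one from 8-byte alignment.
print(x_aligned.flags.aligned)    # True
print(x_unaligned.flags.aligned)  # False
```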
so the added cost in this case is just a bit more than 2x. But you can
also alleviate this overhead by doing a copy that fits in cache prior
to the computations. numexpr does this:
https://github.com/pydata/numexpr/blob/master/numexpr/interp_body.cpp#L203
and the results are pretty good:
In [8]: import numexpr as ne
In [9]: %timeit res = ne.evaluate('x_aligned ** 2')
10 loops, best of 3: 133 ms per loop
In [10]: %timeit res = ne.evaluate('x_unaligned ** 2')
10 loops, best of 3: 134 ms per loop
i.e. there is no significant difference between aligned and unaligned
access to the data.
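The trick can be sketched in pure Python/NumPy. This is only an
illustration of the idea (the block size, the `blocked_square` name and
the squaring kernel are mine, not numexpr's actual code): copy each
cache-sized chunk into a freshly allocated (hence aligned) scratch
buffer, then run the kernel on the scratch:

```python
import numpy as np

def blocked_square(x, block=4096):
    """Square `x` chunk by chunk, copying each chunk into an aligned
    scratch buffer first so the hot loop always reads aligned memory."""
    flat = np.ravel(x)                         # view when possible
    out = np.empty(flat.size, dtype=flat.dtype)
    scratch = np.empty(block, dtype=flat.dtype)  # fresh allocation: aligned
    for start in range(0, flat.size, block):
        n = min(block, flat.size - start)
        scratch[:n] = flat[start:start + n]    # one cheap unaligned->aligned copy
        np.multiply(scratch[:n], scratch[:n], out=out[start:start + n])
    return out.reshape(x.shape)

# Exercise it on a deliberately misaligned float64 view:
buf = np.zeros(8 * 10000 + 1, dtype=np.uint8)
x = buf[1:].view(np.float64)
x[:] = np.arange(10000, dtype=np.float64)
print(np.array_equal(blocked_square(x), x ** 2))  # True
```

In C (as in the numexpr interpreter loop linked above) the copy and the
kernel would be fused, so the scratch buffer stays hot in L1/L2 cache.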
I wonder if the same technique could be applied to NumPy.
Francesc
On 17/04/14 16:26, Aron Ahmadia wrote:
Hmm, I wasn't being clear :)
The default malloc on BlueGene/Q only returns 8-byte alignment, but
the SIMD units need 32-byte alignment for loads, stores, and
operations or performance suffers. On the /P the required alignment
was 16 bytes, but malloc only gave you 8, and trying to perform
vectorized loads/stores generated alignment exceptions on unaligned
memory.
See https://wiki.alcf.anl.gov/parts/index.php/Blue_Gene/Q and
https://computing.llnl.gov/tutorials/bgp/BGP-usage.Walkup.pdf (slide
14 for an overview, slide 15 for the effective performance difference
between the unaligned/aligned code) for some notes on this.
A
On Thu, Apr 17, 2014 at 10:18 AM, Nathaniel Smith <n...@pobox.com> wrote:
On 17 Apr 2014 15:09, "Aron Ahmadia" <a...@ahmadia.net> wrote:
>
> > On the one hand it would be nice to actually know whether
> > posix_memalign is important, before making api decisions on this
> > basis.
>
> FWIW: On the lightweight IBM cores that the extremely popular
> BlueGene machines were based on, accessing unaligned memory raised
> system faults. The default behavior of these machines was to
> terminate the program if more than 1000 such errors occurred on a
> given process, and an environment variable allowed you to
> terminate the program if *any* unaligned memory access occurred.
> This is because unaligned memory accesses were 15x (or more)
> slower than aligned memory access.
>
> The newer /Q chips seem to be a little more forgiving of this,
> but I think one can in general expect allocated memory alignment
> to be an important performance technique for future high
> performance computing architectures.
Right, this much is true on lots of architectures, and so malloc
is careful to always return values with sufficient alignment (e.g.
8 bytes) to make sure that any standard operation can succeed.
The question here is whether it will be important to have *even
more* alignment than malloc gives us by default. A 16 or 32 byte
wide SIMD instruction might prefer that data have 16 or 32 byte
alignment, even if normal memory access for the types being
operated on only requires 4 or 8 byte alignment.
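For what it's worth, one can already get such over-aligned buffers from
Python today without posix_memalign, by over-allocating a byte buffer and
slicing off the misaligned prefix. A sketch (the `aligned_empty` helper
name is mine; a real fix would live at the allocator level in C):

```python
import numpy as np

def aligned_empty(shape, dtype=np.float64, alignment=32):
    """Return an empty array whose data pointer is `alignment`-byte aligned.

    Over-allocates a raw byte buffer, then slices from the first properly
    aligned offset. (At the C level one would call posix_memalign instead.)
    """
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment   # bytes to skip to reach alignment
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)

a = aligned_empty((100, 100), alignment=32)
print(a.ctypes.data % 32)  # 0
```

Note that the returned array only borrows the byte buffer's memory, so
this is fine for scratch space but not a substitute for aligning NumPy's
own allocations.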
-n
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
--
Francesc Alted