On 17/04/14 19:28, Julian Taylor wrote:
> On 17.04.2014 18:06, Francesc Alted wrote:
>
>> In [4]: x_unaligned = np.zeros(shape,
>>    dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
>
> On arrays of this size you won't see alignment issues; you are
> dominated by memory bandwidth. If at all, you will only see it if the
> data fits into the cache.
>
> It's also about being unaligned to SIMD vectors, not unaligned to
> basic types. But it doesn't matter anymore on modern x86 CPUs. I guess
> for array data, cache line splits should also not be a big concern.
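For reference, the structured-dtype trick quoted above yields a view whose data pointer is offset by one byte inside each 16-byte record, which NumPy itself reports as unaligned. A small self-contained check (the shape here is smaller than in the benchmarks, just for illustration):

```python
import numpy as np

shape = (1000, 1000)

# 'x' sits at offset 0 of each 16-byte record -> properly aligned float64
x_aligned = np.zeros(shape, dtype=[('x', np.float64), ('y', np.int64)])['x']

# 'x' sits at offset 1 of each 16-byte record -> misaligned float64
x_unaligned = np.zeros(
    shape, dtype=[('y1', np.int8), ('x', np.float64), ('y2', np.int8, (7,))]
)['x']

print(x_aligned.flags.aligned)    # True
print(x_unaligned.flags.aligned)  # False
```

Both views share the itemsize-16 stride of the parent record array; only the starting offset differs, which is exactly what makes the second one miss the natural 8-byte alignment of float64.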
Yes, that was my point: on x86 CPUs this is not such a big problem. But
still, a factor of 2 is significant, even for CPU-intensive tasks. For
example, computing sin() is affected similarly (sin() is using SIMD,
right?):

In [6]: shape = (10000, 10000)

In [7]: x_aligned = np.zeros(shape,
   dtype=[('x',np.float64),('y',np.int64)])['x']

In [8]: x_unaligned = np.zeros(shape,
   dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']

In [9]: %timeit res = np.sin(x_aligned)
1 loops, best of 3: 654 ms per loop

In [10]: %timeit res = np.sin(x_unaligned)
1 loops, best of 3: 1.08 s per loop

and again, numexpr can deal with that pretty well (using 8 physical
cores here):

In [6]: %timeit res = ne.evaluate('sin(x_aligned)')
10 loops, best of 3: 149 ms per loop

In [7]: %timeit res = ne.evaluate('sin(x_unaligned)')
10 loops, best of 3: 151 ms per loop

> Aligned allocators are not the only allocators which might be useful
> in numpy. Modern CPUs also support larger pages than 4K (huge pages up
> to 1GB in size), which reduces TLB cache misses. Memory of this type
> typically needs to be allocated with special mmap flags, though newer
> kernel versions can now also provide this memory to transparent
> anonymous pages (normal non-file mmaps).

That's interesting. In which scenarios do you think that could improve
performance?

>> In [8]: import numexpr as ne
>>
>> In [9]: %timeit res = ne.evaluate('x_aligned ** 2')
>> 10 loops, best of 3: 133 ms per loop
>>
>> In [10]: %timeit res = ne.evaluate('x_unaligned ** 2')
>> 10 loops, best of 3: 134 ms per loop
>>
>> i.e. there is not a significant difference between aligned and
>> unaligned access to data.
>>
>> I wonder if the same technique could be applied to NumPy.
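For what it's worth, the technique in question is essentially blocked evaluation: numexpr computes an expression on cache-sized chunks so that intermediates stay in cache instead of streaming through main memory. A rough single-threaded sketch of the idea in plain NumPy (the helper name and the 128 KB default block size are my own choices here, not taken from numexpr or from Julian's notebook):

```python
import numpy as np

def blocked_eval(ufunc, x, block_size=128 * 1024):
    """Apply `ufunc` to `x` in cache-sized blocks.

    Hypothetical helper for illustration, not numexpr's actual
    implementation. Writing into slices of a preallocated output
    avoids a full-size temporary per intermediate.
    """
    x_flat = x.reshape(-1)          # copies if `x` is a strided view
    out = np.empty_like(x_flat)
    n = max(1, block_size // x.dtype.itemsize)  # elements per block
    for start in range(0, x_flat.size, n):
        stop = start + n
        ufunc(x_flat[start:stop], out=out[start:stop])
    return out.reshape(x.shape)

x = np.random.rand(2000, 2000)
res = blocked_eval(np.sin, x)
```

The single-expression case shown above gains little from blocking; the payoff comes with compound expressions (e.g. `x**2 + 2*x`), where per-block scratch buffers replace several array-sized temporaries.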
> You already can do so with relatively simple means:
> http://nbviewer.ipython.org/gist/anonymous/10942132
>
> If you change the blocking function to get a function as input and use
> in-place operations, numpy can even beat numexpr (though I used the
> numexpr Ubuntu package, which might not be compiled optimally).
> This type of transformation can probably be applied on the AST quite
> easily.

That's smart. Yeah, I don't see a reason why numexpr would be
performing badly on Ubuntu. But I am not getting your performance for
blocked_thread on my AMI Linux vbox:

http://nbviewer.ipython.org/gist/anonymous/11000524

Oh well, threads are always tricky beasts.

By the way, apparently the optimal block size for my machine is
something like 1 MB, not 128 KB, although the difference is not big:

http://nbviewer.ipython.org/gist/anonymous/11002751

(thanks to Stefan Van der Walt for the script).

-- 
Francesc Alted

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion