On 3/6/13 7:42 PM, Kurt Smith wrote:
> And regarding performance, doing simple timings shows a 30%-ish
> slowdown for unaligned operations:
>
> In [36]: %timeit packed_arr['b']**2
> 100 loops, best of 3: 2.48 ms per loop
>
> In [37]: %timeit aligned_arr['b']**2
> 1000 loops, best of 3: 1.9 ms per loop
Hmm, that clearly depends on the architecture. On my machine:

In [1]: import numpy as np

In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)

In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)

In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)

In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)

In [6]: baligned = aligned_arr['b']

In [7]: bpacked = packed_arr['b']

In [8]: %timeit baligned**2
1000 loops, best of 3: 1.96 ms per loop

In [9]: %timeit bpacked**2
100 loops, best of 3: 7.84 ms per loop

That is, the unaligned column is 4x slower (!). numexpr allows
somewhat better results:

In [11]: %timeit numexpr.evaluate('baligned**2')
1000 loops, best of 3: 1.13 ms per loop

In [12]: %timeit numexpr.evaluate('bpacked**2')
1000 loops, best of 3: 865 us per loop

Yes, in this case the unaligned array is faster (by as much as 30%).
I think the reason is that numexpr optimizes the unaligned access by
copying the different chunks into internal buffers that fit in L1
cache. Apparently this is very beneficial in this case (not sure why,
though).

>
> Whereas summing shows just a 10%-ish slowdown:
>
> In [38]: %timeit packed_arr['b'].sum()
> 1000 loops, best of 3: 1.29 ms per loop
>
> In [39]: %timeit aligned_arr['b'].sum()
> 1000 loops, best of 3: 1.14 ms per loop

On my machine:

In [14]: %timeit baligned.sum()
1000 loops, best of 3: 1.03 ms per loop

In [15]: %timeit bpacked.sum()
100 loops, best of 3: 3.79 ms per loop

Again, the 4x slowdown is here. Using numexpr:

In [16]: %timeit numexpr.evaluate('sum(baligned)')
100 loops, best of 3: 2.16 ms per loop

In [17]: %timeit numexpr.evaluate('sum(bpacked)')
100 loops, best of 3: 2.08 ms per loop

Again, the unaligned case is (slightly) better. In this case numexpr
is a bit slower than NumPy because sum() is not parallelized
internally. Given that, I'm wondering whether some internal copies to
L1 in NumPy could help improve unaligned performance. Worth a try?
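To make the "internal copies to L1" idea concrete, here is a rough pure-Python sketch of the buffering pattern. It is not how numexpr or NumPy implement anything internally: the helper name chunked_square and the chunk size of 2048 elements (16 KB of int64) are assumptions chosen only for illustration, and the Python-level loop says nothing about the speed a C implementation would reach.

```python
import numpy as np


def chunked_square(col, chunk=2048):
    """Square a 1-D (possibly unaligned) column chunk by chunk.

    Hypothetical helper: each chunk is first copied into a small,
    contiguous, aligned scratch buffer, and the arithmetic is then
    done on that buffer instead of on the original column.
    """
    out = np.empty(col.shape, dtype=col.dtype)
    # 2048 int64 elements = 16 KB, an assumed size meant to fit in L1.
    buf = np.empty(chunk, dtype=col.dtype)
    for start in range(0, col.shape[0], chunk):
        stop = min(start + chunk, col.shape[0])
        n = stop - start
        buf[:n] = col[start:stop]  # strided/unaligned -> aligned copy
        np.multiply(buf[:n], buf[:n], out=out[start:stop])
    return out


# Same packed layout as in the session above.
packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
packed_arr = np.ones((10**6,), dtype=packed_dt)
bpacked = packed_arr['b']

# The ALIGNED flag reports whether NumPy considers the view aligned.
print("bpacked aligned:", bpacked.flags.aligned)

result = chunked_square(bpacked)
print("result aligned:", result.flags.aligned)
```

Whether this kind of copy pays off inside NumPy's own inner loops would have to be measured in C; the sketch only shows where the copy would sit relative to the computation.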
-- 
Francesc Alted