On 3/7/13 6:47 PM, Francesc Alted wrote:
> On 3/6/13 7:42 PM, Kurt Smith wrote:
>> And regarding performance, doing simple timings shows a 30%-ish
>> slowdown for unaligned operations:
>>
>> In [36]: %timeit packed_arr['b']**2
>> 100 loops, best of 3: 2.48 ms per loop
>>
>> In [37]: %timeit aligned_arr['b']**2
>> 1000 loops, best of 3: 1.9 ms per loop
>
> Hmm, that clearly depends on the architecture.  On my machine:
>
> In [1]: import numpy as np
>
> In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
>
> In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
>
> In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)
>
> In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)
>
> In [6]: baligned = aligned_arr['b']
>
> In [7]: bpacked = packed_arr['b']
>
> In [8]: %timeit baligned**2
> 1000 loops, best of 3: 1.96 ms per loop
>
> In [9]: %timeit bpacked**2
> 100 loops, best of 3: 7.84 ms per loop
>
> That is, the unaligned column is 4x slower (!).  numexpr allows
> somewhat better results:
>
> In [11]: %timeit numexpr.evaluate('baligned**2')
> 1000 loops, best of 3: 1.13 ms per loop
>
> In [12]: %timeit numexpr.evaluate('bpacked**2')
> 1000 loops, best of 3: 865 us per loop
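
For anyone who wants to see what actually differs between the two dtypes in the session quoted above, here is a small sketch. The helper name `describe` is made up for this example, and the aligned numbers printed depend on the platform's alignment rules for `i8`, so take it as an illustration rather than a reference:

```python
import numpy as np

# Same two dtypes as in the session quoted above.
aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)

def describe(dt):
    """Return (itemsize, byte offset of field 'b') for a structured dtype.

    Hypothetical helper, not part of NumPy.
    """
    return dt.itemsize, dt.fields['b'][1]

# Smaller arrays than in the timings; the layout is what matters here.
aligned_arr = np.ones((1000,), dtype=aligned_dt)
packed_arr = np.ones((1000,), dtype=packed_dt)

# The packed dtype stores 'b' right after the 1-byte 'a' field, so every
# 'b' value starts at an odd address; the aligned dtype pads 'a' instead.
print("aligned:", describe(aligned_dt), aligned_arr['b'].flags.aligned)
print("packed: ", describe(packed_dt), packed_arr['b'].flags.aligned)
```

The `flags.aligned` attribute of the column view is the quickest way to tell which of the two cases a given array falls into.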
Just for completeness, here is what Theano gets:

In [18]: import theano

In [20]: a = theano.tensor.vector()

In [22]: f = theano.function([a], a**2)

In [23]: %timeit f(baligned)
100 loops, best of 3: 7.74 ms per loop

In [24]: %timeit f(bpacked)
100 loops, best of 3: 12.6 ms per loop

So yeah, Theano is also slower for the unaligned case (but less than 2x
in this case).

>
> Yes, in this case, the unaligned array goes faster (as much as 30%).
> I think the reason is that numexpr optimizes the unaligned access by
> doing a copy of the different chunks in internal buffers that fits in
> L1 cache.  Apparently this is very beneficial in this case (not sure
> why, though).
>
>>
>> Whereas summing shows just a 10%-ish slowdown:
>>
>> In [38]: %timeit packed_arr['b'].sum()
>> 1000 loops, best of 3: 1.29 ms per loop
>>
>> In [39]: %timeit aligned_arr['b'].sum()
>> 1000 loops, best of 3: 1.14 ms per loop
>
> On my machine:
>
> In [14]: %timeit baligned.sum()
> 1000 loops, best of 3: 1.03 ms per loop
>
> In [15]: %timeit bpacked.sum()
> 100 loops, best of 3: 3.79 ms per loop
>
> Again, the 4x slowdown is here.  Using numexpr:
>
> In [16]: %timeit numexpr.evaluate('sum(baligned)')
> 100 loops, best of 3: 2.16 ms per loop
>
> In [17]: %timeit numexpr.evaluate('sum(bpacked)')
> 100 loops, best of 3: 2.08 ms per loop

And with Theano:

In [26]: f2 = theano.function([a], a.sum())

In [27]: %timeit f2(baligned)
100 loops, best of 3: 2.52 ms per loop

In [28]: %timeit f2(bpacked)
100 loops, best of 3: 7.43 ms per loop

Again, the unaligned case is significantly slower (as much as 3x here!).

-- 
Francesc Alted

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
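
As a footnote to the numexpr explanation quoted above (copying chunks into internal buffers that fit in L1 cache), the general idea can be sketched in plain NumPy. This is only an illustration under assumptions: the function name `chunked_square` and the chunk size are made up here, and this is not numexpr's actual internal code. No claim is made about how fast it runs on any given machine.

```python
import numpy as np

def chunked_square(col, chunk=4096):
    """Square `col` by staging one chunk at a time in an aligned buffer.

    `chunk` is a hypothetical value in elements: 4096 int64 values take
    32 KiB. L1 data cache sizes differ between CPUs.
    """
    out = np.empty(col.shape, dtype=col.dtype)
    buf = np.empty(chunk, dtype=col.dtype)  # fresh allocation, so aligned
    for start in range(0, col.shape[0], chunk):
        stop = min(start + chunk, col.shape[0])
        n = stop - start
        buf[:n] = col[start:stop]           # copy the (possibly unaligned) chunk
        np.multiply(buf[:n], buf[:n], out=out[start:stop])  # aligned compute
    return out

# Build a packed (unaligned) column like the one in the thread.
packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
packed_arr = np.zeros((10**5,), dtype=packed_dt)
packed_arr['b'] = np.arange(10**5)
bpacked = packed_arr['b']

result = chunked_square(bpacked)
```

The point of the design is that only the copy touches unaligned memory; the arithmetic itself always runs on the small aligned scratch buffer.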