On 17/04/14 21:19, Julian Taylor wrote:
> On 17.04.2014 20:30, Francesc Alted wrote:
>> On 17/04/14 19:28, Julian Taylor wrote:
>>> On 17.04.2014 18:06, Francesc Alted wrote:
>>>
>>>> In [4]: x_unaligned = np.zeros(shape,
>>>> dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
>>> On arrays of this size you won't see alignment issues; you are
>>> dominated by memory bandwidth. If anything, you will only see it if
>>> the data fits into the cache.
>>> It's also about being unaligned relative to SIMD vectors, not to
>>> basic types. But it doesn't matter anymore on modern x86 CPUs. I
>>> guess for array data, cache line splits should also not be a big
>>> concern.
>> Yes, that was my point: on x86 CPUs this is not such a big problem.
>> But still, a factor of 2 is significant, even for CPU-intensive
>> tasks. For example, computing sin() is affected similarly (sin() is
>> using SIMD, right?):
>>
>> In [6]: shape = (10000, 10000)
>>
>> In [7]: x_aligned = np.zeros(shape,
>> dtype=[('x',np.float64),('y',np.int64)])['x']
>>
>> In [8]: x_unaligned = np.zeros(shape,
>> dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
>>
>> In [9]: %timeit res = np.sin(x_aligned)
>> 1 loops, best of 3: 654 ms per loop
>>
>> In [10]: %timeit res = np.sin(x_unaligned)
>> 1 loops, best of 3: 1.08 s per loop
>>
>> and again, numexpr can deal with that pretty well (using 8 physical
>> cores here):
>>
>> In [6]: %timeit res = ne.evaluate('sin(x_aligned)')
>> 10 loops, best of 3: 149 ms per loop
>>
>> In [7]: %timeit res = ne.evaluate('sin(x_unaligned)')
>> 10 loops, best of 3: 151 ms per loop
> In this case the unaligned input triggers a strided loop of memcpy
> calls to copy the data into an aligned buffer, which is terrible for
> performance, even compared to the expensive sin call.
> numexpr handles this well, as it allows the compiler to replace the
> memcpy with inline assembly (a mov instruction).
> We could fix that in numpy, though I don't consider it very important;
> you almost always have base-type-aligned memory.

Well, that *could* be important for evaluating conditions on structured
arrays, as it is pretty easy to end up with unaligned 'columns'. But
apparently this does not affect numpy very much:

In [23]: na_aligned = np.fromiter((("", i, i*2) for i in xrange(N)),
dtype="S16,i4,i8")

In [24]: na_unaligned = np.fromiter((("", i, i*2) for i in xrange(N)),
dtype="S15,i4,i8")

In [25]: %time sum((r['f1'] for r in na_aligned[na_aligned['f2'] > 10]))
CPU times: user 10.2 s, sys: 93 ms, total: 10.3 s
Wall time: 10.3 s
Out[25]: 49999994999985

In [26]: %time sum((r['f1'] for r in na_unaligned[na_unaligned['f2'] > 10]))
CPU times: user 10.2 s, sys: 82 ms, total: 10.3 s
Wall time: 10.3 s
Out[26]: 49999994999985

probably because the bottleneck is somewhere else. So yeah, probably
not worth worrying about that.
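A quick way to confirm that the bottleneck above is the per-row Python
generator rather than field alignment is to run the same reduction as a
vectorized boolean selection. A minimal sketch, assuming N = 10**7
(an assumption, but one that reproduces the Out[25]/Out[26] value,
since sum(range(6, 10**7)) == 49999994999985):

import numpy as np

N = 10**7  # assumed; sum(range(6, N)) matches Out[25]/Out[26] above

na_aligned = np.fromiter((("", i, i * 2) for i in range(N)),
                         dtype="S16,i4,i8")
na_unaligned = np.fromiter((("", i, i * 2) for i in range(N)),
                           dtype="S15,i4,i8")

# Same reduction, but done entirely in C: mask on 'f2', fancy-index the
# 'f1' column, and sum with a 64-bit accumulator to avoid i4 overflow.
for na in (na_aligned, na_unaligned):
    print(na['f1'][na['f2'] > 10].sum(dtype=np.int64))
# -> 49999994999985 in both cases; the aligned/unaligned difference
#    stays negligible, and both run far faster than the per-record
#    generator, which is consistent with the interpreter loop (not
#    alignment) being the bottleneck.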
> (sin is not a SIMD-using function, unless you use a vector math
> library, which numpy does not support directly yet)

Ah, so MKL makes use of SIMD for computing sin(), but that is not the
case in general. But you later said that numpy's sqrt *is* making use
of SIMD. I wonder why.

>>> Aligned allocators are not the only allocators which might be
>>> useful in numpy. Modern CPUs also support pages larger than 4K
>>> (huge pages, up to 1GB in size), which reduce TLB cache misses.
>>> Memory of this type typically needs to be allocated with special
>>> mmap flags, though newer kernel versions can now also provide it
>>> via transparent anonymous pages (normal non-file mmaps).
>> That's interesting. In which scenarios do you think that could
>> improve performance?
> It might improve all numpy operations dealing with big arrays.
> Big arrays trigger many large temporaries, meaning glibc uses mmap,
> meaning lots of moving of address space between the kernel and
> userspace.
> But I haven't benchmarked it, so it could also be completely
> irrelevant.

I was curious about this, and apparently the speedup that large page
caches typically bring is around 5%:

http://stackoverflow.com/questions/14275170/performance-degradation-with-large-pages

not a big deal, but it is something.

> Also, memory fragments really fast, so after a few hours of operation
> you often can't allocate any huge pages anymore; you need to reserve
> space for them, which requires special setup of the machines.
>
> Another possibility for special allocators are NUMA allocators that
> ensure you get memory local to a specific compute node, regardless of
> the system NUMA policy.
> But again, it's probably not very important, as Python has poor
> thread scalability anyway; these are just examples for keeping the
> allocators in numpy flexible and not binding us to what Python does.

Agreed.
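In the meantime, a user-level way to get, say, 64-byte-aligned arrays
from plain NumPy is to over-allocate a byte buffer and slice it at a
suitable offset. A minimal sketch (aligned_empty is a hypothetical
helper written for this post, not an existing numpy or numexpr API):

import numpy as np

def aligned_empty(shape, dtype=np.float64, align=64):
    """Hypothetical helper: return an uninitialized array whose data
    pointer is aligned to `align` bytes, by over-allocating a byte
    buffer and slicing it at the first suitably aligned offset."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = np.empty(nbytes + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)

a = aligned_empty((1000, 1000), align=64)
assert a.ctypes.data % 64 == 0 and a.flags.aligned

# For contrast, the 'x' field view from earlier in the thread starts
# 1 byte into each 16-byte record, so numpy itself reports it as
# unaligned (assuming the base buffer is 8-byte aligned, as malloc
# normally guarantees):
x = np.zeros(10, dtype=[('y1', np.int8), ('x', np.float64),
                        ('y2', np.int8, (7,))])['x']
print(x.flags.aligned, x.ctypes.data % 8)  # usually: False 1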
> That's smart. Yeah, I don't see a reason why numexpr would be
> performing badly on Ubuntu. But I am not getting your performance for
> blocked_thread on my AMI linux vbox:
>
> http://nbviewer.ipython.org/gist/anonymous/11000524
>
> My numexpr amd64 package does not make use of SIMD, e.g. for sqrt,
> which is vectorized in numpy:
>
> numexpr:
>  1.30 │ 4638:   sqrtss (%r14),%xmm0
>  0.01 │         ucomis %xmm0,%xmm0
>  0.00 │       ↓ jp     11ec4
>  4.14 │ 4646:   movss  %xmm0,(%r15,%r12,1)
>       │         add    %rbp,%r14
>       │         add    $0x4,%r12
> (unrolled a couple of times)
>
> vs numpy:
> 83.25 │ 190:   sqrtps (%rbx,%r12,4),%xmm0
>  0.52 │        movaps %xmm0,0x0(%rbp,%r12,4)
> 14.63 │        add    $0x4,%r12
>  1.60 │        cmp    %rdx,%r12
>       │      ↑ jb     190
>
> (note the ps vs ss suffix: packed vs scalar)

Yup, I can reproduce that:

In [4]: a = np.random.rand(int(1e8))

In [5]: %timeit np.sqrt(a)
1 loops, best of 3: 558 ms per loop

In [6]: %timeit ne.evaluate('sqrt(a)')
1 loops, best of 3: 249 ms per loop

In [7]: ne.set_num_threads(1)
Out[7]: 8

In [8]: %timeit ne.evaluate('sqrt(a)')
1 loops, best of 3: 924 ms per loop

So yes, the non-SIMD version of sqrt in numexpr performs considerably
more slowly than the SIMD one in NumPy. Of course, a numexpr compiled
with MKL support can achieve performance similar to numpy's in
single-threaded mode:

In [4]: %timeit ne.evaluate('sqrt(a)')
1 loops, best of 3: 191 ms per loop

In [5]: ne.set_num_threads(1)
Out[5]: 8

In [6]: %timeit ne.evaluate('sqrt(a)')
1 loops, best of 3: 565 ms per loop

So sqrt in numpy has nearly the same speed as the one in MKL. Again, I
wonder why :)

-- 
Francesc Alted

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion