On 15 June 2010 11:16, Pauli Virtanen <p...@iki.fi> wrote:
> Tue, 2010-06-15 at 10:10 -0400, Anne Archibald wrote:
>> Correct me if I'm wrong, but this code still doesn't seem to make the
>> optimization of flattening arrays as much as possible. The array you
>> get out of np.zeros((100,100)) can be iterated over as an array of
>> shape (10000,), which should yield very substantial speedups. Since
>> most arrays one operates on are like this, there's potentially a large
>> speedup here. (On the other hand, if this optimization is being done,
>> then these tests are somewhat deceptive.)
>
> It does perform this optimization, and unravels the loop as much as
> possible. If all arrays are wholly contiguous, iterators are not even
> used in the ufunc loop. Check the part after
>
>     /* Determine how many of the trailing dimensions are contiguous */
>
> However, in practice it seems that this typically is not a significant
> win -- I don't get speedups over the unoptimized numpy code even for
> shapes
>
>     (2,)*20
>
> where you'd think that the iterator overhead could be important:
>
>     x = np.zeros((2,)*20)
>     %timeit np.cos(x)
>     10 loops, best of 3: 89.9 ms (90.2 ms) per loop
>
> This is actually consistent with the result
>
>     x = np.zeros((2,)*20)
>     y = x.flatten()
>     %timeit x+x
>     10 loops, best of 3: 20.9 ms per loop
>     %timeit y+y
>     10 loops, best of 3: 21 ms per loop
>
> that you can probably verify with any Numpy installation.
>
> This result may actually be because the Numpy ufunc inner loops are not
> especially optimized -- they don't use fixed stride sizes etc., and so
> are not much more efficient than using array iterators.
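For concreteness, the property the flattening optimization relies on can be checked directly: a wholly C-contiguous array can be traversed as a 1-D buffer, and ravel() on such an array is a view rather than a copy (a minimal sketch of the invariant, not of the ufunc internals):

```python
import numpy as np

# A wholly C-contiguous array can be traversed as if it were 1-D.
x = np.zeros((2,) * 20)
assert x.flags['C_CONTIGUOUS']

# ravel() on a contiguous array returns a view: same buffer, shape
# (2**20,), so a ufunc can run one flat inner loop with a single stride.
y = x.ravel()
assert y.base is x
assert y.shape == (2 ** 20,)
```

(Note that flatten(), used in the timing above, always copies; the copy is why x+x and y+y timing identically is a fair comparison of loop overhead.)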
This is a bit strange. I think I'd still vote for including this
optimization, since one hopes the inner loop will get faster at some
point. (If nothing else, the code generator can probably be made to
generate specialized loops for common cases.)

>> On the other hand, it seems to me there's still some question about
>> how to optimize execution order when the ufunc is dealing with two or
>> more arrays with different memory layouts. In such a case, which one
>> do you reorder in favour of?
>
> The axis permutation is chosen so that the sum (over different arrays)
> of strides (for each dimension) is decreasing towards the inner loops.
>
> I'm not sure, however, that this is the optimal choice. Things may
> depend quite a bit on the cache architecture here.

I suspect that it isn't optimal. As I understand it, the key restriction
imposed by the cache architecture is that an entire cache line - 64
bytes or so - must be loaded at once. For strides that are larger than
64 bytes, I suspect that it simply doesn't matter how big they are.
(There may be some subtle issues with cache associativity, but these
would be extremely architecture-dependent.) So I would say: first do
any dimensions in which some or all strides are less than 64 bytes,
starting from the smallest. After that, any order you like is fine.

>> Is it acceptable to return freshly-allocated arrays that are not
>> C-contiguous?
>
> Yes, it returns non-C-contiguous arrays. The contiguity is determined
> so that the ufunc loop itself can access the memory with a single
> stride.
>
> This may cause some speed loss in some expressions. Perhaps there
> should be a subtle bias towards C order? But I'm not sure this is
> worth the bother.

I'm more worried that this may violate some users' assumptions. If a
user knows they need an array to be in C order, they really should use
ascontiguousarray. But as it stands, it has been enough to make sure
the array is newly allocated as the result of an arithmetic expression.
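The stride-sum heuristic Pauli describes can be modelled in a few lines (a simplified sketch, not the actual C implementation - real ufuncs also handle broadcasting, buffering, and negative strides; the cache-line refinement suggested above would further special-case strides under ~64 bytes):

```python
import numpy as np

def loop_order(*arrays):
    """Order the axes so that the summed stride (over all operands)
    decreases towards the inner loop, per the heuristic above.

    Returns axis indices from outermost to innermost.
    """
    ndim = arrays[0].ndim
    # Sum of |stride| across operands, per axis.
    total = [sum(abs(a.strides[axis]) for a in arrays)
             for axis in range(ndim)]
    # Outermost axis first: largest combined stride down to smallest.
    return sorted(range(ndim), key=lambda axis: -total[axis])
```

For a C-order operand this recovers the natural C loop order, and for a Fortran-order operand it reverses the axes, which is exactly the behaviour that can produce non-C-contiguous output.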
Incautious users could suddenly start seeing copies happening, or even
transposed arrays being passed to C and Fortran code.

It's also worth exploring common usage patterns to make sure that numpy
still gravitates towards using C-contiguous arrays for everything. I'm
imagining a user who at some point adds a transposed array to a normal
array, then does a long series of computations on the result. We want
as many of those operations as possible to operate on contiguous
arrays, but it's possible that an out-of-order array could propagate
indefinitely, forcing all loops to be done with one array having large
strides, and resulting in output that is still out-of-order. Some
preference for C-contiguous output is worth adding.

It would also be valuable to build some kind of test harness to track
the memory layout of the arrays generated in some "typical"
calculations, as well as the ordering of the loops used.

More generally, this problem is exactly what ATLAS is for - finding
cache-optimal orders for linear algebra. So we shouldn't expect it to
be simple.

Anne

> --
> Pauli Virtanen

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
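A first cut at such a harness only needs the array flags (a sketch; trace_layout is a hypothetical helper name, and the flags reported for the intermediate result depend on the NumPy version and its output-order policy):

```python
import numpy as np

def trace_layout(label, arr):
    # Hypothetical helper: record the memory layout of an intermediate
    # result as it flows through a "typical" calculation.
    print("%s: C=%s F=%s" % (label,
                             arr.flags['C_CONTIGUOUS'],
                             arr.flags['F_CONTIGUOUS']))
    return arr

a = np.zeros((100, 100))
# Adding a transposed array to a normal one: whether the result comes
# back C-ordered is exactly the policy question discussed above.
b = trace_layout("a + a.T", a + a.T)
# Users who require C order should ask for it explicitly:
c = trace_layout("ascontiguousarray", np.ascontiguousarray(b))
assert c.flags['C_CONTIGUOUS']
```

Running a suite of such traced calculations would show whether out-of-order arrays really do propagate indefinitely or get squeezed back into C order.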