On 26/10/2007, Travis E. Oliphant <[EMAIL PROTECTED]> wrote:

> There is an optimization where-in the inner-loops are done over the
> dimension with the smallest stride.
>
> What other cache-coherent optimizations do you recommend?
That sounds like a very good first step. I'm far from an expert on this
sort of thing, but here are a few ideas at random:

* internally flattening arrays when this doesn't affect the result
  (e.g. ones((10,10))+ones((10,10)))
* prefetching memory: in a C application I recently wrote, explicitly
  prefetching data for interpolation cut my runtime by 30%. This
  includes telling the processor when you're done with data so it can
  be purged from the cache. (Sketch 1 after my signature illustrates
  this.)
* aligning (some) arrays to 8-, 16-, 32-, or 64-byte boundaries so that
  they divide nicely into cache lines (sketch 2 below)
* using MMX/SSE instructions when available (sketch 3 below)
* combining ufuncs so that computations can keep the CPU busy while it
  waits for data to come in from main RAM (I realize that this is
  properly the domain of numexpr)
* using ATLAS- or FFTW-style autotuning to determine the fastest ways
  to structure computations (again, more relevant for substantial
  expressions than for simple ufuncs)
* reducing the use of temporaries in the interest of reducing traffic
  to main memory
* OpenMP parallel operations when this actually speeds up the
  calculation (sketch 4 below)

I realize most of these are a lot of work, and some of them are
probably in numpy already. Moreover, without using an expression parser
it's probably not feasible to implement others. But an array language
offers the possibility that a runtime can implement all sorts of
optimizations without effort on the user's part.

Anne
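P.S. Rough sketches of a few of these ideas, in C. Sketch 1,
prefetching: a minimal illustration using GCC's __builtin_prefetch
extension. The third argument 0 hints that the data has low temporal
locality, i.e. it can be dropped from the cache after use. The function
name and the 64-element prefetch distance are just made up for
illustration; the distance would need tuning per machine.

    #include <stddef.h>

    /* Sum a large array, prefetching well ahead of the loop.
     * __builtin_prefetch(addr, rw, locality): rw=0 means the data
     * will be read, locality=0 means "use once, no need to keep it
     * cached afterwards". */
    double sum_with_prefetch(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 64 < n)
                __builtin_prefetch(&a[i + 64], 0, 0);
            s += a[i];
        }
        return s;
    }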
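Sketch 2, alignment: allocating a buffer on a 64-byte (cache-line)
boundary with POSIX's posix_memalign. The buffer size is arbitrary, and
on non-POSIX platforms you'd reach for something like MSVC's
_aligned_malloc instead.

    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
        void *buf = NULL;
        /* The alignment must be a power of two and a multiple of
         * sizeof(void *); posix_memalign returns nonzero on failure. */
        if (posix_memalign(&buf, 64, 1000 * sizeof(double)) != 0) {
            fprintf(stderr, "aligned allocation failed\n");
            return 1;
        }
        double *a = buf;  /* a[0] now starts on a cache-line boundary */
        a[0] = 1.0;
        free(buf);
        return 0;
    }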
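Sketch 3, SSE: an elementwise add of two double arrays using SSE2
intrinsics, processing two doubles per instruction. The aligned loads
assume the pointers are 16-byte aligned (as in sketch 2) and that n is
even; a real implementation would need unaligned and tail-element
fallbacks.

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>

    /* out[i] = a[i] + b[i], two doubles at a time.
     * Assumes 16-byte-aligned pointers and even n. */
    void add_sse2(const double *a, const double *b, double *out,
                  size_t n)
    {
        for (size_t i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(&a[i]);   /* aligned loads */
            __m128d vb = _mm_load_pd(&b[i]);
            _mm_store_pd(&out[i], _mm_add_pd(va, vb));
        }
    }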
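Sketch 4, OpenMP: the same elementwise add parallelized with a single
pragma (compile with -fopenmp under gcc). Whether this actually wins
depends on array size and memory bandwidth; for small arrays the
threading overhead loses, which is why I said "when this actually
speeds up the calculation".

    #include <stddef.h>

    /* Elementwise add with the loop split across threads by OpenMP.
     * The loop index is signed because OpenMP 2.x requires it. */
    void add_parallel(const double *a, const double *b, double *out,
                      size_t n)
    {
        long i;
        #pragma omp parallel for
        for (i = 0; i < (long)n; i++)
            out[i] = a[i] + b[i];
    }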
