On 10/7/2010 3:48 PM, Anne Archibald wrote:
> Years ago MATLAB did just this - store real and complex parts of
> arrays separately (maybe it still does, I haven't used it in a long
> time). It caused us terrible performance headaches,
Performance headaches from Matlab years ago are not exactly relevant today, for several reasons. Other languages have made that choice too - APL, for example. Your performance problems there could have been caused (depending on the machine) by things like TLB congestion or other address translation issues (segmented memory). Most machines now and in the future are not going to choke on these issues, for a variety of reasons. Note that even as early as 1989 it was possible to use the memory mapping technique I described (in SunOS 4.0), and on that machine the ability to fill a register with all zeros in a single instruction meant there would have been almost no penalty for the address translation.

What matters now, and for the foreseeable future, is pretty much memory bandwidth - memory bandwidth at some cache level. Memory capacity and FPU bandwidth are not the limiting factors any more.

Let's be clear about an apples-to-apples comparison. Consider a very large array which sometimes acquires a nonzero imaginary part as a result of a calculation (and which is therefore not predictable when the code is written).

Strategy I is for the programmer to declare the array as complex from the beginning - thereby doubling the required memory bandwidth (at every level involved) for the entire computation. You double the cache footprint at every level, etc. You will pretty much eat the full factor-of-2 memory bandwidth penalty.

Strategy II is to keep the imaginary part of the array as a sparse (initially null) file. If the array never has a nonzero imaginary part, the memory bandwidth is the same at all levels, except perhaps at the level of the register file. This could conceivably change if the granularity of the file system changes to pages much larger than 4kB, but unless someone decides to use a page size larger than the L1 cache on their processor, it's really just the register file where you care.

Well, suppose you end up really caring. You can test, once per page, whether the imaginary page is the zero page, and call the real-only code instead - giving you essentially full bandwidth on the computation down to the granularity of a page. Since a page is 512 doubles, the overhead of this test is less than one branch instruction per 512 flops, and with modern CPU architecture that branch will be predicted, so normally there will be zero overhead for the test. In other words, if you know what you're doing, you should achieve full bandwidth. (A sketch of this per-page dispatch follows below.)

Of course, you probably have locality of reference in mind. There are at least two answers to any possible objection there. The first is that modern address translation is good enough to interleave two streams anyway - you can check this right now on your machine. Time how long it takes to add two very long vectors (say of length 2N) to produce a third. Then compare this to the time it takes to add two vectors of length N to produce a third, plus two other vectors of length N to produce a fourth. Unless those times are significantly different as N becomes very large, you have confirmed that your machine can translate addresses and shovel segregated real and imaginary parts through the FPU about as fast as it can shovel the corresponding interleaved values through the FPU. (A second sketch below shows this timing check.)
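To make the per-page dispatch of Strategy II concrete, here is a minimal NumPy sketch - just an illustration under the assumptions above, not production code. The file names and the two toy kernels (real_kernel, complex_kernel) are made up. It keeps the real and imaginary parts in separate memory-mapped files, never writes to the imaginary file up front (so on most file systems its unwritten pages are holes that read back as zeros), and tests one 4kB page of the imaginary part at a time before choosing the real-only or the complex code path:

    import numpy as np

    PAGE_DOUBLES = 4096 // 8      # 512 float64 values per 4 kB page

    def make_pair(path_re, path_im, n):
        # mode='w+' creates the backing files; the imaginary file is never
        # written here, so on most file systems its pages are holes that
        # simply read back as zeros.
        re_part = np.memmap(path_re, dtype=np.float64, mode='w+', shape=(n,))
        im_part = np.memmap(path_im, dtype=np.float64, mode='w+', shape=(n,))
        return re_part, im_part

    def real_kernel(x):
        # stand-in for the real-only code path: z**2 with z real
        return x * x

    def complex_kernel(re_p, im_p):
        # stand-in for the complex code path: z**2 = (re^2 - im^2) + 2j*re*im
        return re_p * re_p - im_p * im_p, 2.0 * re_p * im_p

    def process(re_in, im_in, re_out, im_out):
        n = len(re_in)
        for start in range(0, n, PAGE_DOUBLES):
            stop = min(start + PAGE_DOUBLES, n)
            im_page = im_in[start:stop]
            # One test per 512 doubles; a well-predicted branch when the
            # imaginary part is (as usual) all zero.
            if not im_page.any():
                re_out[start:stop] = real_kernel(re_in[start:stop])
                # im_out page is left untouched, i.e. it stays a hole on disk.
            else:
                re_out[start:stop], im_out[start:stop] = complex_kernel(
                    re_in[start:stop], im_page)

The test costs one well-predicted branch per 512 doubles, and as long as nothing writes a nonzero value, the imaginary file stays sparse and the real-only path runs at full bandwidth.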
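And here is the timing check from the locality argument above, again just a sketch in plain NumPy; N is arbitrary and should be chosen so the arrays are comfortably larger than the last-level cache:

    import time
    import numpy as np

    def best_time(fn, repeats=5):
        # take the best of a few runs to reduce timing noise
        best = float('inf')
        for _ in range(repeats):
            t0 = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - t0)
        return best

    N = 10_000_000    # make this large enough to be memory-bound on your box

    a2, b2, c2 = np.random.rand(2 * N), np.random.rand(2 * N), np.empty(2 * N)
    a,  b,  c  = np.random.rand(N), np.random.rand(N), np.empty(N)
    d,  e,  f  = np.random.rand(N), np.random.rand(N), np.empty(N)

    t_one_stream  = best_time(lambda: np.add(a2, b2, out=c2))
    t_two_streams = best_time(lambda: (np.add(a, b, out=c), np.add(d, e, out=f)))

    print("one add of length 2N:", t_one_stream)
    print("two adds of length N:", t_two_streams)

If the two printed times agree to within noise as N gets large, your machine is shoveling two segregated operand streams through the FPU about as fast as one interleaved stream of the same total size - which is the point.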
Well, OK, suppose you worry about N so big that having the real and imaginary parts in different parts of the file system is an issue - you could always interleave the PAGES of the real and imaginary parts, which eliminates address differences of more than one page length from consideration. That's answer number two.

> since it meant
> that individual complex values were spread over two different memory
> areas (parts of the disk in fact, since we were using gigabyte arrays
> in 1996), so that writing operations in the natural way tended to
> cause disk thrashing.

Yeah, you didn't understand memory mapping, which was available back in 1989. We were using 4GB arrays by 1990. I got really tired of evangelizing memory mapping to various communities (which I did a lot back then, including to the early Numerical Python community). So these days I just say "I told you so." Kevin Sheehan told you so too (his classic paper "Why aren't you using mmap() yet?" is from February 1996).

> As for what's right for numpy, well, I think it makes a lot more sense
> to simply raise an exception when assigning a complex value to a real
> array (or store a NaN).

That would at least allow a workaround. However, I want quiet NaNs to stay quiet; only signalling NaNs are allowed to get up in my grille.

Best Regards,
Andrew Mullhaupt
