Just thought I might share a real-life case study today. There's been a lot of talk about SIMD stuff lately, so some people might be interested.
Working on an Android product today, I noticed the matrix library was burning a ridiculous amount of our frame time. The disassembly looked like pretty normal ARM float code, so I (carefully) rewrote a couple of the key routines to use the VFPU, and our key device moved from 19fps -> 34fps (capped at 30, so we can now ship). The Galaxy S2 is now running at 170fps, and devices we previously considered unviable can now actually get a release! Most devices saw around a 25-45% speed improvement. Imagine if all the vector code throughout was using the vector hardware nicely, and not just one or two key functions... If we get the API right (intuitively encouraging proper usage and disallowing inefficient operations), it'll make a big difference!
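For anyone wondering what "using the vector hardware nicely" might look like for a matrix routine, here is a rough sketch in portable C. It is not the code from the product. The vec4 and mat4 types and every function name are hypothetical, and column-major storage is an assumption. The point is only the shape: the multiply is written entirely in whole-vector operations (splat, multiply, multiply-add), each of which a SIMD backend could replace with a single instruction, and there is no per-component poking in the hot loop.

```c
/* Hypothetical types and names for illustration only; not from the post. */
typedef struct { float x, y, z, w; } vec4;
typedef struct { vec4 col[4]; } mat4;   /* assumed column-major layout */

/* Broadcast one scalar into all four lanes. */
static vec4 vec4_splat(float s)
{
    vec4 r = { s, s, s, s };
    return r;
}

/* Lane-wise multiply: a * b. */
static vec4 vec4_mul(vec4 a, vec4 b)
{
    vec4 r = { a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w };
    return r;
}

/* Lane-wise multiply-add: a * b + c. */
static vec4 vec4_madd(vec4 a, vec4 b, vec4 c)
{
    vec4 r = { a.x * b.x + c.x, a.y * b.y + c.y,
               a.z * b.z + c.z, a.w * b.w + c.w };
    return r;
}

/* r = a * b. Each result column is a linear combination of a's columns,
 * weighted by the components of the matching column of b, so the loop
 * body uses only whole-vector operations. */
mat4 mat4_mul(const mat4 *a, const mat4 *b)
{
    mat4 r;
    for (int j = 0; j < 4; ++j) {
        vec4 bj  = b->col[j];
        vec4 acc = vec4_mul(a->col[0], vec4_splat(bj.x));
        acc = vec4_madd(a->col[1], vec4_splat(bj.y), acc);
        acc = vec4_madd(a->col[2], vec4_splat(bj.z), acc);
        acc = vec4_madd(a->col[3], vec4_splat(bj.w), acc);
        r.col[j] = acc;
    }
    return r;
}
```

In this scalar form a compiler may or may not vectorize it. The idea is that the three small helpers are the only places that would need a hardware-specific version, and an API that exposes just those kinds of operations steers users away from the slow scalar access patterns.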
