Hmmm. I disassembled my own code, and I'm finding essentially the same inner loops as yours. But the speed of the two approaches differs by a factor of 5 on my machine. Could it be related to cache sizes? Maybe my machine is consistently missing the cache on the dereference while yours is consistently hitting it.
I'll post my code tomorrow, but it's quite similar to yours. rif
