I question the results whenever something beats C for speed. Either the C code is poor, or something weird is going on.
Sure, get close to C or match it, but beat it? Then the faster code must be doing something in assembly that is better than what the C compiler generates (good for them, if that is the case).
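
To make the "poor C code" case concrete, here is a minimal sketch of how a C baseline can end up slower than it should be. Both function names are hypothetical and made up for illustration; they are not from any particular benchmark. The first version keeps strlen() in the loop condition, so the string can be rescanned on every iteration unless the optimizer happens to hoist the call. The second does the same job in a single pass.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical slow baseline: strlen() is evaluated in the loop condition,
   so the work can become quadratic in the string length unless the
   compiler hoists the call out of the loop. */
size_t count_char_slow(const char *s, char c)
{
    size_t n = 0;
    for (size_t i = 0; i < strlen(s); i++) {
        if (s[i] == c)
            n++;
    }
    return n;
}

/* Same result in one pass: walk the string until the terminating NUL. */
size_t count_char_fast(const char *s, char c)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (*s == c)
            n++;
    }
    return n;
}
```

If a benchmark's C entry looks like the first function, another language "beating C" may say more about the baseline than about the language.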
