Well, now I am amazed, but in the opposite direction.
I need to check whether I have done something really wrong, but I have managed to get 6074 fps using float32 (I was expecting 3000 fps) and 10700 fps using int32, just by taking advantage of multithreading.
