On my codes, pre-register-allocation (pre-RA) instruction scheduling on x86-64 (a) improves run times by roughly 10%, and (b) costs a lot of compile time.
The -fschedule-insns option didn't seem to be on in your timing tests (I think it's not on by default on that architecture at -O2).

Brad