The TSC has lots of platform-related issues. It is not guaranteed to be synchronized across physical packages, and AMD boxes in particular have lots of problems.
Why doesn't delay_ms just use nanosleep() and let the OS worry about it? On a related note, I have found that putting the worker (non-master) threads into a real-time scheduling class also helps.
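
To make the nanosleep() suggestion concrete, here is a rough sketch of what I have in mind. The name sleep_ms is made up for illustration and is not the existing delay_ms; I am assuming a POSIX system and that the caller can tolerate the scheduler's wakeup latency instead of a busy-wait.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <time.h>

/*
 * Hypothetical TSC-free millisecond delay (illustrative name, not the
 * existing delay_ms). Sleeps for at least `ms` milliseconds.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int sleep_ms(unsigned int ms)
{
    struct timespec req = {
        .tv_sec  = ms / 1000,
        .tv_nsec = (long)(ms % 1000) * 1000000L,
    };
    struct timespec rem;

    /* If a signal interrupts the sleep, resume with the time remaining. */
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;
    }
    return 0;
}
```

The trade-off is that the thread gives up the CPU, so the actual delay can overshoot by however long the kernel takes to reschedule it. That is acceptable for coarse millisecond delays but probably not for very short ones.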