I've narrowed the problem down to a single line of code: a `calloc()` of 2MB to 20MB, called many times. My guess is that the standard malloc/calloc takes a lock for large allocations.
So there are probably several solutions:

1. Rewrite the C code in Nim, which uses thread-local allocation by default.
2. Use a different memory manager than the Linux/gcc default.
3. Use multiprocessing (starting with Araq's suggestion).

This was probably not a case of "false sharing". Still, I don't quite understand why the same code is fast in the main thread. Wouldn't the large allocation still use a lock there?
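
For anyone who wants to reproduce this, here is a minimal sketch of the pattern in C, assuming pthreads are available. The names `churn_args`, `churn`, `churn_in_threads` and `compare_main_vs_threads` are made up for illustration and are not from the actual code. It runs the same calloc/free loop once in the main thread and then in several threads at the same time, and prints the wall-clock time of each.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical parameters for one worker; names are illustrative only. */
typedef struct {
    size_t min_bytes;  /* smallest block to request */
    size_t max_bytes;  /* largest block to request */
    int    rounds;     /* number of calloc/free pairs */
    int    ok;         /* out: allocations that succeeded */
} churn_args;

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_sec(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Repeatedly calloc() and free() blocks sized from min_bytes up to max_bytes. */
static void *churn(void *p)
{
    churn_args *a = p;
    assert(a->min_bytes > 0 && a->max_bytes >= a->min_bytes && a->rounds > 0);
    size_t step = (a->max_bytes - a->min_bytes) /
                  (size_t)(a->rounds > 1 ? a->rounds - 1 : 1);
    a->ok = 0;
    for (int i = 0; i < a->rounds; i++) {
        size_t n = a->min_bytes + step * (size_t)i;
        volatile unsigned char *buf = calloc(n, 1);
        if (buf == NULL)
            continue;
        buf[n - 1] = 1;  /* touch the block so the allocation is really used */
        a->ok++;
        free((void *)buf);
    }
    return NULL;
}

/* Run nthreads workers concurrently; returns the total number of
   successful allocations, or -1 if a thread could not be created. */
static int churn_in_threads(int nthreads, churn_args proto)
{
    assert(nthreads > 0 && nthreads <= 64);
    pthread_t tid[64];
    churn_args args[64];
    int started = 0, total = 0;
    for (int i = 0; i < nthreads; i++) {
        args[i] = proto;
        if (pthread_create(&tid[i], NULL, churn, &args[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += args[i].ok;
    }
    return started == nthreads ? total : -1;
}

/* Time the loop in the main thread, then in nthreads concurrent threads. */
static int compare_main_vs_threads(int nthreads, churn_args proto)
{
    double t0 = now_sec();
    churn(&proto);  /* baseline: main thread only */
    double t1 = now_sec();
    int total = churn_in_threads(nthreads, proto);
    double t2 = now_sec();
    printf("main thread: %.3f s, %d threads: %.3f s\n",
           t1 - t0, nthreads, t2 - t1);
    return total;
}
```

Calling something like `compare_main_vs_threads(4, (churn_args){2u << 20, 20u << 20, 200, 0})` from `main` and building with `-pthread` should be enough to see whether the threaded run takes much longer than the main-thread run even though each thread does the same amount of work. The thread count and round count here are arbitrary guesses, not measured values.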
