On 27/12/2021 12:10 AM, max haughton wrote:
I would start by removing the use of stdout in your loop kernel - I'm
not familiar with what you are calculating, but if you can basically
have the (parallel) loop operate from (say) one array directly into
another then you can get extremely good parallel scaling with almost no
effort.
Not using it in the actual loop should make the code faster even without
threads, because having a function call in the hot code means the
compiler's optimizer will give up on certain transformations - i.e. do
all the work as compactly as possible, then output the data in one step
at the end.
It'll speed it up significantly.
Standard IO has locks in it. So you end up with all calculations
grinding to a halt, waiting for another thread to finish doing something.
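
A rough sketch of that shape, in C++ since I don't know what your actual
calculation looks like. Everything here is made up for illustration:
kernel() is a placeholder for your real per-element work, and
compute_all() / format_all() are hypothetical helper names. The point is
only that each thread writes into its own slice of the output array, and
nothing touches stdout until all the work is done.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-element calculation.
static double kernel(double x) { return x * x + 1.0; }

// Fill out[i] = kernel(in[i]) using up to `workers` threads. Each thread
// owns a disjoint slice of `out`, so the hot loop takes no locks and does
// no I/O.
std::vector<double> compute_all(const std::vector<double>& in, unsigned workers) {
    std::vector<double> out(in.size());
    if (workers == 0) workers = 1;
    const std::size_t chunk = (in.size() + workers - 1) / workers;
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(in.size(), w * chunk);
        const std::size_t end = std::min(in.size(), begin + chunk);
        if (begin == end) break;  // nothing left for this or later workers
        pool.emplace_back([&in, &out, begin, end] {
            for (std::size_t i = begin; i < end; ++i) out[i] = kernel(in[i]);
        });
    }
    for (auto& t : pool) t.join();
    return out;
}

// Format every result into one in-memory buffer so the caller can hand it
// to stdout in a single write after the parallel part has finished.
std::string format_all(const std::vector<double>& values) {
    std::string buf;
    char line[64];
    for (double v : values) {
        const int n = std::snprintf(line, sizeof line, "%g\n", v);
        assert(n > 0 && static_cast<std::size_t>(n) < sizeof line);
        buf.append(line, static_cast<std::size_t>(n));
    }
    return buf;
}
```

You would then call something like
std::fputs(format_all(compute_all(input, 4)).c_str(), stdout) once at the
end, so the stdio lock is taken a single time instead of once per
element per thread.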