Based on your benchmark description, and the fact that it works when you're 
using refc, you probably aren't really doing much multithreading; your 
benchmark's real execution model is probably not much different from "run x 
separate processes". If you start actually sharing memory, I suspect you'll 
run into different trade-offs versus a shared heap. Hard to say without 
knowing what code is running, though.

I suspect you're only at the start of this problem, not at its conclusion. You 
have identified different behavior on ORC vs refc, but do you know why? That 
is the interesting question. You could start by comparing single-threaded ORC 
vs refc to see whether the difference comes from threading, locks, or 
allocation behavior. You could also see what happens if you strip the program 
down and just benchmark the async APIs (a sketch of what that might look like 
follows below). That would make a good reproduction that others can run and 
experiment with.
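To make the idea concrete, here is a minimal sketch of such a stripped-down 
benchmark. It only exercises future creation and the dispatcher, not your 
application code; the file name, proc name, and counts are made up for 
illustration, and older Nim versions spell the memory-management switch 
--gc:orc / --gc:refc rather than --mm.

    # bench_async.nim -- minimal sketch; all names and counts are illustrative.
    # Compile and compare, for example:
    #   nim c -d:release --mm:orc  bench_async.nim && ./bench_async
    #   nim c -d:release --mm:refc bench_async.nim && ./bench_async
    import std/[asyncdispatch, monotimes, times]

    proc spinTask(iterations: int) {.async.} =
      ## Repeatedly yields to the dispatcher, so the measurement is
      ## dominated by future/dispatcher overhead rather than user code.
      for _ in 0 ..< iterations:
        await sleepAsync(0)

    proc main() =
      const tasks = 1_000
      const iterations = 1_000
      let start = getMonoTime()
      var futs: seq[Future[void]]
      for _ in 0 ..< tasks:
        futs.add spinTask(iterations)
      waitFor all(futs)
      let elapsed = getMonoTime() - start
      echo "tasks=", tasks, " iterations=", iterations,
        " elapsed=", elapsed.inMilliseconds, " ms"

    main()

Running the same file with only the --mm flag changed gives a single-threaded 
ORC vs refc comparison that isolates the async machinery from the rest of 
your program.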

Another aspect is that async has a long history and was written with a focus 
on refc back when that was the only option. Software is not immune to its 
context, so it would not surprise me if getting ORC faster than refc in 
heavily async code took some effort; that will not happen by accident. It is 
entirely possible that ORC can be much better than refc, but it takes 
identifying the problem and some engineering to address it.

On the topic of benchmarking, have you run your program under a profiler to 
identify where the time is actually being spent? VTune or AMD's uProf can 
show which code paths account for most of the time, and the results can be 
very eye-opening.
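If you go that route, it helps to compile with native debug info and frame 
pointers so the profiler can attribute samples back to Nim procs. Something 
along these lines should work, though double-check the exact flag spellings 
against your Nim and C compiler versions:

    nim c -d:release --debugger:native --passC:-fno-omit-frame-pointer --mm:orc  bench_async.nim
    nim c -d:release --debugger:native --passC:-fno-omit-frame-pointer --mm:refc bench_async.nim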

Basically, the details matter here; flipping a compiler flag and expecting 
nothing else to matter is, unfortunately, a bit unrealistic. I think you have 
identified an excellent problem, though, and perhaps it will be easier to fix 
than one might expect.

If you want others to help, please share an exact file to run, along with 
exact instructions for setting up the compilation. Neither I nor anyone else 
has much interest in guessing, so please make it easy for us.
