Based on your benchmark description, and the fact that it works under refc, you probably aren't doing much real multithreading; your benchmark's execution model is likely no different from "run x separate processes". If you start actually sharing memory, I suspect you'll run into different trade-offs versus a shared heap. I don't know what code is running, though.
I'd suggest you're at the start of this problem, not its conclusion. You've identified different behavior under ORC vs refc, but do you know why? It's a good question. You could start with single-threaded ORC vs refc runs to see whether the difference is related to threading, locks, or allocation behavior. You could also strip the program down and just benchmark the async APIs (a rough sketch of that is below); that would give others a reproducible case to run and experiment with.

Another aspect: async has a long history and was written with a focus on refc back when it was the only option. Software is not immune to its context, so it wouldn't surprise me if getting ORC faster than refc in heavily async code takes some effort; it won't happen by accident. It's entirely possible that ORC can be much better than refc here, but that requires identifying the problem and doing some engineering to address it.

On benchmarking: have you run a profiler to identify where the time is being spent in your program? VTune or AMD's uProf can show which code paths are taking up all the time, and the results can be very eye-opening. The details matter here, and just flipping a compiler flag while expecting nothing else to matter is unfortunately a bit unrealistic.

That said, I think you've identified an excellent problem, and perhaps it will be easier to fix than one might expect. If you want others to help, please share an exact file to run with exact instructions on how to compile it. Neither I nor anyone else wants to just guess, so please make it easy for us.
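For the "benchmark just the async APIs" idea, something along these lines, compiled once per memory manager, would isolate dispatcher and future-allocation overhead from the rest of your program. This is only a minimal sketch under my own assumptions: the proc name `tick`, the iteration counts, and the file name `bench.nim` are made up, so adapt them until the shape resembles your real workload.

```nim
# bench.nim -- hypothetical minimal async micro-benchmark (not your program)
import std/[asyncdispatch, times, monotimes]

proc tick(n: int) {.async.} =
  ## Repeatedly yields to the dispatcher: pure event-loop and
  ## future-allocation work, no application logic.
  for _ in 0 ..< n:
    await sleepAsync(0)

proc main() =
  let start = getMonoTime()
  var futs: seq[Future[void]]
  for _ in 0 ..< 100:          # 100 concurrent "tasks"
    futs.add tick(10_000)      # each yields 10_000 times
  waitFor all(futs)
  echo "elapsed: ", getMonoTime() - start

main()
```

Then compare the two memory managers directly (on older Nim versions the switch is `--gc:` rather than `--mm:`):

```
nim c -d:release --mm:refc bench.nim && ./bench
nim c -d:release --mm:orc  bench.nim && ./bench
```

If a toy like this already shows the gap, it is a much easier thing for others to dig into than a full application.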