Hi,

I just submitted a WIP patch showing my current status. I've finished unifying the three queues and reducing the number of execution paths. From now on, I will shrink the locked region so that, in the end, only the queue accesses are locked. Once this is done, splitting the queues and implementing work-stealing will follow.

The link below is a simple benchmark result from the version submitted to gcc-patches.

https://imgur.com/IvaBDwT

The benchmark problem is computing the LU decomposition of an NxN matrix. PLASMA [1] is a task-parallel linear algebra library. The upstream version of PLASMA uses OpenMP's task scheduling system.

Looking at the results, the '2nd eval' version (the currently submitted patch) surpasses the upstream version's performance past N=4096. Apparently, unifying the queues improved performance despite the more frequent mutex lock/unlock operations.

Ray Kim

[1] https://bitbucket.org/icl/plasma/src