Hi,
I just submitted a WIP patch that reflects my current status.
I've finished unifying the three queues and reducing the execution paths.
Next, I will shrink the locked region so that, in the end, only the queue 
accesses are locked.
Once this is done, splitting the queues and implementing work-stealing will 
follow.
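
To illustrate what "only the queue accesses are locked" means, here is a
minimal sketch using plain pthreads. It is not the code in the patch: the
names task_t, task_queue_t, queue_push, queue_pop, run_all and increment are
hypothetical and do not correspond to libgomp's internal structures. The
mutex is held only while linking or unlinking a task, and the task body runs
with the lock released.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical task type for illustration; not libgomp's own task struct. */
typedef struct task {
    void (*fn)(void *);   /* task body */
    void *arg;            /* argument passed to the body */
    struct task *next;    /* intrusive FIFO link */
} task_t;

/* A single unified FIFO queue guarded by one mutex. */
typedef struct {
    pthread_mutex_t lock;
    task_t *head;
    task_t *tail;
} task_queue_t;

static void queue_init(task_queue_t *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head = q->tail = NULL;
}

/* The lock covers only the pointer updates that link the task in. */
static void queue_push(task_queue_t *q, task_t *t)
{
    assert(t != NULL);
    t->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
    pthread_mutex_unlock(&q->lock);
}

/* The lock covers only the unlinking; returns NULL if the queue is empty. */
static task_t *queue_pop(task_queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    task_t *t = q->head;
    if (t) {
        q->head = t->next;
        if (!q->head)
            q->tail = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    return t;
}

/* Worker loop: each task body runs outside the locked region.
   Returns the number of tasks executed. */
static int run_all(task_queue_t *q)
{
    int executed = 0;
    task_t *t;
    while ((t = queue_pop(q)) != NULL) {
        t->fn(t->arg);
        executed++;
    }
    return executed;
}

/* Example task body: increments the int that arg points to. */
static void increment(void *arg)
{
    ++*(int *)arg;
}
```

A caller would initialize the queue with queue_init, push tasks such as
{ increment, &counter, NULL } with queue_push, and let each worker thread
call run_all. With one queue per thread, the same queue_pop could later be
pointed at another thread's queue as a starting point for work-stealing.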
 
The link below shows a simple benchmark result for the version submitted to 
gcc-patches.
https://imgur.com/IvaBDwT
The benchmark problem is computing the LU decomposition of an NxN matrix.
PLASMA [1] is a task-parallel linear algebra library.
The upstream version of PLASMA uses OpenMP's task scheduling system.
 
Looking at the results, the '2nd eval' version (the currently submitted patch) 
surpasses the upstream version's performance past N=4096. 
Apparently, unifying the queues improved performance despite the 
more frequent mutex lock/unlock operations.
 
Ray Kim.
 
[1] https://bitbucket.org/icl/plasma/src
