On Mon, Feb 17, 2014 at 11:59 PM, Kimo Crossman <[email protected]> wrote:
> "Fence-Free Work Stealing on Bounded TSO Processors", ASPLOS '14
> https://www.cs.technion.ac.il/people/mad/online-publications/asplos2014-ffwsq.pdf
>
> Abstract
> Work stealing is the method of choice for load balancing
> in task parallel programming languages and frameworks.
> Yet despite considerable effort invested in optimizing work
> stealing task queues, existing algorithms issue a costly memory
> fence when removing a task, and these fences are believed
> to be necessary for correctness.
> This paper refutes this belief, demonstrating work stealing
> algorithms in which a worker does not issue a memory
> fence for microarchitectures with a bounded total store ordering
> (TSO) memory model. Bounded TSO is a novel restriction
> of TSO – capturing mainstream x86 and SPARC
> TSO processors – that bounds the number of stores a load
> can be reordered with.
> Our algorithms eliminate the memory fence penalty, improving
> the running time of a suite of parallel benchmarks
> on modern x86 multicore processors by 7%–11% on average
> (and up to 23%), compared to the Cilk and Chase-Lev
> work stealing queues.
Thanks, Kimo, for posting it here! This looks like a very interesting and novel approach. I think I can see how and why it works.

Some random comments:

The approach does not seem to be extensible to other synchronization algorithms (in particular mutexes), because it inherently requires state distribution (N items available at the same time). However, it is extensible to, say, a distributed object (which is essentially the same as a work-stealing deque).

If I want to use it in library code, I am not sure how I can reliably determine the store buffer bound.

When the size of the deque drops below the bound, they effectively switch to work-requesting.
(1) I suspect that the benchmarks effectively benchmark work-requesting in their case.
(2) Work-requesting works well for the nicely computational workloads they used, but is not applicable in other domains; e.g. if a task spawns 2 other tasks, one generally wants the tasks to execute in parallel, while with their algorithm both tasks will execute sequentially.
(3) It would be interesting to try to combine their approach with traditional work-stealing, i.e. size < bound: pay for the barrier; size > bound: don't pay for the barrier.

It seems to be well suited for the parallel-for construct. Potentially it would allow reducing per-iteration overhead to basically zero. However, you would need some fake stores, because a parallel for usually stores to only 1 memory location (the iteration space), so their logic won't work as is.

The idea of using static code analysis to determine the minimum number of stores per task looks too... academic to me.

-- 

--- 
You received this message because you are subscribed to the Google Groups "Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/lock-free/CAEeQi3vX5Va4_r6P4i0SF0d8SSVRyazB5L08zMVq%2BJCDzjjS_w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
