On Mon, Feb 17, 2014 at 11:59 PM, Kimo Crossman <[email protected]> wrote:
> "Fence-Free Work Stealing on Bounded TSO Processors" ASPLOS '14
> https://www.cs.technion.ac.il/people/mad/online-publications/asplos2014-ffwsq.pdf
>
> Abstract
> Work stealing is the method of choice for load balancing
> in task parallel programming languages and frameworks.
> Yet despite considerable effort invested in optimizing work
> stealing task queues, existing algorithms issue a costly memory
> fence when removing a task, and these fences are believed
> to be necessary for correctness.
> This paper refutes this belief, demonstrating work stealing
> algorithms in which a worker does not issue a memory
> fence for microarchitectures with a bounded total store ordering
> (TSO) memory model. Bounded TSO is a novel restriction
> of TSO – capturing mainstream x86 and SPARC
> TSO processors – that bounds the number of stores a load
> can be reordered with.
> Our algorithms eliminate the memory fence penalty, improving
> the running time of a suite of parallel benchmarks
> on modern x86 multicore processors by 7%-11% on average
> (and up to 23%), compared to the Cilk and Chase-Lev
> work stealing queues.


Thanks, Kimo, for posting it here!


This looks like a very interesting and novel approach. I think I can
see how and why it works.

Some random comments:
The approach does not seem to be extensible to other synchronization
algorithms (in particular mutexes), because it inherently requires
state distribution (N items available at the same time). However,
it is extensible to, say, a distributed object (which is, well,
essentially the same as a work-stealing deque).

If I want to use it in library code, I am not sure how I can
reliably determine the store buffer bound.

When the size of the deque drops below the bound, they effectively
switch to work-requesting. (1) I suspect that the benchmarks effectively
benchmark work-requesting in their case. (2) Work-requesting works
well for the compute-heavy workloads that they used, but is not
applicable in other domains; e.g. if a task spawns 2 other tasks, one
generally wants the tasks to execute in parallel, while with their
algorithm both tasks will execute sequentially. (3) It would be
interesting to try to combine their approach with traditional
work-stealing, i.e. size<bound => pay for the barrier, size>bound =>
don't pay for the barrier.
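To make point (3) concrete, here is a minimal single-owner sketch of such a hybrid, based on a simplified Chase-Lev-style deque. The names (Deque, BOUND) and the chosen bound value are hypothetical; this is an illustration of the size<bound/size>bound split, not the paper's algorithm:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical hybrid take(): pay the seq_cst fence only when the deque
// may be nearly empty (size <= BOUND); skip it on the fast path otherwise.
constexpr long BOUND = 4;        // assumed store-buffer bound (illustrative)
constexpr size_t CAP = 1024;

struct Deque {
    std::atomic<long> top{0};     // advanced by thieves in steal()
    std::atomic<long> bottom{0};  // advanced by the owner in push()/take()
    long buf[CAP];

    void push(long v) {
        long b = bottom.load(std::memory_order_relaxed);
        buf[b % CAP] = v;
        bottom.store(b + 1, std::memory_order_release);
    }

    // Owner-side pop; returns true and sets *out if a task was taken.
    bool take(long* out) {
        long b = bottom.load(std::memory_order_relaxed) - 1;
        bottom.store(b, std::memory_order_relaxed);
        long t = top.load(std::memory_order_relaxed);
        if (b - t > BOUND) {
            // Far from empty: fence-free path (relies on bounded reordering).
            *out = buf[b % CAP];
            return true;
        }
        // Near empty: classic fenced Chase-Lev path.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        t = top.load(std::memory_order_relaxed);
        if (t > b) {                           // empty: restore bottom
            bottom.store(b + 1, std::memory_order_relaxed);
            return false;
        }
        *out = buf[b % CAP];
        if (t == b) {                          // last element: race thieves
            bool won = top.compare_exchange_strong(t, t + 1);
            bottom.store(b + 1, std::memory_order_relaxed);
            return won;
        }
        return true;
    }
};
```

The owner pays the barrier only in the contended near-empty region, which is exactly where the fence-free argument stops applying.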

It seems to be well suited for the parallel-for construct. Potentially it
would allow reducing per-iteration overhead to basically zero.
However, you would need some fake stores, because a parallel-for body
usually stores to only one memory location (the iteration space), so
their logic won't work as is.
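A sketch of the fake-stores idea: a chunked parallel-for body whose real work updates essentially one location, followed by a few dummy stores per chunk so that enough stores separate successive deque operations. The function name, the pad array, and the FAKE_STORES count are all hypothetical, and the bound value is illustrative:

```cpp
#include <cstddef>

// Hypothetical workaround: a parallel-for chunk that stores to only one
// location does not by itself generate the stores the bounded-TSO argument
// needs, so we issue a few extra "fake" stores per chunk.
constexpr int FAKE_STORES = 4;         // assumed store-buffer bound
static volatile long pad[FAKE_STORES]; // dummy store targets

long sum_range(const long* data, size_t begin, size_t end) {
    long acc = 0;
    for (size_t i = begin; i < end; ++i)
        acc += data[i];                // real work: effectively one location
    for (int k = 0; k < FAKE_STORES; ++k)
        pad[k] = acc;                  // fake stores issued per chunk
    return acc;
}
```

The dummy stores cost a few cycles per chunk, which is amortized over the whole iteration range rather than paid per iteration.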

The idea of using static code analysis to determine the minimum number of
stores per task looks too... academic to me.
