I have been trying to find out how many branch prediction units there are 
in a typical Haswell, but so far I haven't found a definitive answer. Is 
there a single unit per core, per socket, or per processor? I have been 
pondering the implications of this for how to architect threaded 
applications. For example, I have an application whose worker threads do 
something like this:

// A single worker thread: runs every stage (foo, bar, baz) on each event.
while (true) {
  waitForIncomingEventsOnRingBuffer();
  for (Event& event : availableEvents()) {  // events claimed from the ring
    foo(event);
    bar(event);
    baz(event);
  }
}

A work producer produces events and round-robins between the worker 
threads; each worker completely processes the events it receives, so each 
event is processed by only a single thread. One reason for doing things 
this way is that once an event is in cache, it makes sense to process it 
all the way through.
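A minimal sketch of that round-robin distribution (the Event type and the 
per-worker queues are illustrative stand-ins, not names from my real code):

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Illustrative event type; stands in for whatever the real application uses.
struct Event { int id; };

// The producer round-robins: each event goes, in its entirety, to exactly
// one worker's queue, so that worker runs foo, bar and baz on it while the
// event is still warm in that core's cache.
void distributeRoundRobin(const std::vector<Event>& events,
                          std::vector<std::deque<Event>>& workerQueues) {
  std::size_t next = 0;
  for (const Event& e : events) {
    workerQueues[next].push_back(e);
    next = (next + 1) % workerQueues.size();
  }
}
```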

Now imagine that each function has quite a few branches. If the branch 
prediction unit becomes a bottleneck, we might take quite a few pipeline 
flushes from mispredictions. Alternatively we could:

i) Have the master thread put all events on a single shared buffer.
ii) Have each worker thread walk this buffer and process every event, but 
only call one function (foo, bar or baz). Assuming there are no ordering 
dependencies between the functions this should be fine. For example, 
imagine foo persists to disk, bar calculates some statistics and baz does 
the actual processing. The producer only considers a slot empty once all 
three consumers have advanced their respective cursors beyond said slot.
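The slot-reclamation rule in (ii) can be sketched like this; representing 
the cursors as monotonically increasing sequence numbers over a 
power-of-two-sized ring is my assumption (disruptor-style), not a 
description of any particular library:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Each of the three consumers (foo, bar, baz) keeps its own cursor: a
// monotonically increasing sequence number of the last slot it has consumed.
// The producer may reuse the slot at (producerSeq % capacity) only while it
// stays less than one full ring ahead of the slowest consumer, i.e. the
// producer never laps the consumer that is furthest behind.
bool producerMayClaim(std::uint64_t producerSeq,
                      const std::array<std::uint64_t, 3>& consumerSeqs,
                      std::uint64_t capacity) {
  std::uint64_t slowest =
      *std::min_element(consumerSeqs.begin(), consumerSeqs.end());
  return producerSeq < slowest + capacity;
}
```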

Now in this scenario, if there were indeed a branch prediction unit per 
core, and assuming each thread maps onto its own core, we could possibly 
get better branch prediction since less code is executed on each thread. 
There is a secondary benefit too: each function will probably bring in 
some cache lines, so if we split the functions across threads, each gets 
its own L1 and L2 to itself. On the other hand, if we execute every 
function on every thread, they cause more cache churn.
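For the "each thread maps onto a core" assumption to hold, the threads 
would typically be pinned explicitly. A Linux-specific sketch (this is my 
addition, not something from the application above):

```cpp
#include <pthread.h>
#include <sched.h>

// Linux-specific: pin the calling thread to a single core so that its stage
// keeps that core's branch-predictor state and L1/L2 contents to itself.
// Returns true on success.
bool pinCurrentThreadToCore(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set) == 0;
}
```

Each worker would call this with its own core index at startup.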

All of this is conjecture of course, and real measurement would be the 
best way to know. I just wanted to get an understanding of the kind of 
reasons people use to decide:

i) When to go for a pipelined architecture (one stage = one specific 
function) vs a parallel one (parallel workers each running all the 
functions). Better branch prediction might be one of the concerns.
ii) In a pipelined architecture, how to decide where one stage begins and 
the next ends, i.e. how to break up the processing of a single unit of 
work across stages/threads.

Thanks!

You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/fe4c6bfb-4a66-4075-be68-a5fd77eff359%40googlegroups.com.