I have been trying to find out how many branch prediction units there are
in a typical Haswell, but so far haven't gotten a definitive answer. Is
there a single unit per core, per socket, or per processor? I have been
pondering the implications of this for how to architect threaded
applications. For example, I have an application whose worker threads do
something like this:
// A single worker thread: wait for events, then fully process each one
// (foo -> bar -> baz) while it is still hot in cache.
while (true) {
    waitForIncomingEventsOnRingBuffer();
    for (auto& event : pendingEvents()) {
        foo(event);
        bar(event);
        baz(event);
    }
}
A producer generates events and round-robins them across the worker
threads; each worker completely processes the events it receives, so each
event is handled by a single thread. One reason for doing things this way
is that once an event is in cache, it makes sense to process it all the
way through.
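For concreteness, the dispatch scheme could be sketched like this (the event type, worker count, and stage bodies are placeholders for illustration, not the real application code):

```cpp
#include <cassert>
#include <cstdint>

// Placeholder event; the real payload is irrelevant to the sketch.
struct Event {
    uint64_t seq;    // sequence number assigned by the producer
    int touched = 0; // counts how many stages have run on this event
};

// Stand-ins for the three processing stages.
void foo(Event& e) { ++e.touched; } // e.g. persist to disk
void bar(Event& e) { ++e.touched; } // e.g. update statistics
void baz(Event& e) { ++e.touched; } // the actual processing

constexpr int kWorkers = 4;

// The producer round-robins: event n lands on worker n % kWorkers.
int ownerOf(const Event& e) { return static_cast<int>(e.seq % kWorkers); }

// The owning worker then runs the whole chain on the event while it is
// still hot in that core's cache.
void processFully(Event& e) { foo(e); bar(e); baz(e); }
```

With this layout no event is ever touched by two threads, so there is no cross-thread synchronization on the event itself, only on the per-worker ring buffer.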
Now imagine that each function has quite a few branches. If the branch
prediction unit becomes a bottleneck, we might suffer quite a few pipeline
flushes from mispredictions. Alternatively we could:
i) Have the master thread put all events on a single buffer.
ii) Have each worker thread process every event on this buffer, but have
each thread call only one of the functions (foo, bar, or baz). Assuming
there are no ordering dependencies between the functions, this should be
fine. For
example imagine foo is persisting to disk, bar is calculating some
statistics and baz is doing the actual processing. The producer only
considers a slot empty if all three consumers have incremented their
respective cursors beyond said slot.
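The cursor bookkeeping in (ii) could look roughly like this disruptor-style sketch (the ring size and names are illustrative, the slot storage is omitted, and a real version would need the appropriate memory orderings rather than the default sequentially consistent ones):

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint64_t kSize = 1024; // ring capacity (power of two), illustrative

// One shared ring with one cursor per consumer. Consumer 0 only ever calls
// foo, consumer 1 only bar, consumer 2 only baz; each walks every slot.
struct MultiCursorRing {
    std::atomic<uint64_t> cursors[3]; // next slot each consumer will read
    std::atomic<uint64_t> produced;   // next slot the producer will write

    MultiCursorRing() : produced(0) {
        for (auto& c : cursors) c.store(0);
    }

    // The producer may only reuse a slot once *all* consumers have moved
    // past it: there must be fewer than kSize slots between the write
    // position and the slowest cursor.
    bool canProduce() const {
        uint64_t slowest = cursors[0].load();
        for (const auto& c : cursors) slowest = std::min(slowest, c.load());
        return produced.load() - slowest < kSize;
    }

    // Consumer i has finished running its one function on its current slot.
    void advance(int i) { cursors[i].fetch_add(1); }
};
```

The producer only ever reads the consumer cursors and the consumers only ever read the producer cursor, so each counter has a single writer.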
Now in this scenario, if there were indeed a branch prediction unit per
core, and assuming each thread is pinned to its own core, we might get
better branch prediction, since each thread executes less code and so
competes for fewer predictor entries. There is a secondary benefit too:
each function will probably bring in some cache lines, so if we split the
functions across threads, each thread keeps its own working set in its
private L1 and L2. If instead every thread executes all three functions,
we get more cache churn.
All of this is conjecture of course and real measurement would be the best
way to know. I just wanted to get an understanding of the kind of reasons
people weigh when deciding between:
i) A pipelined architecture (one stage = one specific function) vs. a
parallel one (identical workers, each executing all the functions). Better
branch prediction might be one of the concerns.
ii) In a pipelined architecture, how does one decide where one stage ends
and the next begins, i.e. how to break up the processing of a single unit
of work across stages/threads?
Thanks!
--
You received this message because you are subscribed to the Google Groups
"Scalable Synchronization Algorithms" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/lock-free/fe4c6bfb-4a66-4075-be68-a5fd77eff359%40googlegroups.com.