Hello All,
I'm new to the ispc mailing list but I've been following the project
itself with great interest since its first public release. I've been
interested in SIMD parallelism for years at this point (since I read the
first public spec for the AVX2 instruction set) and have a particular
interest in using it to more efficiently perform tasks which, while
inherently parallel, fall outside the computation-heavy-inner-loop category
that, as low-hanging fruit, received the most attention early on as CPUs
began to include wider and more general SIMD units.
Does anyone know if there is a way to induce ispc to emit the
*vpconflictd* instruction (or its equivalent on targets without such an
instruction)? If not, I think it would be a worthwhile thing to add to the
Cross-Program Instance Operations already supported, in the form of a
function that, given a *varying* parameter, would return a lanemask of
"conflicting" lower-numbered lanes.
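To make the semantics I have in mind concrete, here is a scalar model of
what such a cross-lane operation would compute (plain C++ rather than ispc;
the function name *conflict_mask* is purely my own, illustrative choice):
for each lane, a bitmask identifying the lower-numbered lanes that hold the
same value, which is exactly what *vpconflictd* produces per 32-bit
element.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of the proposed primitive (name is hypothetical):
// for each lane i, set bit j (j < i) iff values[j] == values[i].
// This mirrors what vpconflictd computes for each 32-bit element.
template <std::size_t N>
std::array<uint32_t, N> conflict_mask(const std::array<uint32_t, N>& values) {
    std::array<uint32_t, N> mask{};
    for (std::size_t i = 0; i < N; ++i) {
        uint32_t m = 0;
        for (std::size_t j = 0; j < i; ++j) {
            if (values[j] == values[i])
                m |= (1u << j);  // lane i conflicts with earlier lane j
        }
        mask[i] = m;
    }
    return mask;
}
```

A lane whose mask is zero has no conflicts with any earlier lane; a nonzero
mask tells you exactly which earlier lanes must complete before that lane
may proceed.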
The SPMD model is particularly attractive for realizing
significant performance gains for a certain class of problem where the core
activity involves pulling incoming events off a queue, looking up the
correct state machine instance, and processing the event in the context of
that instance, potentially producing an output event and
generally updating the state machine instance as a result. Some practical
examples of this type of problem are:
- Packet processing (where the state machine that is looked up and
updated may be the connection the packet belongs to for instance).
- Simulations (where the state machine instances could be the individual
agents whose behavior the simulation models).
- Scanning streams of events for specified patterns (where each state
machine instance is a potential match-in-progress).
- and many more...
One peculiarity of this class of application is that while it is
*almost* "embarrassingly parallel", there is the problem of data-dependent
resource conflicts. If you're processing packets, for instance, and you
have thousands of active connections, then *most of the time* you can grab
the next eight packets off the queue and each will belong to a different
connection (thus they can be processed completely in parallel).
Occasionally, however, you will grab the next eight packets off the queue
and discover that two of them belong to the same connection, so they must
be serialized with respect to one another to avoid incorrect results (i.e.
the state machine for that connection must be updated with the result of
processing the first packet *before attempting to process the second packet*,
since the first packet may initiate a state transition that completely
changes the context in which the second must be evaluated).
Unlike many problems to which SIMD parallelism is applied, this class
of problem does not have static loop-carried dependencies that are
predictable in advance. Rather, it has dynamic, data-dependent
dependencies which are generally rare, but correctness requires that they
be identified so that no two events that require serialization are
processed concurrently. Intel was clearly aware of the degree to which
this class of problem would benefit from SIMD parallelism and in the
AVX-512CD instruction set extension they provided instructions like
*vpconflictd*/*vpconflictq* which, in combination with a few other AVX-512
instructions, provide an effective way to detect when it is required to
serialize two work items that contend for the same resource. (For
instance, if you're looking up state machine instances in a hash table you
could use *vpconflictd* to detect when two lanes are going to require
access to the same bucket). Detecting such lane-to-lane hazards early and
dispatching a batch as two smaller batches to avoid the hazard allows the
program logic for processing each event to remain simple and efficient.
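As a sketch of that dispatch strategy (again in plain C++ for clarity; the
function name is my own, and a real implementation would derive the
partition from the conflict masks rather than from hash sets), one can
greedily split a batch of resource IDs — hash-table bucket indices, say —
into passes such that no pass contains a duplicate, with repeated IDs
landing in successive passes in their original order:

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Greedily partition a batch of resource ids (e.g. hash-bucket indices)
// into passes that each contain no duplicate id. Multiple events for the
// same resource land in consecutive passes in their original order, so
// executing the passes one after another serializes exactly the
// conflicting events while everything else stays parallel.
std::vector<std::vector<uint32_t>>
split_conflict_free(const std::vector<uint32_t>& ids) {
    std::vector<std::vector<uint32_t>> passes;
    std::vector<std::unordered_set<uint32_t>> seen;
    for (uint32_t id : ids) {
        std::size_t p = 0;
        // First pass that does not already contain this id.
        while (p < passes.size() && seen[p].count(id)) ++p;
        if (p == passes.size()) {
            passes.emplace_back();
            seen.emplace_back();
        }
        passes[p].push_back(id);
        seen[p].insert(id);
    }
    return passes;
}
```

In the common case (no duplicates in the batch) this produces a single
full-width pass, so the fast path pays almost nothing for the check.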
One might wonder why it makes sense to go through the effort of
parallelizing a seemingly computationally trivial task like updating simple
state machines in light of a stream of events. There are two reasons which
surprised me quite a bit when I first started looking into this:
1. When you're processing a stream of events each of which can update
one state machine instance out of thousands, it doesn't take too many state
machine instances to reach the point where your cache hit rate in looking
up the state machines becomes unavoidably low. This means that beyond some
threshold number of state machine instances your event processing rate is
ultimately limited by DRAM latency. If you can use a gather to look up
eight state machines from your hash table in parallel the component loads
which the CPU breaks the gather into end up, on average, being dispatched
to different DRAM channels and banks. In my experiments, the upshot is
that fetching eight items at random from a gigabyte-sized hash table
takes, on average, only about 2x as long as fetching a single item with a
normal scalar load instruction.
2. Many simple state machines (if implemented in the most elementary
way, with *if*/*else* or *switch*/*case* constructs) end up being
computationally trivial but very "branchy" which is especially hard on
modern CPUs with deep pipelines and a relatively high cost for mispredicted
branches. This makes it advantageous to re-state (no pun intended) these
state machines in terms of data flow rather than control flow: e.g. use
look-up tables, or compute both sides of an *if*/*else* and use a
conditional move or similar to select the correct one (modern CPUs can
easily exploit the instruction-level parallelism generated by computing
both branches and selecting one later). Once you've already re-stated most
of your control flow in terms of data flow, it is only a small step to use
SIMD instructions to process multiple event+instance pairs in parallel.
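As a small illustration of that restatement (plain C++; the 3-state,
2-event machine here is entirely made up for the example), a transition
table turns a nest of *switch*/*case* branches into a single
data-dependent load per event — and that load is precisely the shape that
becomes one gather when several event+instance pairs are processed in
parallel:

```cpp
#include <cstdint>

// A made-up 3-state, 2-event machine expressed as data rather than
// control flow: next_state[state][event] replaces branchy switch/case
// logic. One table lookup per event; for N machines stepped in lockstep
// this becomes a single gather instead of N unpredictable branch chains.
constexpr int NUM_STATES = 3;
constexpr int NUM_EVENTS = 2;

constexpr uint8_t next_state[NUM_STATES][NUM_EVENTS] = {
    /* state 0 */ {1, 0},
    /* state 1 */ {2, 0},
    /* state 2 */ {2, 1},
};

// Advance one machine through a stream of events with no branches that
// depend on the event values themselves.
uint8_t run(uint8_t state, const uint8_t* events, int n) {
    for (int i = 0; i < n; ++i)
        state = next_state[state][events[i]];
    return state;
}
```

The same table works unchanged whether you step one machine or eight: the
per-lane state and event just become the indices of a gather.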
I feel that as more of the low-hanging fruit is picked, and as SIMD
instruction sets that are more general and closer to orthogonal (compare
AVX-512 with SSE, for instance) become more widely available, it will
become more and more important to apply this sort of parallelism to
general-purpose computing tasks as well (rather than only number crunching).
Cheers,
-lars
--
You received this message because you are subscribed to the Google Groups
"Intel SPMD Program Compiler Users" group.