Hello All,

    I'm new to the ispc mailing list but I've been following the project 
itself with great interest since its first public release.  I've been 
interested in using SIMD parallelism for years at this point (since I read 
the first public spec for the AVX2 instruction set) and have a particular 
interest in using it to more efficiently perform tasks which, while 
inherently parallel, fall outside the computation-heavy-inner-loop category 
which, as low-hanging fruit, received the most initial attention as CPUs 
began to include wider and more general SIMD units.

    Does anyone know if there is a way to induce ispc to emit the 
*vpconflictd* instruction (or its equivalent on targets without such an 
instruction)?  If not, I think it would be a worthwhile thing to add to the 
Cross-Program Instance Operations already supported, in the form of a 
function that, given a *varying* parameter, would return a lanemask of 
"conflicting" lower-numbered lanes.

    The SPMD model is particularly attractive for realizing 
significant performance gains for a certain class of problem where the core 
activity involves pulling incoming events off a queue, looking up the 
correct state machine instance, and processing the event in the context of 
that instance, potentially producing an output event and 
generally updating the state machine instance as a result.  Some practical 
examples of this type of problem are:

   - Packet processing (where the state machine that is looked up and 
   updated may be the connection the packet belongs to for instance).
   - Simulations (where the state machine instances could be the individual 
   agents whose behavior the simulation models).
   - Scanning streams of events for specified patterns (where each state 
   machine instance is a potential match-in-progress).
   - and many more...

    One peculiarity of this class of application is that while it is 
*almost* "embarrassingly parallel", there is the problem of data-dependent 
resource conflicts.  If you're processing packets, for instance, and you 
have thousands of active connections, then *most of the time* you can grab 
the next eight packets off the queue and each will belong to a different 
connection (thus they can be processed completely in parallel).  Sometimes, 
however rarely, you will grab the next eight packets off the queue and 
discover that two of them belong to the same connection and thus they must 
be serialized with respect to one another to avoid incorrect results (i.e. 
the state machine for that connection must be updated with the result of 
processing the first packet *before attempting to process the second packet* 
since the first packet may initiate a state transition that fully changes 
the context in which the second must be evaluated).

    Unlike many problems to which SIMD parallelism is applied, this class 
of problem does not have static loop-carried dependencies that are 
predictable in advance.  Rather, it has dynamic (data-dependent) 
dependencies which are generally rare, but correctness requires that they 
be identified so that no two events that require serialization are 
processed concurrently.  Intel was clearly aware of the degree to which 
this class of problem would benefit from SIMD parallelism and in the 
AVX-512CD instruction set extension they provided instructions like 
*vpconflictd*/*vpconflictq* which, in combination with a few other AVX-512 
instructions, provide an effective way to detect when it is required to 
serialize two work items based on their shared need of a resource.  (For 
instance, if you're looking up state machine instances in a hash table you 
could use *vpconflictd* to detect when two lanes are going to require 
access to the same bucket).  Detecting such lane-to-lane hazards early and 
dispatching a batch as two smaller batches to avoid the hazard allows the 
program logic for processing each event to remain simple and efficient.

    One might wonder why it makes sense to go through the effort of 
parallelizing a seemingly computationally trivial task like updating simple 
state machines in response to a stream of events.  There are two reasons, 
both of which surprised me quite a bit when I first started looking into this:

   1. When you're processing a stream of events each of which can update 
   one state machine instance out of thousands, it doesn't take too many state 
   machine instances to reach the point where your cache hit rate in looking 
   up the state machines becomes unavoidably low.  This means that beyond some 
   threshold number of state machine instances your event processing rate is 
   ultimately limited by DRAM latency.  If you can use a gather to look up 
   eight state machines from your hash table in parallel the component loads 
   which the CPU breaks the gather into end up, on average, being dispatched 
   to different DRAM channels and banks.  My experiments have shown that the 
   upshot of this is that the average time it takes to fetch eight items at 
   random from a gigabyte-sized hash table is only about 2x the average time 
   to fetch a single item with a normal scalar load instruction.
   2. Many simple state machines (if implemented in the most elementary 
   way, with *if*/*else* or *switch*/*case* constructs) end up being 
   computationally trivial but very "branchy" which is especially hard on 
   modern CPUs with deep pipelines and a relatively high cost for mispredicted 
   branches.  This makes it advantageous to re-state (no pun intended) these 
   state machines in terms of data flow rather than control flow (e.g. use 
   look-up tables, or compute both sides of an *if*/*else* and use a 
   conditional move or similar to select the correct one; modern CPUs can 
   easily exploit the instruction-level parallelism generated by computing 
   both branches and selecting one later).  Once you've re-stated most of your 
   control flow in terms of data flow, it is only a small step to use 
   SIMD instructions to process multiple event+instance pairs in parallel.

    I feel that as more of the low-hanging fruit is picked, and as SIMD 
instruction sets that are more general and closer to orthogonal (compare 
AVX-512 with SSE, for instance) become more widely available, it will become 
more and more important to apply this sort of parallelism to general-purpose 
computing tasks as well (rather than only number crunching).

        Cheers,
                -lars
