Christian Bell wrote:
On Wed, 31 May 2006, Mark Hahn wrote:

execution models to share instruction code, but splitting L2 data
across cores is bound to be a destructive use of the cache in any
data parallel model.  Obviously, user control of the cache is a large
"data parallel model" basically means you're streaming in/out of dram,
right? why are these cases not nicely covered by the placement instructions introduced with mmx and its follow-ons? you can control how a load or store behaves wrt the different levels of cache. IIRC, Intel introduced some new stuff to make the cache shared by cores more effective this way (per-core victim traffic writes through?)

Data parallel in that the cores will execute roughly the same
instructions, but on disjoint data sets.  Since it's unlikely for the
granularity of a partitioned data set to be less than a cache line,
I see multiple problems for the compiler here. The first is the one you imply: to share the cache effectively, the n threads navigating the loop must coordinate their loads, regardless of the type of load (prefetched, simple scalar, and/or SSE/vector). We would like the compiler to take the inter-thread implications of the global loop's load requirements into account and avoid redundant and/or destructive loads.
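One way to make that coordination concrete is to fix the partitioning so that no cache line is ever split between threads. A sketch (the 64-byte line size and the `owner` helper are my illustrative assumptions, not anything from the thread):

```c
#include <stddef.h>

#define LINE_BYTES 64  /* assumed cache-line size */

/* Block-cyclic ownership: element i of a double array belongs to the
 * thread that owns its whole cache line, so no line is shared between
 * threads and no thread's load drags in a neighbor's data. */
int owner(size_t i, int nthreads)
{
    size_t elems_per_line = LINE_BYTES / sizeof(double);  /* 8 doubles */
    return (int)((i / elems_per_line) % (size_t)nthreads);
}
```

A compiler that knows the line size could emit exactly this partition instead of a naive stride-1 interleaving, which is the destructive case described next.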

The difficulty would seem to be multiplied in the SSE/vector load case (which we want to make use of for bandwidth efficiency reasons) because one thread could pull in a second thread's input data as it loads a two- (or four-) word-wide vector, while the second thread, running in the neighboring core, redundantly does the same but advanced by one stride.
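The redundancy can be counted directly. Under a cyclic (per-thread stride-1) partitioning, a wide vector load inevitably spans lanes owned by both threads; the helper below (my illustration) counts how many lanes of a vector load are actually useful to the issuing thread:

```c
#include <stddef.h>

/* Under cyclic partitioning (thread t owns index i where
 * i % nthreads == t), count how many lanes of a vlen-wide vector load
 * starting at `base` actually belong to thread t.  With 2 threads and
 * 4-wide loads, half of every vector load is the neighbor's data --
 * which the neighboring core loads again, one stride later. */
int useful_lanes(size_t base, int vlen, int t, int nthreads)
{
    int n = 0;
    for (int k = 0; k < vlen; k++)
        if ((int)((base + (size_t)k) % (size_t)nthreads) == t)
            n++;
    return n;
}
```

With a single thread every lane is useful; with two threads on 4-wide loads, each core wastes half its load bandwidth on words the other core also loads.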

Microarchitectures with both thread and vector capabilities must consider the loop's work load (in particular its loads) as a two-dimensional problem sized 'thread-count-by-vector-length' (an approach taken in the VTA and X1E architectures). I would be interested in hearing from compiler folks on how this problem is/would be handled. Thread-specific loop unrolling would seem to be useful (giving one thread compute responsibility for the vector of data it loads). Then there is the issue of dependencies both within and across threads. This says nothing about managing such vector/thread loads across the partitioned global address space abstraction offered by the UPC and CAF parallel programming extensions.
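That thread-specific unrolling can be sketched as tiling the iteration space by (thread-count x vector-length): thread t takes the t-th vector of every tile, so each thread computes on exactly the vectors it loads. The names and the VLEN of 4 below are my illustrative assumptions:

```c
#include <stddef.h>

#define VLEN 4  /* assumed vector width, e.g. 4 floats per SSE load */

/* 'thread-count-by-vector-length' unrolling sketch: thread t owns the
 * t-th VLEN-wide vector of every (nthreads * VLEN)-element tile, so no
 * thread ever loads another thread's words.  Records the start index
 * of each vector assigned to thread t into `starts`. */
void thread_tile(int t, int nthreads, size_t n,
                 size_t *starts, size_t *count)
{
    size_t tile = (size_t)nthreads * VLEN;
    size_t c = 0;
    for (size_t base = (size_t)t * VLEN; base + VLEN <= n; base += tile)
        starts[c++] = base;
    *count = c;
}
```

The per-thread start lists are pairwise disjoint and together tile the whole array, which is exactly the property the scalar cyclic partitioning above lacks.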

     rbw

--

Richard B. Walsh

Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
[EMAIL PROTECTED]  |  612.337.3467

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf