Christian Bell wrote:
On Wed, 31 May 2006, Mark Hahn wrote:

execution models to share instruction code, but splitting L2 data
across cores is bound to be a destructive use of the cache in any
data parallel model.  Obviously, user control of the cache is a large
"data parallel model" basically means you're streaming in/out of dram,
right? why are these cases not nicely covered by the placement instructions introduced with mmx and its follow-ons? you can control how a load or store behaves wrt the different levels of cache. IIRC, Intel introduced some new stuff to make the cache shared by cores more effective this way (per-core victim traffic writes through?)

Data parallel in that the cores will execute roughly the same
instructions, but on disjoint data sets.  Since it's unlikely for the
granularity of a partitioned data set to be less than a cache line,
I see multiple problems for the compiler here. The first is the one you imply: to share the cache effectively, the n threads navigating the loop must coordinate their loads, regardless of the type of load (prefetched, simple scalar, and/or SSE/vector). We would like the compiler to take the inter-thread implications of the global loop's load requirements into account and avoid redundant and/or destructive loads.
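One way to make that coordination concrete is to fix the partitioning so that no cache line is ever split between threads. A sketch (the 64-byte line size and the `owner` helper are my illustrative assumptions, not anything from the thread):

```c
#include <stddef.h>

#define LINE_BYTES 64  /* assumed cache-line size */

/* Block-cyclic ownership: element i of a double array belongs to the
 * thread that owns its whole cache line, so no line is shared between
 * threads and no thread's load drags in a neighbor's data. */
int owner(size_t i, int nthreads)
{
    size_t elems_per_line = LINE_BYTES / sizeof(double);  /* 8 doubles */
    return (int)((i / elems_per_line) % (size_t)nthreads);
}
```

A compiler that knows the line size could emit exactly this partition instead of a naive stride-1 interleaving, which is the destructive case described next.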

The difficulty would seem to be multiplied in the SSE/vector load case (which we want to make use of for bandwidth efficiency reasons) because one thread could pull in a second thread's input data as it loads a two- (or four-) word-wide vector, while the second thread, running in the neighboring core, redundantly does the same but advanced by one stride.
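The redundancy can be counted directly. Under a cyclic (per-thread stride-1) partitioning, a wide vector load inevitably spans lanes owned by both threads; the helper below (my illustration) counts how many lanes of a vector load are actually useful to the issuing thread:

```c
#include <stddef.h>

/* Under cyclic partitioning (thread t owns index i where
 * i % nthreads == t), count how many lanes of a vlen-wide vector load
 * starting at `base` actually belong to thread t.  With 2 threads and
 * 4-wide loads, half of every vector load is the neighbor's data --
 * which the neighboring core loads again, one stride later. */
int useful_lanes(size_t base, int vlen, int t, int nthreads)
{
    int n = 0;
    for (int k = 0; k < vlen; k++)
        if ((int)((base + (size_t)k) % (size_t)nthreads) == t)
            n++;
    return n;
}
```

With a single thread every lane is useful; with two threads on 4-wide loads, each core wastes half its load bandwidth on words the other core also loads.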

Microarchitectures with both thread and vector capabilities must consider the loop's work load (in particular its loads) as a two-dimensional problem sized 'thread-count-by-vector-length' (an approach taken in the VTA and X1E architectures). I would be interested in hearing from compiler folks on how this problem is/would be handled. Thread-specific loop unrolling would seem to be useful (giving one thread compute responsibility for the vector of data it loads). Then there is the issue of dependencies both within and across threads. This says nothing about managing such vector/thread loads across the partitioned global address space abstraction offered by the UPC and CAF parallel programming extensions.
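That thread-specific unrolling can be sketched as tiling the iteration space by (thread-count x vector-length): thread t takes the t-th vector of every tile, so each thread computes on exactly the vectors it loads. The names and the VLEN of 4 below are my illustrative assumptions:

```c
#include <stddef.h>

#define VLEN 4  /* assumed vector width, e.g. 4 floats per SSE load */

/* 'thread-count-by-vector-length' unrolling sketch: thread t owns the
 * t-th VLEN-wide vector of every (nthreads * VLEN)-element tile, so no
 * thread ever loads another thread's words.  Records the start index
 * of each vector assigned to thread t into `starts`. */
void thread_tile(int t, int nthreads, size_t n,
                 size_t *starts, size_t *count)
{
    size_t tile = (size_t)nthreads * VLEN;
    size_t c = 0;
    for (size_t base = (size_t)t * VLEN; base + VLEN <= n; base += tile)
        starts[c++] = base;
    *count = c;
}
```

The per-thread start lists are pairwise disjoint and together tile the whole array, which is exactly the property the scalar cyclic partitioning above lacks.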

     rbw

--

Richard B. Walsh

Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
[EMAIL PROTECTED]  |  612.337.3467

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf