CUDA: Part 1: Memory

Barry Smith Sat, 6 Oct 2012 21:52:57 -0500

On Oct 6, 2012, at 9:16 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Hi Barry,
> 
>>    Let's see if we can lift this discussion up another level and "treat" 
>> multi-core threading more specifically in the discussion (though Karl's 
>> subject name is Unification approach for OpenMP/Threads/... he largely 
>> ignores the multi-core/multi-socket aspect).
> 
> Probably I should have called the discussion 'Part 1: Memory Handles', yet 
> I'm fine with considering multi-core issues as well.
> 
> 
>>     Abstractly a node has
>> 
>> 1)  a bunch of memories (some may be "nested" as caches "standing in" for 
>> parts of larger caches which "stand in" for parts of "main memory". )  In 
>> general, even without GPUs there are multiple memory sockets (though 
>> generally handled by the OS as a single unified address space),
>> 
>> 2) a bunch of compute "thingies". In general, even without GPUs there are 
>> multiple CPUs, and each one of those likely has "regular" floating point 
>> units plus SIMD units.
>> 
>> 
>> A) Shri has started coding up a runtime dispatch system for computations on 
>> multiple cores (which hides differences between PThreads and OpenMP) that 
>> (currently) assumes Vecs are stored in a single array (each thread accesses 
>> the array pointer via VecGetArray() and then "its" part of the array by an 
>> offset.) (BTW: what if each of these VecGetArray() triggered a copy up from 
>> a GPU, probably a mess).  When using PThreads Shir's model allows (to some 
>> degree) the asynchronous launching of computational tasks.
>> 
> I've discussed multi-core related topics (NUMA, first-touch) a little with 
> Shri. As the operating system performs the allocation 'automagically' (per 
> default), a single virtually linear piece of memory pretty much performs 
> reasonably. I didn't dare to start a discussion of handling buffers in main 
> memory similarly to std::deque, i.e. as a collection of individually 
> allocated pages, and try to keep track of locality informations. This, 
> however, would completely break any XYZGetArray() code, as the function 
> enforces a large chunk of linear memory again.
> 
> Still, one may 'fake' a std::deque by holding meta-information about the 
> physical memory pages nevertheless, yet allocate a (virtually) linear piece 
> of memory to keep compatibility with XYZGetArray(). This would allow some 
> nice optimizations for the threading scheduler, as threads may operate on 
> 'nearer' pages first.
> 
> 
>> B) We have a different dispatch system for using a single GPU accelerator 
>> via CUDA that "automagically" handles copying data back and forth from 
>> memories via VecXXXGetArray(). It is synchronous on the GetArray() in  that 
>> is always blocks on the GetArray() until the data is there and then moves on 
>> to the computation.
> 
> I'm afraid that the return type of VecXXXGetArray(), i.e. a pointer to the 
> data, is such a strong requirement that one cannot relax the blocking 
> transfer here.
> 
> However, we could use the thread communicator, schedule an asynchronous 
> memory transfer via a non-blocking VecXXXGetArrayRequest() returning an event 
> object, possibly perform some other operations in the meanwhile, and finally 
> sync to the event object at the time we actually start modifying the array.
> 
> 
>> C) We are considering options for using OpenCL kernels.
> 
> Btw: Shri's threading communicator is almost identical to the OpenCL model 
> (with the latter having a few additional capabilities).
> 
> 
>> D) We have not seriously considering utilizing both GPUs and core processors 
>> for floating point intensive computations at the same time, either on the 
>> "same" object computation or completely different object computations. (note 
>> that DOE bought this huge machine at ORNL that seems to require this).
>> 
>>   Ideally we'd have a "single" high performing programming model for 
>> utilizing the resources of (1-2) regardless of details.
> 
> I'm pretty confident that feeding operations (including GPU operations) into 
> the task queue of the thread communicator will give good results.


    Problem already solved http://dl.acm.org/citation.cfm?id=2145863   :-)


> 
> 
>> 
>>    Now, lets go to Karl's "Part 1: Memory" which is a good place to start.   
>> In PETSc we basically have two data types, a Vec which is relatively easy to 
>> abstract about and a Mat which is not.  Let's focus just on the Vec now 
>> because Mat's are hard.
>> 
>>    We need to "divide up" the computation on a Vec (or several Vecs and 
>> Mats) so that the different compute "thingies" can work on their "piece", 
>> this division of the computation naturally is associated with a "division" 
>> of the data  (the division may actually be only abstract with pthreads or it 
>> may be concrete with two GPUs when "half" of the vector is copied to each 
>> GPU's memory (sorry Jed, I agree with Karl that we likely shouldn't hide 
>> this issue behind MPI)).  The "division" is non-overlapping in simple cases 
>> (like axpy()) or may require "ghosting" for  sparse matrix-vector products 
>> (again the division my only be abstract).  With multi-memory-socket 
>> multi-core we actually divide the vector data across physical memories but 
>> access it via virtual memory as not divided up for ghost points etc.  I 
>> think the "special cases" like virtual memory make it harder for us to think 
>> about this abstractly then it should be.
>> 
>>    In PETSc we use the abstract object IS to indicate parts of 
>> Vecs\footnote.  Thus if a computation requires part of a vector it is 
>> natural to pass into the function the Vec AND THE IS indicating that part of 
>> the Vec needed. Note that Shri's use of code such as i=trstarts[thread_id] 
>> is actually a particular type of IS (hardwired for performance).
> 
> A bit of a spoiler for the actual job runtime (more brainstorming than 
> complete suggestions):
> I can imagine submitting the Vec, the IS, and the type of job to the 
> scheduler, possibly including some hints on the type of operations to follow. 
> One may even enforce a certain type of device here, even though this requires 
> the scheduler to move the data in place first. In this way one can perform 
> smaller tasks on the respective CPU core (if we keep track of affinity 
> information), and offload larger tasks to an available accelerator if 
> possible. (Note that this is the main reason why I don't want to hide buffers 
> in library-specific derived classes of Vec). The scheduler can use simple 
> heuristics on where to perform operations based on typical latencies (e.g. 
> ~20us for a GPU kernel)
> 
> 
>>    So, could we use a single kernel launcher for multi-core, CUDA, OpenCL 
>> based on this principle? Then VecCUDAGetArray() type things would keep track 
>> of parts of Vecs based on IS instead of all entries in the Vec.  Similarly 
>> there would be a VecMultiCoreGetArray(). Whenever possible the 
>> VecXXXGetArray() would not require copies.    As part of this model I'd also 
>> like to separate the "moving needed data" part of the kernel from the 
>> "computation on the data" so that everything doesn't block when data is 
>> being moved around.
> 
> Yes, we could/can. A single kernel launcher also allows for fusing kernels, 
> e.g. matrix-vector-product followed by an inner product of the result vector 
> with some other vector. As outlined above, asynchronous data movement could 
> even be the default rather than the exception, except for cases where one 
> gives control over the data to the outside by e.g. returning a pointer to the 
> array. In such cases one would have first wait for all operations on all data 
> to finish.
> 
> The main concern in all that is the readiness of the user. Awareness for 
> asynchronous operations keeps rising, yet I can imagine user code like
> 
> PetscScalar * data = VecXYZGetArray(v1); // flushes the queue suitably
> data[0] = VecDot(v2, v3);                // enqueues VecDot
> PetscScalar s = data[0];                 // VecDot may not be finished!
> 
> where a pointer given away once undermines everything.
> 
> 
>>    Ok, how about moving this same model up to the MPI level? We already do 
>> this with IS converted to VecScatter (for performance) for updating ghost 
>> points (for matrix-vector products, for PDE ghost points etc) (note we can 
>> hide the VecScatter inside the IS and have it created as needed).
> 
> I'm afraid I can't contribute to the MPI discussion yet, as I don't know 
> enough about things are handled now...
> 
> Best regards,
> Karli
> 
>

[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Reply via email to