On Oct 6, 2012, at 9:16 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
> Hi Barry,
>
>> Let's see if we can lift this discussion up another level and "treat"
>> multi-core threading more specifically in the discussion (though Karl's
>> subject name is Unification approach for OpenMP/Threads/... he largely
>> ignores the multi-core/multi-socket aspect).
>
> Probably I should have called the discussion 'Part 1: Memory Handles', yet
> I'm fine with considering multi-core issues as well.
>
>
>> Abstractly a node has
>>
>> 1) a bunch of memories (some may be "nested" as caches "standing in" for
>> parts of larger caches which "stand in" for parts of "main memory". ) In
>> general, even without GPUs there are multiple memory sockets (though
>> generally handled by the OS as a single unified address space),
>>
>> 2) a bunch of compute "thingies". In general, even without GPUs there are
>> multiple CPUs, and each one of those likely has "regular" floating point
>> units plus SIMD units.
>>
>>
>> A) Shri has started coding up a runtime dispatch system for computations on
>> multiple cores (which hides differences between PThreads and OpenMP) that
>> (currently) assumes Vecs are stored in a single array (each thread accesses
>> the array pointer via VecGetArray() and then "its" part of the array by an
>> offset.) (BTW: what if each of these VecGetArray() triggered a copy up from
>> a GPU, probably a mess). When using PThreads Shir's model allows (to some
>> degree) the asynchronous launching of computational tasks.
>>
> I've discussed multi-core related topics (NUMA, first-touch) a little with
> Shri. As the operating system performs the allocation 'automagically' (per
> default), a single virtually linear piece of memory pretty much performs
> reasonably. I didn't dare to start a discussion of handling buffers in main
> memory similarly to std::deque, i.e. as a collection of individually
> allocated pages, and try to keep track of locality informations. This,
> however, would completely break any XYZGetArray() code, as the function
> enforces a large chunk of linear memory again.
>
> Still, one may 'fake' a std::deque by holding meta-information about the
> physical memory pages nevertheless, yet allocate a (virtually) linear piece
> of memory to keep compatibility with XYZGetArray(). This would allow some
> nice optimizations for the threading scheduler, as threads may operate on
> 'nearer' pages first.
>
>
>> B) We have a different dispatch system for using a single GPU accelerator
>> via CUDA that "automagically" handles copying data back and forth from
>> memories via VecXXXGetArray(). It is synchronous on the GetArray() in that
>> is always blocks on the GetArray() until the data is there and then moves on
>> to the computation.
>
> I'm afraid that the return type of VecXXXGetArray(), i.e. a pointer to the
> data, is such a strong requirement that one cannot relax the blocking
> transfer here.
>
> However, we could use the thread communicator, schedule an asynchronous
> memory transfer via a non-blocking VecXXXGetArrayRequest() returning an event
> object, possibly perform some other operations in the meanwhile, and finally
> sync to the event object at the time we actually start modifying the array.
>
>
>> C) We are considering options for using OpenCL kernels.
>
> Btw: Shri's threading communicator is almost identical to the OpenCL model
> (with the latter having a few additional capabilities).
>
>
>> D) We have not seriously considering utilizing both GPUs and core processors
>> for floating point intensive computations at the same time, either on the
>> "same" object computation or completely different object computations. (note
>> that DOE bought this huge machine at ORNL that seems to require this).
>>
>> Ideally we'd have a "single" high performing programming model for
>> utilizing the resources of (1-2) regardless of details.
>
> I'm pretty confident that feeding operations (including GPU operations) into
> the task queue of the thread communicator will give good results.
Problem already solved http://dl.acm.org/citation.cfm?id=2145863 :-)
>
>
>>
>> Now, lets go to Karl's "Part 1: Memory" which is a good place to start.
>> In PETSc we basically have two data types, a Vec which is relatively easy to
>> abstract about and a Mat which is not. Let's focus just on the Vec now
>> because Mat's are hard.
>>
>> We need to "divide up" the computation on a Vec (or several Vecs and
>> Mats) so that the different compute "thingies" can work on their "piece",
>> this division of the computation naturally is associated with a "division"
>> of the data (the division may actually be only abstract with pthreads or it
>> may be concrete with two GPUs when "half" of the vector is copied to each
>> GPU's memory (sorry Jed, I agree with Karl that we likely shouldn't hide
>> this issue behind MPI)). The "division" is non-overlapping in simple cases
>> (like axpy()) or may require "ghosting" for sparse matrix-vector products
>> (again the division my only be abstract). With multi-memory-socket
>> multi-core we actually divide the vector data across physical memories but
>> access it via virtual memory as not divided up for ghost points etc. I
>> think the "special cases" like virtual memory make it harder for us to think
>> about this abstractly then it should be.
>>
>> In PETSc we use the abstract object IS to indicate parts of
>> Vecs\footnote. Thus if a computation requires part of a vector it is
>> natural to pass into the function the Vec AND THE IS indicating that part of
>> the Vec needed. Note that Shri's use of code such as i=trstarts[thread_id]
>> is actually a particular type of IS (hardwired for performance).
>
> A bit of a spoiler for the actual job runtime (more brainstorming than
> complete suggestions):
> I can imagine submitting the Vec, the IS, and the type of job to the
> scheduler, possibly including some hints on the type of operations to follow.
> One may even enforce a certain type of device here, even though this requires
> the scheduler to move the data in place first. In this way one can perform
> smaller tasks on the respective CPU core (if we keep track of affinity
> information), and offload larger tasks to an available accelerator if
> possible. (Note that this is the main reason why I don't want to hide buffers
> in library-specific derived classes of Vec). The scheduler can use simple
> heuristics on where to perform operations based on typical latencies (e.g.
> ~20us for a GPU kernel)
>
>
>> So, could we use a single kernel launcher for multi-core, CUDA, OpenCL
>> based on this principle? Then VecCUDAGetArray() type things would keep track
>> of parts of Vecs based on IS instead of all entries in the Vec. Similarly
>> there would be a VecMultiCoreGetArray(). Whenever possible the
>> VecXXXGetArray() would not require copies. As part of this model I'd also
>> like to separate the "moving needed data" part of the kernel from the
>> "computation on the data" so that everything doesn't block when data is
>> being moved around.
>
> Yes, we could/can. A single kernel launcher also allows for fusing kernels,
> e.g. matrix-vector-product followed by an inner product of the result vector
> with some other vector. As outlined above, asynchronous data movement could
> even be the default rather than the exception, except for cases where one
> gives control over the data to the outside by e.g. returning a pointer to the
> array. In such cases one would have first wait for all operations on all data
> to finish.
>
> The main concern in all that is the readiness of the user. Awareness for
> asynchronous operations keeps rising, yet I can imagine user code like
>
> PetscScalar * data = VecXYZGetArray(v1); // flushes the queue suitably
> data[0] = VecDot(v2, v3); // enqueues VecDot
> PetscScalar s = data[0]; // VecDot may not be finished!
>
> where a pointer given away once undermines everything.
>
>
>> Ok, how about moving this same model up to the MPI level? We already do
>> this with IS converted to VecScatter (for performance) for updating ghost
>> points (for matrix-vector products, for PDE ghost points etc) (note we can
>> hide the VecScatter inside the IS and have it created as needed).
>
> I'm afraid I can't contribute to the MPI discussion yet, as I don't know
> enough about things are handled now...
>
> Best regards,
> Karli
>
>