Resent to the group.

-----Original Message-----
From: Kenneth A. Lloyd [mailto:kenneth.ll...@wattsys.com]
Sent: Tuesday, January 22, 2013 6:05 AM
To: 'Brice Goglin'
Subject: RE: [hwloc-users] hwloc tutorial material
Here's the primary issue (at least my primary issue): computational problem spaces have structure, on a continuum from regular to amorphous. The data has a size and structure, and so does the program execution graph (a network graph that is not always the same, since it is conditioned by the data). Given these conditions, how does one look at the compute capability of an existing cluster and configure the compute fabric (shmem distribution and the associated affinities across various devices) to address the problem effectively and efficiently?

Our current solution tends toward: the programmer hard-codes the solution, or the user applies heuristics to make those determinations.

We have cast this as a CUDA problem, but it is more universal than that, with other MPP languages (as you mentioned), Xeon Phi, other GPUs, and FPGAs. In a heterogeneous cluster, the asymmetries may complicate the solution (as may nodes being down and checkpoint/restart schedules). Of course, it is incumbent on the hardware to reflect information about its capability (beyond the scope of hwloc). Sure, we can poll the nodes and use cudaGetDeviceProperties to build up potential graphs (and put them in an XML DOM or other data structure), but even there we generally (still) have to use associated lookup tables (IMO a cheesy option in this day and age, but I digress).

I think I understand the general direction for HPC computation using OpenMPI w/ hwloc. Perhaps a more flexible MPI using MPI_Dist_Graph (missing at present) is warranted?

I'll get off my stump now.

-----Original Message-----
From: Brice Goglin [mailto:brice.gog...@inria.fr]
Sent: Tuesday, January 22, 2013 5:15 AM
To: Kenneth A. Lloyd
Cc: Hardware locality user list
Subject: Re: [hwloc-users] hwloc tutorial material

On 22/01/2013 10:27, Samuel Thibault wrote:
> Kenneth A. Lloyd, on Mon 21 Jan 2013 22:46:37 +0100, wrote:
>> Thanks for making this tutorial available. Using hwloc 1.7, how far
>> down into, say, NVIDIA cards can the architecture be reflected?
>> Global memory size? SMX cores? None of the above?
> None of the above for now. Both are available in the cuda svn branch,
> however.
> Now the question to Kenneth is "what do YOU need?"

I haven't merged the GPU internals into the trunk yet because I'd like to see whether that matches what we would do with OpenCL and other accelerators such as the Xeon Phi.

One thing to keep in mind is that most hwloc/GPU users will use hwloc to get locality information, but they will also still use CUDA to use the GPU. So they will still be able to use CUDA to get in-depth GPU information anyway. The question is then how much CUDA info we want to duplicate in hwloc. hwloc could have the basic/uniform GPU information and let users rely on CUDA for everything CUDA-specific, for instance. Right now, the basic/uniform part is almost empty (it just contains the GPU model name or so).

Also, the CUDA branch creates hwloc objects inside the GPU to describe the memory/cores/caches/... Would you use these objects in your application? Or would you rather just have a basic GPU attribute structure containing the number of SMXs, the memory size, ...? One problem with this is that it may be hard to define a structure that works for all GPUs, even only the NVIDIA ones. We may need a union of structs...

I am talking about "your application" above because having lstopo draw very nice GPU internals doesn't mean the corresponding hwloc objects are useful to real applications.

Brice
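
For concreteness, here is a minimal sketch of the kind of per-node polling Kenneth describes: it enumerates CUDA devices with cudaGetDeviceProperties and asks hwloc which CPUs are close to each one. It assumes hwloc was built with CUDA support so that the hwloc/cudart.h helper hwloc_cudart_get_device_cpuset is available; the exact helpers and build requirements may differ between hwloc versions.

/* Sketch: combine cudaGetDeviceProperties (device capability) with
 * hwloc (locality) for each GPU on the local node.
 * Assumes hwloc was configured with CUDA support. */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>
#include <hwloc/cudart.h>
#include <cuda_runtime.h>

int main(void)
{
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    int count = 0;
    cudaGetDeviceCount(&count);

    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        /* Ask hwloc which CPUs are physically close to this GPU. */
        hwloc_cpuset_t cpuset = hwloc_bitmap_alloc();
        if (!hwloc_cudart_get_device_cpuset(topology, i, cpuset)) {
            char *str;
            hwloc_bitmap_asprintf(&str, cpuset);
            printf("GPU %d (%s, %d SMs, %zu MB global mem): near CPUs %s\n",
                   i, prop.name, prop.multiProcessorCount,
                   prop.totalGlobalMem >> 20, str);
            free(str);
        }
        hwloc_bitmap_free(cpuset);
    }

    hwloc_topology_destroy(topology);
    return 0;
}

Whether hwloc itself should expose the SM count and memory size (so the CUDA query above becomes unnecessary) is exactly the question Brice raises.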