Thank you so much! We are going to try this out and I'll let you know what we find. Yes, I think we got distracted by the "GPU" osdevs when the CUDA devices are what we actually want. I'll send you a status update on how it went soon.
Dave

From: Brice Goglin <brice.gog...@inria.fr>
To: hwloc-users@lists.open-mpi.org
Date: 07/07/2017 02:02 PM
Subject: Re: [hwloc-users] Why do I get such little information back about GPU's on my system
Sent by: "hwloc-users" <hwloc-users-boun...@lists.open-mpi.org>

On 07/07/2017 20:38, David Solt wrote:
> We are using the hwloc API to identify GPUs on our cluster. While we are
> able to "discover" the GPUs, other information about them does not appear
> to be getting filled in. See below for example:
>
> (gdb) p *obj->attr
> $20 = {
>   cache = {size = 1, depth = 0, linesize = 0, associativity = 0,
>     type = HWLOC_OBJ_CACHE_UNIFIED},
>   group = {depth = 1},
>   pcidev = {domain = 1, bus = 0 '\000', dev = 0 '\000', func = 0 '\000',
>     class_id = 0, vendor_id = 0, device_id = 0, subvendor_id = 0,
>     subdevice_id = 0, revision = 0 '\000', linkspeed = 0},
>   bridge = {
>     upstream = {pci = {domain = 1, bus = 0 '\000', dev = 0 '\000',
>         func = 0 '\000', class_id = 0, vendor_id = 0, device_id = 0,
>         subvendor_id = 0, subdevice_id = 0, revision = 0 '\000',
>         linkspeed = 0}},
>     upstream_type = HWLOC_OBJ_BRIDGE_HOST,
>     downstream = {pci = {domain = 0, secondary_bus = 0 '\000',
>         subordinate_bus = 0 '\000'}},
>     downstream_type = HWLOC_OBJ_BRIDGE_HOST,
>     depth = 0},
>   osdev = {type = HWLOC_OBJ_OSDEV_GPU}
> }
>
> The name is generally just "cardX".

Hello

attr is a union, so only the "osdev" portion above matters. An "osdev" can
be a lot of different things, so instead of having all possible attributes
in a struct, we use info key/value pairs (hwloc_obj->infos). But those
"cardX" devices are the GPUs reported by the Linux kernel DRM subsystem;
we don't have much information about them anyway.

If you're looking at Power machines, I am going to assume you care about
CUDA devices. Those are "osdev" objects of type "COPROC" instead of "GPU",
and they have many more attributes. Here's what I see on one of our
machines:

  PCI 10de:1094 (P#540672 busid=0000:84:00.0 class=0302(3D)
      PCIVendor="NVIDIA Corporation"
      PCIDevice="Tesla M2075 Dual-Slot Computing Processor Module")
    "NVIDIA Corporation Tesla M2075 Dual-Slot Computing Processor Module"
    Co-Processor L#5 (CoProcType=CUDA Backend=CUDA
      GPUVendor="NVIDIA Corporation" GPUModel="Tesla M2075"
      CUDAGlobalMemorySize=5428224 CUDAL2CacheSize=768
      CUDAMultiProcessors=14 CUDACoresPerMP=32
      CUDASharedMemorySizePerMP=48) "cuda2"

On recent kernels, you would see both a "cardX" GPU osdev and a "cudaX"
COPROC osdev in the PCI device. There can even be "nvmlX" and ":0.0" osdevs
if you have the nvml and nvctrl libraries. Those are basically different
ways to talk to the GPU (Linux kernel DRM, CUDA, etc.).

Given that I have never seen anybody use "cardX" for placing tasks/data
near a GPU, I am wondering if we should disable those by default, or maybe
rename "GPU" to something that wouldn't attract people as much, maybe
"DRM".

> Does this mean that the cards are not configured correctly? Or is there
> an additional flag that needs to be set to get this information?

Make sure "cuda" appears in the summary at the end of configure.

> Currently the code does:
>
>   hwloc_topology_init(&machine_topology);
>   hwloc_topology_set_flags(machine_topology,
>                            HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
>   hwloc_topology_load(machine_topology);
>
> And this is enough to identify the CPUs and GPUs, but any additional
> information - particularly the device and vendor IDs - seems not to be
> there. I tried this with the most recent release (1.11.7) and saw the
> same results. We tried this on a variety of PowerPC machines, and I
> think even some x86_64 machines, with similar results. Thoughts?
> Dave

BTW, it looks like you're not going to the OMPI dev meeting next week.
I'll be there if one of your colleagues wants to discuss this face to
face.

Brice
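For readers who want to try what Brice describes, here is a minimal sketch
(not from the thread itself) that iterates over OS devices, skips the DRM
"cardX" GPU osdevs, and reads the info key/value pairs off the CUDA
"COPROC" osdevs. It assumes hwloc 1.11.x built with CUDA support; names
like "topology" and the output formatting are illustrative only.

/* list_osdevs.c - sketch of enumerating hwloc OS devices.
 * Build (assumed): gcc list_osdevs.c -lhwloc -o list_osdevs */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;
    hwloc_obj_t osdev = NULL;
    unsigned i;

    hwloc_topology_init(&topology);
    /* I/O objects (PCI devices, OS devices) are not built by default. */
    hwloc_topology_set_flags(topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
    hwloc_topology_load(topology);

    /* Walk every OS device in the topology. */
    while ((osdev = hwloc_get_next_osdev(topology, osdev)) != NULL) {
        if (osdev->attr->osdev.type == HWLOC_OBJ_OSDEV_GPU) {
            /* DRM "cardX" device: hwloc attaches little info here. */
            printf("GPU osdev %s (DRM), skipping\n", osdev->name);
            continue;
        }
        if (osdev->attr->osdev.type != HWLOC_OBJ_OSDEV_COPROC)
            continue;

        /* CUDA coprocessor: attributes live in info key/value pairs,
         * not in the attr union. */
        printf("COPROC osdev %s\n", osdev->name);
        for (i = 0; i < osdev->infos_count; i++)
            printf("  %s = %s\n", osdev->infos[i].name,
                   osdev->infos[i].value);

        /* A single key can also be looked up directly. */
        const char *model = hwloc_obj_get_info_by_name(osdev, "GPUModel");
        if (model)
            printf("  model: %s\n", model);
    }

    hwloc_topology_destroy(topology);
    return 0;
}

On a machine like the one in Brice's lstopo output above, the COPROC
branch should print pairs such as GPUModel="Tesla M2075" and
CUDAMultiProcessors=14; on a machine where only "cardX" osdevs appear
(e.g. hwloc built without CUDA support), only the GPU branch fires, which
matches the behavior Dave reported.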
_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users