Re: [hwloc-users] Why do I get such little information back about GPU's on my system

David Solt Fri, 07 Jul 2017 12:52:23 -0700

Oh, Geoff Paulsen will be there at Open MPI meeting and he can help with 
the discussion.   We tried searching for


  // Iterate over each osdevice and identify the GPU's on each socket.
  while ((obj = hwloc_get_next_osdev(machine_topology, obj)) != NULL) {
    if (HWLOC_OBJ_OSDEV_COPROC == obj->attr->osdev.type) {   // was HWLOC_OBJ_OSDEV_GPU

Currently we are not finding any such devices.   Does this require that 
the cuda libraries be installed on the system for hwloc to find the 
hardware?

Thanks,
Dave



From:   "David Solt" <ds...@us.ibm.com>
To:     Hardware locality user list <hwloc-users@lists.open-mpi.org>
Date:   07/07/2017 02:21 PM
Subject:        Re: [hwloc-users] Why do I get such little information 
back about GPU's on my system
Sent by:        "hwloc-users" <hwloc-users-boun...@lists.open-mpi.org>



Thank you so much!   We are going to try this out and I'll let you know 
what we find.   Yes, I think we got distracted by GPU when CUDA is what we 
want.  I'll give you a status of how it went soon....

Dave



From:        Brice Goglin <brice.gog...@inria.fr>
To:        hwloc-users@lists.open-mpi.org
Date:        07/07/2017 02:02 PM
Subject:        Re: [hwloc-users] Why do I get such little information 
back about GPU's on my system
Sent by:        "hwloc-users" <hwloc-users-boun...@lists.open-mpi.org>



On 07/07/2017 20:38, David Solt wrote:
We are using the hwloc api to identify GPUs on our cluster. While we are 
able to "discover" the GPUs, other information about them does not appear 
to be getting filled in. See below for example: 
(gdb) p *obj->attr
$20 = {
  cache = {
    size = 1, 
    depth = 0, 
    linesize = 0, 
    associativity = 0, 
    type = HWLOC_OBJ_CACHE_UNIFIED
  }, 
  group = {
    depth = 1
  }, 
  pcidev = {
    domain = 1, 
    bus = 0 '\000', 
    dev = 0 '\000', 
    func = 0 '\000', 
    class_id = 0, 
    vendor_id = 0,
    device_id = 0, 
    subvendor_id = 0, 
    subdevice_id = 0, 
    revision = 0 '\000', 
    linkspeed = 0
  }, 
  bridge = {
    upstream = {
      pci = {
        domain = 1, 
        bus = 0 '\000', 
        dev = 0 '\000', 
        func = 0 '\000', 
        class_id = 0, 
        vendor_id = 0, 
        device_id = 0, 
        subvendor_id = 0, 
        subdevice_id = 0, 
        revision = 0 '\000', 
        linkspeed = 0
      }
    }, 
    upstream_type = HWLOC_OBJ_BRIDGE_HOST, 
    downstream = {
      pci = {
        domain = 0, 
        secondary_bus = 0 '\000', 
        subordinate_bus = 0 '\000'
      }
    }, 
    downstream_type = HWLOC_OBJ_BRIDGE_HOST, 
    depth = 0
  }, 
  osdev = {
    type = HWLOC_OBJ_OSDEV_GPU
  }
} 
The name is generally just "cardX".  

Hello

attr is a union, so only the "osdev" portion above matters. "osdev" can be 
a lot of different things. So instead of having all possible attributes in 
a struct, we use info key/value pairs (hwloc_obj->infos). But those 
"cardX" devices are the GPU reported by the Linux kernel DRM subsystem, we 
don't have much information about them anyway.
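
To illustrate the union point with a small sketch: the types below are hypothetical, trimmed-down stand-ins (not hwloc's real definitions), and the example assumes the enum has the same size as unsigned, as it does with common compilers. Storing an osdev type and then reading the same bytes through an unrelated member simply echoes the stored value, which is consistent with the gdb dump above, where cache.size, group.depth and pcidev.domain all show 1 next to type = HWLOC_OBJ_OSDEV_GPU.

```c
#include <string.h>

/* Hypothetical stand-ins for illustration only; hwloc's real union has
 * more members and different field layouts. GPU == 1 is an assumption
 * chosen to mirror the value seen in the gdb dump. */
enum mock_osdev_type { MOCK_OSDEV_BLOCK, MOCK_OSDEV_GPU, MOCK_OSDEV_NETWORK };

union mock_obj_attr {
    struct { unsigned depth; } group;            /* meaningful only for Group objects */
    struct { enum mock_osdev_type type; } osdev; /* meaningful only for OS devices */
};

/* Store an osdev type, then read the same storage through the "group" view. */
unsigned group_depth_seen_for(enum mock_osdev_type t)
{
    union mock_obj_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.osdev.type = t;
    return attr.group.depth; /* not a real depth: the bytes of osdev.type again */
}
```

Calling group_depth_seen_for(MOCK_OSDEV_GPU) returns the enum value rather than any real group depth, so for an OS device only the osdev member should be read.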

If you're looking at Power machines, I am going to assume you care about 
CUDA devices. Those are "osdev" objects of type "COPROC" instead of "GPU". 
They have many more attributes. Here's what I see on one of our machines:
  PCI 10de:1094 (P#540672 busid=0000:84:00.0 class=0302(3D) PCIVendor="NVIDIA Corporation" PCIDevice="Tesla M2075 Dual-Slot Computing Processor Module") "NVIDIA Corporation Tesla M2075 Dual-Slot Computing Processor Module"
    Co-Processor L#5 (CoProcType=CUDA Backend=CUDA GPUVendor="NVIDIA Corporation" GPUModel="Tesla M2075" CUDAGlobalMemorySize=5428224 CUDAL2CacheSize=768 CUDAMultiProcessors=14 CUDACoresPerMP=32 CUDASharedMemorySizePerMP=48) "cuda2"
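
As a side note on reading that output, the P# value of the PCI object appears consistent with the bus ID packed as (domain << 20) + (bus << 12) + (dev << 4) + func. That packing is an assumption inferred from the numbers shown here, not something stated in this thread, and the struct and function names below are made up for the sketch.

```c
/* Hypothetical helper type: the four components of a PCI bus ID. */
struct pci_busid {
    unsigned domain;
    unsigned bus;
    unsigned dev;
    unsigned func;
};

/* Unpack an index under the ASSUMED layout
 * (domain << 20) + (bus << 12) + (dev << 4) + func. */
struct pci_busid decode_pci_index(unsigned long idx)
{
    struct pci_busid id;
    id.func   = (unsigned)(idx & 0xfUL);
    id.dev    = (unsigned)((idx >> 4) & 0xffUL);
    id.bus    = (unsigned)((idx >> 12) & 0xffUL);
    id.domain = (unsigned)(idx >> 20);
    return id;
}
```

Under that assumption, 540672 is 0x84000, which decodes to domain 0, bus 0x84, device 0, function 0 and matches busid=0000:84:00.0 in the output.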

On recent kernels, you would see both a "cardX" GPU osdev and a "cudaX" 
COPROC osdev in the PCI device. There can even be "nvmlX" and ":0.0" if 
you have the nvml and nvctrl libraries. Those are basically different ways 
to talk to the GPU (Linux kernel DRM, CUDA, etc.).

Given that I have never seen anybody use "cardX" for placing task/data 
near a GPU, I am wondering if we should disable those by default. Or maybe 
rename "GPU" into something that wouldn't attract people as much, maybe 
"DRM".

Does this mean that the cards are not configured correctly? Or is there an 
additional flag that needs to be set to get this information? 

Make sure "cuda" appears in the summary at the end of the configure.

Currently the code does: 
  hwloc_topology_init(&machine_topology);
  hwloc_topology_set_flags(machine_topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
  hwloc_topology_load(machine_topology); 
And this is enough to identify the CPUs and GPUs, but any additional 
information - particularly the device and vendor IDs - seems not to be 
there.  
I tried this with the most recent release (1.11.7) and saw the same 
results.    
We tried this on a variety of PowerPC machines and I think even some 
x86_64 machines with similar results.    
Thoughts?
Dave

BTW, it looks like you're not going to the OMPI dev meeting next week. 
I'll be there if one of your colleagues wants to discuss this face to face.

Brice
_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users
