Hello,

We currently have three GPU-related branches:

(1) an (old) CUDA branch that adds "cuda0", "cuda1", ... devices inside PCI devices and then puts Core and Memory in there to describe the GPU internals.
(2) a (new) NVML branch that adds "nvml0", "nvml1", ... devices inside NVIDIA GPU PCI devices (the order can be different in NVML and CUDA). This is used by batch schedulers to retrieve NVIDIA GPU locality.
(3) a (new) OpenCL branch that adds "opencl0p0", ... devices inside AMD GPU PCI devices.

I am going to merge the basics of (1), (2) and (3) by the end of the year so that users can easily retrieve the locality of CUDA/NVML/OpenCL devices. They'll have functions to convert a device pointer into an hwloc object, a device index into an object, or a device pointer into a cpuset.

The main drawback is that the initialization of these libraries can be slow (about 1-2s added to lstopo, since it enables I/O by default) when the system is poorly configured: NVIDIA puts GPGPU devices in non-persistent mode by default, and AMD GPGPUs are slower if DISPLAY isn't set to :0. I will document how to avoid such issues; I'm not sure it's worth disabling all these plugins by default.

Then we'll talk about the remaining part of (1) (GPU internals). I still need to see if we can do something similar with OpenCL, find out which of the numbers (compute units, SIMD units, SIMD width) actually matter to users, and whether we can report all this in a somewhat portable way.

Brice
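
P.S. To make the three planned conversions more concrete, here is a toy model in plain C. Nothing in it is the actual hwloc API: the types and function names (toy_osdev, toy_topology, toy_get_osdev_by_index, toy_get_osdev, toy_get_device_cpuset) are made up for illustration, the cpuset is simplified to a 64-bit mask, and the device table is supplied by the caller instead of being discovered. The real helpers may well end up looking different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a GPU OS device object; NOT the real hwloc object type. */
struct toy_osdev {
    const char *name;    /* e.g. "cuda0" or "nvml1", as in the branches above */
    const void *handle;  /* opaque device pointer handed out by the GPU library */
    uint64_t    cpuset;  /* bit i set => CPU i is close to this device */
};

/* Toy stand-in for a topology: just a caller-provided table of GPU devices. */
struct toy_topology {
    const struct toy_osdev *devs;
    size_t                  ndevs;
};

/* Device index -> object. Returns NULL if the index is out of range. */
const struct toy_osdev *
toy_get_osdev_by_index(const struct toy_topology *topo, size_t idx)
{
    assert(topo != NULL);
    return idx < topo->ndevs ? &topo->devs[idx] : NULL;
}

/* Device pointer -> object. Returns NULL if the pointer is unknown. */
const struct toy_osdev *
toy_get_osdev(const struct toy_topology *topo, const void *handle)
{
    assert(topo != NULL);
    for (size_t i = 0; i < topo->ndevs; i++)
        if (topo->devs[i].handle == handle)
            return &topo->devs[i];
    return NULL;
}

/* Device pointer -> cpuset. Returns 0 on success, -1 if the pointer is unknown. */
int
toy_get_device_cpuset(const struct toy_topology *topo, const void *handle,
                      uint64_t *cpuset)
{
    const struct toy_osdev *dev = toy_get_osdev(topo, handle);
    if (dev == NULL)
        return -1;
    *cpuset = dev->cpuset;
    return 0;
}
```

A caller would fill a toy_topology with one entry per GPU and then look devices up either by index or by the pointer the GPU library returned. Since the device order can differ between NVML and CUDA, an index only makes sense within the library it came from, which is why the toy table would be per-library.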