Re: [hwloc-users] Using hwloc to map GPU layout on system

2014-02-14 Thread Brock Palen

On Feb 7, 2014, at 9:45 AM, Brice Goglin  wrote:

> On 06/02/2014 21:31, Brock Palen wrote:
>> Actually that did turn out to help. The nvml# devices appear to be numbered 
>> in the way that CUDA_VISIBLE_DEVICES sees them, while the cuda# devices are 
>> in the order that PBS and nvidia-smi see them.
> 
> By the way, did you have CUDA_VISIBLE_DEVICES set during the lstopo below? 
> Was it set to 2,3,0,1 ? That would explain the reordering.

It was not set, and I have double checked it just now to be sure.

> 
> I am not sure in which order you want to do things in the end. One way that 
> could help is:
> * Get the locality of each GPU by doing CUDA_VISIBLE_DEVICES=x (for x in 
> 0..number of gpus-1). Each iteration gives a single GPU in hwloc, and you can 
> retrieve the corresponding locality from the cuda0 object.
> * Once you know which GPUs you want based on the locality info, take the 
> corresponding #x and put them in CUDA_VISIBLE_DEVICES=x,y before you run your 
> program. hwloc will create cuda0 for x and cuda1 for y.

The cuda IDs match the order you get from nvidia-smi (which lists the PCI 
addresses).

The nvml IDs match the order in which they start. That is, 
CUDA_VISIBLE_DEVICES=0 / cudaSetDevice(0) matches nvml0, which in turn matches 
ID 2 for CoProc cuda2 and for nvidia-smi.

This appears to be very consistent between reboots.
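
If that pairing really is stable, translating the batch-assigned (nvidia-smi-order) GPU numbers into CUDA_VISIBLE_DEVICES values is just a table lookup. A minimal Python sketch, assuming the cuda#/nvml# pairing below was extracted from the lstopo output in this thread (it is not guaranteed to hold on other nodes or driver versions):

import os

# cuda# index (PBS/nvidia-smi order) -> nvml# index (CUDA enumeration order),
# taken from the lstopo topology quoted in this thread; regenerate per node.
pbs_to_cuda = {0: 2, 1: 3, 2: 0, 3: 1}

def visible_devices(pbs_gpus):
    """Translate PBS-assigned GPU ids into a CUDA_VISIBLE_DEVICES string."""
    return ",".join(str(pbs_to_cuda[g]) for g in sorted(pbs_gpus))

# Example: the batch system handed the job GPUs 0 and 1 (nvidia-smi numbering).
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices([0, 1])  # -> "2,3"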
> 
> If you don't set CUDA_VISIBLE_DEVICES, cuda* objects are basically 
> out-of-order. nvml objects are (a bit less likely) ordered by PCI bus id 
> (lstopo -v would confirm that).

Yes, the nvml ordering is by ascending PCI ID; nvidia-smi shows this:

[root@nyx7500 ~]# nvidia-smi | grep Tesla
|   0  Tesla K20Xm Off  | 0000:09:00.0 Off |0 |
|   1  Tesla K20Xm Off  | 0000:0A:00.0 Off |0 |
|   2  Tesla K20Xm Off  | 0000:0D:00.0 Off |0 |
|   3  Tesla K20Xm Off  | 0000:0E:00.0 Off |0 |
|   4  Tesla K20Xm Off  | 0000:28:00.0 Off |0 |
|   5  Tesla K20Xm Off  | 0000:2B:00.0 Off |0 |
|   6  Tesla K20Xm Off  | 0000:30:00.0 Off |0 |
|   7  Tesla K20Xm Off  | 0000:33:00.0 Off |0 |

[root@nyx7500 ~]# lstopo -v
Machine (P#0 total=67073288KB DMIProductName="ProLiant SL270s Gen8   " 
DMIProductVersion= DMIProductSerial="USE3267A92  " 
DMIProductUUID=36353439-3437-5553-4533-323637413932 DMIBoardVendor=HP 
DMIBoardName= DMIBoardVersion= DMIBoardSerial="USE3267A92  " 
DMIBoardAssetTag="" DMIChassisVendor=HP DMIChassisType=25 
DMIChassisVersion= DMIChassisSerial="USE3267A90  " DMIChassisAssetTag=" 
   " DMIBIOSVendor=HP DMIBIOSVersion=P75 DMIBIOSDate=09/18/2013 DMISysVendor=HP 
Backend=Linux LinuxCgroup=/ OSName=Linux OSRelease=2.6.32-358.23.2.el6.x86_64 
OSVersion="#1 SMP Sat Sep 14 05:32:37 EDT 2013" 
HostName=nyx7500.engin.umich.edu Architecture=x86_64)
  NUMANode L#0 (P#0 local=33518860KB total=33518860KB)
    Socket L#0 (P#0 CPUModel="Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz" CPUVendor=GenuineIntel CPUModelNumber=45 CPUFamilyNumber=6)
      L3Cache L#0 (size=20480KB linesize=64 ways=20)
        L2Cache L#0 (size=256KB linesize=64 ways=8)
          L1dCache L#0 (size=32KB linesize=64 ways=8)
            L1iCache L#0 (size=32KB linesize=64 ways=8)
              Core L#0 (P#0)
                PU L#0 (P#0)
        L2Cache L#1 (size=256KB linesize=64 ways=8)
          L1dCache L#1 (size=32KB linesize=64 ways=8)
            L1iCache L#1 (size=32KB linesize=64 ways=8)
              Core L#1 (P#1)
                PU L#1 (P#1)
        L2Cache L#2 (size=256KB linesize=64 ways=8)
          L1dCache L#2 (size=32KB linesize=64 ways=8)
            L1iCache L#2 (size=32KB linesize=64 ways=8)
              Core L#2 (P#2)
                PU L#2 (P#2)
        L2Cache L#3 (size=256KB linesize=64 ways=8)
          L1dCache L#3 (size=32KB linesize=64 ways=8)
            L1iCache L#3 (size=32KB linesize=64 ways=8)
              Core L#3 (P#3)
                PU L#3 (P#3)
        L2Cache L#4 (size=256KB linesize=64 ways=8)
          L1dCache L#4 (size=32KB linesize=64 ways=8)
            L1iCache L#4 (size=32KB linesize=64 ways=8)
              Core L#4 (P#4)
                PU L#4 (P#4)
        L2Cache L#5 (size=256KB linesize=64 ways=8)
          L1dCache L#5 (size=32KB linesize=64 ways=8)
            L1iCache L#5 (size=32KB linesize=64 ways=8)
              Core L#5 (P#5)
                PU L#5 (P#5)
        L2Cache L#6 (size=256KB linesize=64 ways=8)
          L1dCache L#6 (size=32KB linesize=64 ways=8)
            L1iCache L#6 (size=32KB linesize=64 ways=8)
              Core L#6 (P#6)
                PU L#6 (P#6)
        L2Cache L#7 (size=256KB linesize=64 ways=8)
          L1dCache L#7 (size=32KB linesize=64 ways=8)
            L1iCache L#7 (size=32KB linesize=64 ways=8)
  

Re: [hwloc-users] Using hwloc to map GPU layout on system

2014-02-07 Thread Brice Goglin
On 06/02/2014 21:31, Brock Palen wrote:
> Actually that did turn out to help. The nvml# devices appear to be numbered 
> in the way that CUDA_VISIBLE_DEVICES sees them, while the cuda# devices are 
> in the order that PBS and nvidia-smi see them.

By the way, did you have CUDA_VISIBLE_DEVICES set during the lstopo
below? Was it set to 2,3,0,1 ? That would explain the reordering.

I am not sure in which order you want to do things in the end. One way
that could help is:
* Get the locality of each GPU by doing CUDA_VISIBLE_DEVICES=x (for x in
0..number of gpus-1). Each iteration gives a single GPU in hwloc, and
you can retrieve the corresponding locality from the cuda0 object.
* Once you know which GPUs you want based on the locality info, take the
corresponding #x and put them in CUDA_VISIBLE_DEVICES=x,y before you run
your program. hwloc will create cuda0 for x and cuda1 for y.
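
A rough Python sketch of that first step, driving lstopo once per GPU with the
environment variable set and pulling the cpuset of the object containing cuda0
out of the XML. The "lstopo --of xml -" invocation and the XML attribute names
are assumptions about your lstopo version, so adjust as needed:

import os
import subprocess
import xml.etree.ElementTree as ET

def cuda_locality(index):
    """Expose only GPU `index` to CUDA, then ask lstopo where cuda0 ends up.

    Returns the cpuset of the first ancestor object that carries one
    (typically the NUMANode, otherwise the whole Machine).
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(index))
    xml_text = subprocess.check_output(["lstopo", "--of", "xml", "-"], env=env)
    root = ET.fromstring(xml_text)

    # ElementTree has no parent pointers, so build a child -> parent map once.
    parents = {child: parent for parent in root.iter() for child in parent}
    for obj in root.iter("object"):
        if obj.get("type") == "OSDev" and obj.get("name") == "cuda0":
            node = parents.get(obj)
            while node is not None and node.get("cpuset") is None:
                node = parents.get(node)
            return node.get("cpuset") if node is not None else None
    return None

for x in range(4):          # number of GPUs on the node
    print(x, cuda_locality(x))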

If you don't set CUDA_VISIBLE_DEVICES, cuda* objects are basically
out-of-order. nvml objects are (a bit less likely) ordered by PCI bus id
(lstopo -v would confirm that).

Brice



>
>   PCIBridge
> PCIBridge
>   PCIBridge
> PCI 10de:1021
>   CoProc L#2 "cuda0"
>   GPU L#3 "nvml2"
>   PCIBridge
> PCI 10de:1021
>   CoProc L#4 "cuda1"
>   GPU L#5 "nvml3"
>   PCIBridge
> PCIBridge
>   PCIBridge
> PCI 10de:1021
>   CoProc L#6 "cuda2"
>   GPU L#7 "nvml0"
>   PCIBridge
> PCI 10de:1021
>   CoProc L#8 "cuda3"
>   GPU L#9 "nvml1"
>
>
> Right now I am trying to create a python script that will take the XML output 
> of lstopo and give me just the cuda and nvml devices in order. 
>
> I don't know if some values are deterministic though.  Could I ignore the 
> CoProc line and just use the:
>
>   GPU L#3 "nvml2"
>   GPU L#5 "nvml3"
>   GPU L#7 "nvml0"
>   GPU L#9 "nvml1"
>
> Is the L# always going to be in the order I would expect?  Because then I 
> already have my map. 



Brice


>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
>
>
>
> On Feb 5, 2014, at 1:19 AM, Brice Goglin  wrote:
>
>> Hello Brock,
>>
>> Some people reported the same issue in the past and that's why we added the 
>> "nvml" objects. CUDA reorders devices by "performance". Batch-schedulers are 
>> somehow supposed to use "nvml" for managing GPUs without actually using them 
>> with CUDA directly. And the "nvml" order is the "normal" order.
>>
>> You need "tdk" (https://developer.nvidia.com/tesla-deployment-kit) to get 
>> nvml library and development headers installed. Then hwloc can build its 
>> "nvml" backend. Once ready, you'll see a hwloc "cudaX" and a hwloc "nvmlY" 
>> object in each NVIDIA PCI devices, and you can get their locality as usual.
>>
>> Does this help?
>>
>> Brice
>>
>>
>>
>> On 05/02/2014 05:25, Brock Palen wrote:
>>> We are trying to build a system to restrict users to the GPUs they were 
>>> assigned by our batch system (torque).
>>>
>>> The batch system sets the GPUs into thread-exclusive mode when assigned to 
>>> a job, so we want the GPU that the batch system assigns to be the one set 
>>> in CUDA_VISIBLE_DEVICES.
>>>
>>> The problem is that on our nodes, what the batch system sees as GPU 0 is 
>>> not the same GPU that CUDA_VISIBLE_DEVICES sees as 0. Actually, 0 is 2.
>>>
>>> You can see this behavior if you run nvidia-smi and look at the PCI IDs of 
>>> the devices. You can then look at the PCI IDs output by deviceQuery from 
>>> the SDK examples and see they are in a different order.
>>>
>>> The IDs you would set in CUDA_VISIBLE_DEVICES match the order that 
>>> deviceQuery sees, not the order that nvidia-smi sees.
>>>
>>> Example (all values turned to decimal to match deviceQuery):
>>>
>>> nvidia-smi order: 9, 10, 13, 14, 40, 43, 48, 51
>>> deviceQuery order: 13, 14, 9, 10, 40, 43, 48, 51
>>>
>>>
>>> Can hwloc help me with this? Right now I am hacking a script based on the 
>>> output of the two commands, making a map between the two, and then setting 
>>> CUDA_VISIBLE_DEVICES.
>>>
>>> Any ideas would be great. Later as we currently also use CPU sets, we want 
>>> to pass GPU locality information to the scheduler to make decisions to 
>>> match GPU-> CPU Socket information, as performance of threads across QPI 
>>> domains is very poor. 
>>>
>>> Thanks
>>>
>>> Machine (64GB)
>>>   NUMANode L#0 (P#0 32GB)
>>> Socket L#0 + L3 L#0 (20MB)
>>>   L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 
>>> (P#0)
>>>   L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 
>>> (P#1)
>>>   L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 
>>> (P#2)
>>>   L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 
>>> 

Re: [hwloc-users] Using hwloc to map GPU layout on system

2014-02-06 Thread Brice Goglin
On 06/02/2014 21:31, Brock Palen wrote:
> Actually that did turn out to help. The nvml# devices appear to be numbered 
> in the way that CUDA_VISIBLE_DEVICES sees them, while the cuda# devices are 
> in the order that PBS and nvidia-smi see them.
>
>   PCIBridge
> PCIBridge
>   PCIBridge
> PCI 10de:1021
>   CoProc L#2 "cuda0"
>   GPU L#3 "nvml2"
>   PCIBridge
> PCI 10de:1021
>   CoProc L#4 "cuda1"
>   GPU L#5 "nvml3"
>   PCIBridge
> PCIBridge
>   PCIBridge
> PCI 10de:1021
>   CoProc L#6 "cuda2"
>   GPU L#7 "nvml0"
>   PCIBridge
> PCI 10de:1021
>   CoProc L#8 "cuda3"
>   GPU L#9 "nvml1"
>
>
> Right now I am trying to create a python script that will take the XML output 
> of lstopo and give me just the cuda and nvml devices in order. 
>
> I don't know if some values are deterministic though.  Could I ignore the 
> CoProc line and just use the:
>
>   GPU L#3 "nvml2"
>   GPU L#5 "nvml3"
>   GPU L#7 "nvml0"
>   GPU L#9 "nvml1"
>
> Is the L# always going to be in the order I would expect?  Because then I 
> already have my map. 

I am surprised that the CUDA and NVML orders are different here. I was
told that CUDA was reordering by computing power, but your GPUs appear
to be identical models. So CUDA may be reordering based on some other criterion.

Just found this in the nvml doc "The order in which NVML enumerates
units has no guarantees of consistency between reboots. For that reason
it is recommended that devices be looked up by their PCI ids or board
serial numbers. See nvmlDeviceGetHandleBySerial() and
nvmlDeviceGetHandleByPciBusId()."
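
For what it's worth, the lookup the documentation recommends is easy to try
from Python with the pynvml bindings (nvidia-ml-py). A minimal sketch,
assuming those bindings are installed; the PCI bus id passed at the end is
just one of the addresses from this thread, used as an example:

from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
                    nvmlDeviceGetHandleByIndex, nvmlDeviceGetPciInfo,
                    nvmlDeviceGetHandleByPciBusId)

nvmlInit()
try:
    # Print each NVML index with its PCI bus id, to compare against
    # nvidia-smi and against hwloc's nvml* objects.
    for i in range(nvmlDeviceGetCount()):
        pci = nvmlDeviceGetPciInfo(nvmlDeviceGetHandleByIndex(i))
        print("nvml index %d -> PCI %s" % (i, pci.busId))

    # The stable lookup the NVML docs recommend instead of trusting the index:
    handle = nvmlDeviceGetHandleByPciBusId("0000:09:00.0")
finally:
    nvmlShutdown()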

At least, is the NVML order following the PCI bus order?

We may want to talk to NVIDIA to get a clarification about all this.

Brice



>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
>
>
>
> On Feb 5, 2014, at 1:19 AM, Brice Goglin  wrote:
>
>> Hello Brock,
>>
>> Some people reported the same issue in the past and that's why we added the 
>> "nvml" objects. CUDA reorders devices by "performance". Batch-schedulers are 
>> somehow supposed to use "nvml" for managing GPUs without actually using them 
>> with CUDA directly. And the "nvml" order is the "normal" order.
>>
>> You need "tdk" (https://developer.nvidia.com/tesla-deployment-kit) to get 
>> nvml library and development headers installed. Then hwloc can build its 
>> "nvml" backend. Once ready, you'll see a hwloc "cudaX" and a hwloc "nvmlY" 
>> object in each NVIDIA PCI devices, and you can get their locality as usual.
>>
>> Does this help?
>>
>> Brice
>>
>>
>>
>> On 05/02/2014 05:25, Brock Palen wrote:
>>> We are trying to build a system to restrict users to the GPUs they were 
>>> assigned by our batch system (torque).
>>>
>>> The batch system sets the GPUs into thread-exclusive mode when assigned to 
>>> a job, so we want the GPU that the batch system assigns to be the one set 
>>> in CUDA_VISIBLE_DEVICES.
>>>
>>> The problem is that on our nodes, what the batch system sees as GPU 0 is 
>>> not the same GPU that CUDA_VISIBLE_DEVICES sees as 0. Actually, 0 is 2.
>>>
>>> You can see this behavior if you run nvidia-smi and look at the PCI IDs of 
>>> the devices. You can then look at the PCI IDs output by deviceQuery from 
>>> the SDK examples and see they are in a different order.
>>>
>>> The IDs you would set in CUDA_VISIBLE_DEVICES match the order that 
>>> deviceQuery sees, not the order that nvidia-smi sees.
>>>
>>> Example (all values turned to decimal to match deviceQuery):
>>>
>>> nvidia-smi order: 9, 10, 13, 14, 40, 43, 48, 51
>>> deviceQuery order: 13, 14, 9, 10, 40, 43, 48, 51
>>>
>>>
>>> Can hwloc help me with this? Right now I am hacking a script based on the 
>>> output of the two commands, making a map between the two, and then setting 
>>> CUDA_VISIBLE_DEVICES.
>>>
>>> Any ideas would be great. Later as we currently also use CPU sets, we want 
>>> to pass GPU locality information to the scheduler to make decisions to 
>>> match GPU-> CPU Socket information, as performance of threads across QPI 
>>> domains is very poor. 
>>>
>>> Thanks
>>>
>>> Machine (64GB)
>>>   NUMANode L#0 (P#0 32GB)
>>> Socket L#0 + L3 L#0 (20MB)
>>>   L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 
>>> (P#0)
>>>   L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 
>>> (P#1)
>>>   L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 
>>> (P#2)
>>>   L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 
>>> (P#3)
>>>   L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 
>>> (P#4)
>>>   L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 
>>> 

Re: [hwloc-users] Using hwloc to map GPU layout on system

2014-02-06 Thread Samuel Thibault
Brock Palen, on Thu 06 Feb 2014 21:31:42 +0100, wrote:
>   GPU L#3 "nvml2"
>   GPU L#5 "nvml3"
>   GPU L#7 "nvml0"
>   GPU L#9 "nvml1"
> 
> Is the L# always going to be in the order I would expect?  Because then I 
> already have my map. 

No, L# is just following the machine topology. CUDA numbering does not
necessarily follow that (e.g. if a slow GPU is somewhere in the middle).

Samuel


Re: [hwloc-users] Using hwloc to map GPU layout on system

2014-02-06 Thread Brock Palen
Actually that did turn out to help. The nvml# devices appear to be numbered in 
the way that CUDA_VISIBLE_DEVICES sees them, while the cuda# devices are in the 
order that PBS and nvidia-smi see them.

  PCIBridge
PCIBridge
  PCIBridge
PCI 10de:1021
  CoProc L#2 "cuda0"
  GPU L#3 "nvml2"
  PCIBridge
PCI 10de:1021
  CoProc L#4 "cuda1"
  GPU L#5 "nvml3"
  PCIBridge
PCIBridge
  PCIBridge
PCI 10de:1021
  CoProc L#6 "cuda2"
  GPU L#7 "nvml0"
  PCIBridge
PCI 10de:1021
  CoProc L#8 "cuda3"
  GPU L#9 "nvml1"


Right now I am trying to create a python script that will take the XML output 
of lstopo and give me just the cuda and nvml devices in order. 

I don't know if some values are deterministic though.  Could I ignore the 
CoProc line and just use the:

  GPU L#3 "nvml2"
  GPU L#5 "nvml3"
  GPU L#7 "nvml0"
  GPU L#9 "nvml1"

Is the L# always going to be in the order I would expect?  Because then I 
already have my map.
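
For what it's worth, here is a rough sketch of such a script that does not
rely on the L# ordering at all: it pairs up the cuda* and nvml* osdevs that
hang off the same PCI device in lstopo's XML output. The "lstopo --of xml -"
invocation and the attribute names are assumptions about the hwloc version in
use:

import re
import subprocess
import xml.etree.ElementTree as ET

root = ET.fromstring(subprocess.check_output(["lstopo", "--of", "xml", "-"]))

mapping = {}
for pcidev in root.iter("object"):
    if pcidev.get("type") != "PCIDev":
        continue
    # osdev children of one PCI device, e.g. ["cuda2", "nvml0"]
    names = [child.get("name", "") for child in pcidev.findall("object")
             if child.get("type") == "OSDev"]
    cuda = next((n for n in names if n.startswith("cuda")), None)
    nvml = next((n for n in names if n.startswith("nvml")), None)
    if cuda and nvml:
        mapping[int(re.sub(r"\D", "", cuda))] = int(re.sub(r"\D", "", nvml))

print(mapping)   # e.g. {0: 2, 1: 3, 2: 0, 3: 1} on the node shown above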

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



On Feb 5, 2014, at 1:19 AM, Brice Goglin  wrote:

> Hello Brock,
> 
> Some people reported the same issue in the past and that's why we added the 
> "nvml" objects. CUDA reorders devices by "performance". Batch-schedulers are 
> somehow supposed to use "nvml" for managing GPUs without actually using them 
> with CUDA directly. And the "nvml" order is the "normal" order.
> 
> You need "tdk" (https://developer.nvidia.com/tesla-deployment-kit) to get 
> nvml library and development headers installed. Then hwloc can build its 
> "nvml" backend. Once ready, you'll see a hwloc "cudaX" and a hwloc "nvmlY" 
> object in each NVIDIA PCI devices, and you can get their locality as usual.
> 
> Does this help?
> 
> Brice
> 
> 
> 
> On 05/02/2014 05:25, Brock Palen wrote:
>> We are trying to build a system to restrict users to the GPUs they were 
>> assigned by our batch system (torque).
>> 
>> The batch system sets the GPUs into thread-exclusive mode when assigned to 
>> a job, so we want the GPU that the batch system assigns to be the one set in 
>> CUDA_VISIBLE_DEVICES.
>> 
>> The problem is that on our nodes, what the batch system sees as GPU 0 is not 
>> the same GPU that CUDA_VISIBLE_DEVICES sees as 0. Actually, 0 is 2.
>> 
>> You can see this behavior if you run nvidia-smi and look at the PCI IDs of 
>> the devices. You can then look at the PCI IDs output by deviceQuery from the 
>> SDK examples and see they are in a different order.
>> 
>> The IDs you would set in CUDA_VISIBLE_DEVICES match the order that 
>> deviceQuery sees, not the order that nvidia-smi sees.
>> 
>> Example (all values turned to decimal to match deviceQuery):
>> 
>> nvidia-smi order: 9, 10, 13, 14, 40, 43, 48, 51
>> deviceQuery order: 13, 14, 9, 10, 40, 43, 48, 51
>> 
>> 
>> Can hwloc help me with this? Right now I am hacking a script based on the 
>> output of the two commands, making a map between the two, and then setting 
>> CUDA_VISIBLE_DEVICES.
>> 
>> Any ideas would be great. Later as we currently also use CPU sets, we want 
>> to pass GPU locality information to the scheduler to make decisions to match 
>> GPU-> CPU Socket information, as performance of threads across QPI domains 
>> is very poor. 
>> 
>> Thanks
>> 
>> Machine (64GB)
>>   NUMANode L#0 (P#0 32GB)
>> Socket L#0 + L3 L#0 (20MB)
>>   L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 
>> (P#0)
>>   L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 
>> (P#1)
>>   L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 
>> (P#2)
>>   L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 
>> (P#3)
>>   L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 
>> (P#4)
>>   L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 
>> (P#5)
>>   L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 
>> (P#6)
>>   L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 
>> (P#7)
>> HostBridge L#0
>>   PCIBridge
>> PCI 1000:0087
>>   Block L#0 "sda"
>>   Block L#1 "sdb"
>>   PCIBridge
>> PCIBridge
>>   PCIBridge
>> PCI 10de:1021
>>   CoProc L#2 "cuda0"
>>   PCIBridge
>> PCI 10de:1021
>>   CoProc L#3 "cuda1"
>>   PCIBridge
>> PCIBridge
>>   PCIBridge
>> PCI 10de:1021
>>   CoProc L#4 "cuda2"
>>   PCIBridge
>> PCI 10de:1021
>>   CoProc L#5 "cuda3"
>>   PCIBridge
>> PCI 8086:1521
>>   Net L#6 "eth0"
>> PCI 8086:1521
>>  

Re: [hwloc-users] Using hwloc to map GPU layout on system

2014-02-05 Thread Brice Goglin
Hello Brock,

Some people reported the same issue in the past and that's why we added
the "nvml" objects. CUDA reorders devices by "performance".
Batch-schedulers are somehow supposed to use "nvml" for managing GPUs
without actually using them with CUDA directly. And the "nvml" order is
the "normal" order.

You need "tdk" (https://developer.nvidia.com/tesla-deployment-kit) to
get nvml library and development headers installed. Then hwloc can build
its "nvml" backend. Once ready, you'll see a hwloc "cudaX" and a hwloc
"nvmlY" object in each NVIDIA PCI devices, and you can get their
locality as usual.
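
If you prefer not to write C against the hwloc API, the same locality
information can be pulled out of lstopo's XML export. A minimal Python sketch,
assuming "lstopo --of xml -" writes the topology XML to stdout and that the
I/O objects sit under their local NUMANode, as they do on your machine:

import subprocess
import xml.etree.ElementTree as ET

root = ET.fromstring(subprocess.check_output(["lstopo", "--of", "xml", "-"]))
# ElementTree has no parent pointers, so build a child -> parent map once.
parents = {child: parent for parent in root.iter() for child in parent}

# Report which NUMANode each cuda*/nvml* osdev sits under.
for obj in root.iter("object"):
    name = obj.get("name", "")
    if obj.get("type") == "OSDev" and name.startswith(("cuda", "nvml")):
        node = parents.get(obj)
        while node is not None and node.get("type") != "NUMANode":
            node = parents.get(node)
        numa = node.get("os_index") if node is not None else "?"
        print("%s is close to NUMANode P#%s" % (name, numa))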

Does this help?

Brice



On 05/02/2014 05:25, Brock Palen wrote:
> We are trying to build a system to restrict users to the GPUs they were 
> assigned by our batch system (torque).
>
> The batch system sets the GPUs into thread-exclusive mode when assigned to a 
> job, so we want the GPU that the batch system assigns to be the one set in 
> CUDA_VISIBLE_DEVICES.
>
> The problem is that on our nodes, what the batch system sees as GPU 0 is not 
> the same GPU that CUDA_VISIBLE_DEVICES sees as 0. Actually, 0 is 2.
>
> You can see this behavior if you run nvidia-smi and look at the PCI IDs of 
> the devices. You can then look at the PCI IDs output by deviceQuery from the 
> SDK examples and see they are in a different order.
>
> The IDs you would set in CUDA_VISIBLE_DEVICES match the order that 
> deviceQuery sees, not the order that nvidia-smi sees.
>
> Example (all values turned to decimal to match deviceQuery):
>
> nvidia-smi order: 9, 10, 13, 14, 40, 43, 48, 51
> deviceQuery order: 13, 14, 9, 10, 40, 43, 48, 51
>
>
> Can hwloc help me with this? Right now I am hacking a script based on the 
> output of the two commands, making a map between the two, and then setting 
> CUDA_VISIBLE_DEVICES.
>
> Any ideas would be great. Later as we currently also use CPU sets, we want to 
> pass GPU locality information to the scheduler to make decisions to match 
> GPU-> CPU Socket information, as performance of threads across QPI domains is 
> very poor. 
>
> Thanks
>
> Machine (64GB)
>   NUMANode L#0 (P#0 32GB)
> Socket L#0 + L3 L#0 (20MB)
>   L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 
> (P#0)
>   L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 
> (P#1)
>   L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 
> (P#2)
>   L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 
> (P#3)
>   L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 
> (P#4)
>   L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 
> (P#5)
>   L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 
> (P#6)
>   L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 
> (P#7)
> HostBridge L#0
>   PCIBridge
> PCI 1000:0087
>   Block L#0 "sda"
>   Block L#1 "sdb"
>   PCIBridge
> PCIBridge
>   PCIBridge
> PCI 10de:1021
>   CoProc L#2 "cuda0"
>   PCIBridge
> PCI 10de:1021
>   CoProc L#3 "cuda1"
>   PCIBridge
> PCIBridge
>   PCIBridge
> PCI 10de:1021
>   CoProc L#4 "cuda2"
>   PCIBridge
> PCI 10de:1021
>   CoProc L#5 "cuda3"
>   PCIBridge
> PCI 8086:1521
>   Net L#6 "eth0"
> PCI 8086:1521
>   Net L#7 "eth1"
>   PCIBridge
> PCI 102b:0533
>   PCI 8086:1d02
>   NUMANode L#1 (P#1 32GB)
> Socket L#1 + L3 L#1 (20MB)
>   L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 
> (P#8)
>   L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 
> (P#9)
>   L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU 
> L#10 (P#10)
>   L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU 
> L#11 (P#11)
>   L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU 
> L#12 (P#12)
>   L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU 
> L#13 (P#13)
>   L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU 
> L#14 (P#14)
>   L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU 
> L#15 (P#15)
> HostBridge L#12
>   PCIBridge
> PCIBridge
>   PCIBridge
> PCI 15b3:1003
>   Net L#8 "eth2"
>   Net L#9 "ib0"
>   Net L#10 "eoib0"
>   OpenFabrics L#11 "mlx4_0"
>   PCIBridge
> PCIBridge
>   PCIBridge
> PCI 10de:1021
>   CoProc L#12 "cuda4"
>   PCIBridge
> PCI 10de:1021
>   CoProc L#13 "cuda5"
>   PCIBridge
> PCIBridge
>   PCIBridge
> PCI 10de:1021
>   CoProc L#14 "cuda6"
>   PCIBridge
> PCI 10de:1021
>