Dear Prometheus Developers,

I'm working on a feature to collect the link status of PCIe devices.

# Goal

The link status of PCIe devices sometimes changes, such as link width or speed downgrades, and devices sometimes disappear. Such failures often happen on servers with many PCIe devices (a bunch of NVMes or GPUs). I'd like to detect such failures with a PCIe device collector.

# Proposal

## Approach: detect the current PCIe device status from sysfs

Each device has a directory like `/sys/devices/pci0000:00/0000:00:01.3/0000:09:00.0`. It contains useful information files like:

- max_link_speed - `8.0 GT/s PCIe`
- max_link_width - `4`
- current_link_speed - `8.0 GT/s PCIe`
- current_link_width - `4`
- class - `0x010802`
- vendor - `0x144d`
- subsystem_vendor - `0x144d`
- subsystem_device - `0xa801`
- device - `0xa809`

Also, the path to the directory indicates:

- segment (0000)
- parent bus (00:01.3)
- device bus (09:00.0)

These should be included in the metrics so that PCI bus speed degradation can be checked hierarchically (i.e. check the device bus speed, then check the PCIe switch speed).

## Current status

I've implemented a PoC collector and exporter for procfs and node_exporter.

PR (procfs): https://github.com/prometheus/procfs/pull/728
PR (node_exporter): https://github.com/prometheus/node_exporter/pull/3339

This is an example of exported metrics.

```
# HELP node_pcidevice_info Non-numeric data from /sys/bus/pci/devices/<location>, value is always 1.
# TYPE node_pcidevice_info gauge
node_pcidevice_info{bus="00",class_id="0x60000",device="00",device_id="0x1630",function="0",parent_bus="*",parent_device="*",parent_function="*",parent_segment="*",segment="0000",subsystem_device_id="0x5095",subsystem_vendor_id="0x17aa",vendor_id="0x1022"} 1
node_pcidevice_info{bus="01",class_id="0x10802",device="00",device_id="0x540a",function="0",parent_bus="00",parent_device="02",parent_function="1",parent_segment="0000",segment="0000",subsystem_device_id="0x5021",subsystem_vendor_id="0xc0a9",vendor_id="0xc0a9"} 1
# HELP node_pcidevice_max_link_speed Value of maximum link speed (GT/s)
# TYPE node_pcidevice_max_link_speed gauge
node_pcidevice_max_link_speed{bus="00",device="02",function="1",segment="0000"} 8
node_pcidevice_max_link_speed{bus="00",device="02",function="2",segment="0000"} 8
# HELP node_pcidevice_current_link_speed Value of current link speed (GT/s)
# TYPE node_pcidevice_current_link_speed gauge
node_pcidevice_current_link_speed{bus="00",device="02",function="1",segment="0000"} 8
node_pcidevice_current_link_speed{bus="00",device="02",function="2",segment="0000"} 2.5
# HELP node_pcidevice_max_link_width Value of maximum link width (number of lanes)
# TYPE node_pcidevice_max_link_width gauge
node_pcidevice_max_link_width{bus="00",device="02",function="1",segment="0000"} 8
node_pcidevice_max_link_width{bus="00",device="02",function="2",segment="0000"} 1
# HELP node_pcidevice_current_link_width Value of current link width (number of lanes)
# TYPE node_pcidevice_current_link_width gauge
node_pcidevice_current_link_width{bus="00",device="02",function="1",segment="0000"} 4
node_pcidevice_current_link_width{bus="00",device="02",function="2",segment="0000"} 1
```

I'm looking forward to any feedback or suggestions to make this better!

Thanks,
Naoki MATSUMOTO
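
For illustration only, the sysfs layout described under "Approach" could be parsed roughly as in the Go sketch below. It splits the last two path elements into segment, bus, device and function, and extracts the numeric GT/s value from a link speed string. The names `pciLocation`, `parseLocation`, `parseLinkSpeed` and `deviceAndParent` are hypothetical and are not taken from the linked PRs; the sketch also assumes the path has already been resolved to its `/sys/devices/...` form.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
	"strings"
)

// pciLocation is a hypothetical type for this sketch: one PCI address
// split into the parts used as metric labels.
type pciLocation struct {
	Segment, Bus, Device, Function string
}

// parseLocation splits an address such as "0000:09:00.0" into
// segment, bus, device and function.
func parseLocation(addr string) (pciLocation, error) {
	parts := strings.Split(addr, ":")
	if len(parts) != 3 {
		return pciLocation{}, fmt.Errorf("unexpected PCI address %q", addr)
	}
	dev, fn, ok := strings.Cut(parts[2], ".")
	if !ok {
		return pciLocation{}, fmt.Errorf("unexpected PCI address %q", addr)
	}
	return pciLocation{Segment: parts[0], Bus: parts[1], Device: dev, Function: fn}, nil
}

// parseLinkSpeed extracts the numeric GT/s value from a string such as
// "8.0 GT/s PCIe". Any value whose first field is not a number is an error.
func parseLinkSpeed(s string) (float64, error) {
	fields := strings.Fields(s)
	if len(fields) == 0 {
		return 0, fmt.Errorf("empty link speed")
	}
	return strconv.ParseFloat(fields[0], 64)
}

// deviceAndParent reads the device address from the last path element and
// the parent address from the one before it. hasParent is false when the
// parent element is not a PCI address (for example "pci0000:00").
func deviceAndParent(p string) (dev, parent pciLocation, hasParent bool, err error) {
	p = path.Clean(p)
	dev, err = parseLocation(path.Base(p))
	if err != nil {
		return pciLocation{}, pciLocation{}, false, err
	}
	parent, perr := parseLocation(path.Base(path.Dir(p)))
	return dev, parent, perr == nil, nil
}

func main() {
	dev, parent, hasParent, err := deviceAndParent("/sys/devices/pci0000:00/0000:00:01.3/0000:09:00.0")
	if err != nil {
		panic(err)
	}
	speed, err := parseLinkSpeed("8.0 GT/s PCIe")
	if err != nil {
		panic(err)
	}
	fmt.Println(dev, parent, hasParent, speed)
}
```

A device sitting directly under the root complex directory has no PCI-addressed parent, so `hasParent` is false there; in the example metrics above that case shows up as `*` in the parent labels.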