[+cc Ilpo, Jonathan] On Thu, Jul 17, 2025 at 11:28:26AM -0600, Matthew W Carlis wrote: > A bit late to the discussion here.. Looks like "too late" in fact, > but I wanted to just make some comments.
Not too late, thanks for your thoughts! When I apply things, I consider them a draft with intention to go upstream, but not immutable. If it makes sense to revise or postpone, we can still do that. > On Tue, 12 May 2025, Shuai Xue wrote: > > Hotplug events are critical indicators for analyzing hardware > > health, > > In terms of a "hot plug" event I'm not actually sure what that > means. I mean to say that the spec has some room for different > implementations. I think sometimes that means a presence detect > state change event, but a system is not required to implement a > presence pin (at least not for the Slot Status presence). Some > vendors support an "inband" presence which is when the LTSSM > essentially asserts presence if the link is active and deasserts it > when the link is down. > > Appendix I in the newer PCIe specs say to use data link layer state > change event if presence is not implemented. It looks like this > tracepoint would still work, but its just something to keep in mind. > At the risk of including too much information I could see it also > being useful to put the device/vendor IDs of the DSP and the EP into > the trace event for link up. Perhaps even the link speed/width cap > for DSP/EP. The real challenge with tracking a fleet is getting all > the things you care about into one place. > > On Tue, 20 May 2025, Lukas Wunner wrote: > > Link speed changes and device plug/unplug events are orthogonal > > I guess what I wanted to get at here were some of the discussion > from Lukas & Ilpo. I think it makes sense to separate presence > events from link events, but I think it would make sense to have a > "link tracepoint" which reports previous and new speed. One of those > speeds being DOWN/DISABLED etc. Width could be in there as well. I > have seen many times now an engineer become confused about checking > speed because "Current Link Speed" & "Negotiated Link Width" are > "undefined" when "Data Link Layer Active" bit is unset. Ideally a > solution here would be immediately clear to the user. > > When it comes to tracking things across a "fleet" having the slot > number of the device is extremely useful. We have an internal > specification for our slot number assignments that allows us to > track meaning across different generations of hardware or different > architectures. The BDF is often changing between generations, but > the meaning of the slot is not. All the tracepoints here already include: - pci_name() (the bus/device/function) - slot_name() (which I think comes from make_slot_name(); would you want something else?) and IIUC, it would be helpful for you to add: - DSP Vendor/Device ID (the Root Port or Switch Downstream Port, which is relatively static, so seems less useful to me than the USP/EP would be) - USP/EP Vendor/Device ID And you would consider adding a new format for "Link Up" that would include the above plus current link speed/width? I expect we will likely see new tracepoints similar to "Link Up" for link speed/width changes done by bwctrl, and this would definitely make sense for those. As a consumer of tracepoints, do you have an opinion on the event string? I wonder if spaces in the strings complicate searching and scripting? I don't think tracepoints necessarily need to match text in dmesg exactly because I suspect they're mostly processed mechanically. But I'm not a tracepoint user myself (yet), and about 20% of existing tracepoints already include spaces, so maybe it's not a concern. Bjorn