Hi Peter,

Gentle ping on this. I wanted to check if the assembly analysis in my previous reply changed the picture at all.

You were right that the commit message was misleading about the total size increase: it's 9 bytes per call site, not just the NOP.

That said, when I looked at the executed path with the tracepoint disabled, the only addition is the 2-byte NOP (xchg %ax,%ax). Both the baseline and instrumented _raw_spin_unlock() fit within a single 64-byte cache line, and I wasn't able to measure any difference with locktorture: lock() cost completely dominates, unlock() accounts for less than 1% of the total, so any overhead is indistinguishable from noise.

If the cost is still a concern, I see two possible paths forward:

1. Guard the spinlock/qrwlock instrumentation behind a Kconfig option (disabled by default), so only kernels that explicitly opt in pay the cost.

2. Drop the spinlock/qrwlock instrumentation entirely and keep contended_release for sleepable locks only.

Happy to go whichever direction you prefer.
