I found a new approach. By emitting histograms dynamically according to gaps in the address space, we can skip encoding long regions of mostly-zero bins.
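To illustrate the idea, here is a minimal Python sketch of splitting sampled program-counter values into separate histograms wherever the address space has a large gap. This is not the code from the patch: the names build_histograms, max_gap and bin_size, the dictionary layout, and the 4096-byte gap threshold are all assumptions made for this example.

```python
from collections import Counter


def _make_hist(run, bins, bin_size):
    """Turn a run of nearby bin indices into one dense histogram record."""
    first, last = run[0], run[-1]
    return {
        "low_pc": first * bin_size,
        "high_pc": (last + 1) * bin_size,  # exclusive upper bound
        "counts": [bins.get(b, 0) for b in range(first, last + 1)],
    }


def build_histograms(samples, max_gap=4096, bin_size=2):
    """Group PC samples into histograms, starting a new one at each large gap.

    max_gap and bin_size are hypothetical parameters for this sketch.
    """
    if not samples:
        return []
    # Count hits per bin; a bin covers bin_size bytes of address space.
    bins = Counter(pc // bin_size for pc in samples)
    histograms = []
    run = []
    for b in sorted(bins):
        # Close the current histogram when the next occupied bin is far away,
        # so the empty region in between is never encoded.
        if run and (b - run[-1]) * bin_size > max_gap:
            histograms.append(_make_hist(run, bins, bin_size))
            run = []
        run.append(b)
    histograms.append(_make_hist(run, bins, bin_size))
    return histograms


if __name__ == "__main__":
    pcs = [0x1000, 0x1000, 0x1002, 0x1008, 0x40000000, 0x40000001]
    for h in build_histograms(pcs):
        print(hex(h["low_pc"]), hex(h["high_pc"]), h["counts"])
```

With samples in two distant regions, the sketch produces two small histograms instead of one histogram spanning the whole range, which is where the file-size saving comes from.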
https://review.openocd.org/c/openocd/+/8739/ (plus the others)

Compared to increasing the single-histogram bucket count to fit the target address space, this approach generates much smaller gmon files for many systems. In some cases it may generate larger files than the previous 128KBucket encoder, but it solves the problem where systems with a sparse address space could end up with histogram bins larger than functions. On ESP32-S3, it was common to see histogram bins larger than 200 bytes when using only a few of the memory interfaces.

For compatibility with existing gprof builds (before a future binutils 2.45 build), we round each histogram bin to 2 bytes. So the result is not quite instruction-accurate on x86 and Xtensa, but it is within 1 instruction. CPUs with instruction sizes divisible by two are instruction-accurate.

-Richard