I suggest you use the Linux scheduler tracing facilities, e.g. perf sched, to diagnose (see https://www.brendangregg.com/perf.html#SchedulerAnalysis).
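For example, a typical sequence (command names are from the page above; exact options can vary with your perf version, so treat this as a sketch):

    # record scheduler events system-wide for 10 seconds
    sudo perf sched record -- sleep 10
    # summarize per-task scheduling delays
    sudo perf sched latency
    # show per-event wait and run times on a timeline
    sudo perf sched timehist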
But you may have to also do this outside the VM - to scheduler-trace the VM itself. In general, timing at 1ms intervals with a 2% error is probably all you can expect running in a VM - especially if you don't know (or manage) the load on the bare metal. 2% of 1ms is 20us, and a raw bare-metal scheduler will probably only get you to 5-6us. You can simplify your analysis by testing on bare metal first.

> On Feb 11, 2022, at 10:05 PM, Rich Moyse <r...@moyse.us> wrote:
>
> TLDR
>
> I'd be grateful if someone could either confirm or dismiss the effect of OS, Go runtime, and/or hypervisor latency on the output of a goroutine trace. The trace report displays empty gaps of 3-14ms where the expected periodic goroutines should execute, since a goroutine is triggered by each 1ms timer tick. Attempts to select items within these gaps return only five "counter" items - nothing else. Those counter items suggest that nothing is running. Perhaps external preemptive behavior interrupts the running Go process, suspending the Go runtime and causing these gaps? If so, can these trace gaps be attributed to OS and/or hypervisor scheduler preemption?
>
> Although a recent (2/4/2022) go-nuts summary mentions a post to the thread "Issues when using time.Ticker microsecond duration" <https://groups.google.com/g/golang-nuts/c/3GUnjdoWjqU/m/YFHL1ZruAwAJ>, whose other posts discuss a range of causes, including OS-induced latency, this thread attempts to understand how external latency sources affect the trace report.
>
> In addition to the problem description below, the Go code and trace screen images are available via this gist <https://gist.github.com/WhisperingChaos/da17b8cb06d66d5d95a5285109b5af93>, while the trace file is available here <https://github.com/WhisperingChaos/traceissue/raw/main/dply/trace/trace.1.18beta1.out>. The Go program was compiled first with 1.18beta1 and later with 1.17. It runs in an Ubuntu 16.04 Desktop VM managed by Hyper-V on a physical machine running Windows 10 Pro. The program limits itself to the interplay between two goroutines: a single sending goroutine and a single receiving one.
>
> These goroutines share an unbuffered channel. The sender is throttled only by the capacity of the receiver to consume its messages, while the receiver is gated by a timer that ticks at 1ms intervals. The sending code executes in ~600ns and then blocks waiting for the receiver. The receiver executes in ~8us, starting after the ticker fires, until it blocks waiting for the timer's next tick. Due to their speedy execution, the Go runtime scheduler never intervenes to preemptively interrupt either goroutine. Moreover, the garbage collector doesn't run, because only this same pair of goroutines is active and the message communicated between them is an empty structure.
>
> In general, the regular rhythm of the receiver's timer appears in the "Goroutine analysis" view of the trace report as evenly spaced markings at nearly 1ms intervals. However, these ruler-like markings are sporadically absent, resulting in gaps of 3-14ms. During these gaps neither the receiving goroutine nor its coupled sender executes. When selecting the rectangular gap area, only "5 items" are reported as selected.
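> For reference, a minimal sketch of the sender/receiver pair described above (illustrative only - it assumes the structure described here; the actual code is in the gist):
>
>     package main
>
>     import "time"
>
>     func main() {
>         ch := make(chan struct{}) // unbuffered: the sender blocks until the receiver is ready
>
>         // Sender: ~600ns of work per message, then blocks on the channel.
>         go func() {
>             for {
>                 ch <- struct{}{}
>             }
>         }()
>
>         // Receiver: gated by a 1ms ticker; ~8us of work per tick in the real program.
>         tick := time.NewTicker(time.Millisecond)
>         defer tick.Stop()
>         for range tick.C {
>             <-ch
>         }
>     }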
> These 5 items, displayed by the "Counter Samples (5)" tab, indicate that no goroutines are "Running", nor are there "Runnable" ones. Here are all the counters reported by the selected gap, appearing at ~502ms into the trace report:
>
>     Counter     Series     Time                Value
>     Goroutines  GCWaiting  502.32961800000004  0
>     Goroutines  Runnable   502.32961800000004  0
>     Goroutines  Running    502.32961800000004  0
>     Threads     InSyscall  501.252301          0
>     Threads     Running    501.252301          1
>
> Since the program was running trace to collect the data needed to produce its report, perhaps the trace itself induced a measure of Go runtime latency? To eliminate this potential latency source, the trace package and its associated HTTP server goroutine were removed. Instead, latency events were detected and reported directly by the receiving goroutine. This goroutine computed the latency between timer ticks by remembering the prior timer value and subtracting it from the newly provided one. It printed, via fmt.Print, a message whenever the computed latency exceeded 5ms. Its source, main_latencycompute.go <https://gist.github.com/WhisperingChaos/da17b8cb06d66d5d95a5285109b5af93#file-main_latencycompute-go>, is provided by the same gist <https://gist.github.com/WhisperingChaos/da17b8cb06d66d5d95a5285109b5af93>.
>
> Before running the newly altered program, I half expected that removing the overhead of collecting trace data would eliminate the latency gaps depicted in its reporting. However, after a bit of waiting, the altered program began issuing sporadic latency messages. Furthermore, opening a new terminal window or moving existing windows with the mouse caused the altered program to generate latency messages with seemingly greater frequency. On noticing this behavior, it seemed likely to me that the empty gaps depicted by the trace report represent latency introduced by the OS scheduler as it preemptively stops and then restarts the Go program.
>
> After experiencing these latency results with main_latencycompute.go (1.18beta1) as described above, it was recompiled with Go 1.17. The 1.17 version also generated latency messages. A comparison between the latency intervals was performed to determine the similarity of their values. The comparison process consisted of the following steps:
>
> - Concurrent execution of the 1.18beta1 and 1.17 versions of main_latencycompute.go for 100 minutes on the same Desktop VM mentioned above. The system was left alone - no end-user interaction while the test ran.
> - A second concurrent execution of the 1.18beta1 and 1.17 versions of main_latencycompute.go for 100 minutes under the same conditions.
> - Aggregation of the latency message values for each version over the combined 200-minute run time.
> - Computation of the percentage difference between the versions using this calculator <https://www.calculatorsoup.com/calculators/algebra/percent-difference-calculator.php> and these values: 1.18beta1: 6707233630ns, 1.17: 6876776367ns.
>
> The percentage difference was ~2.5%, while the difference in total latency between them was ~170ms (over 200 minutes). These figures suggest, at least to me, that there's no significant difference in the detected latency between the two Go compiler versions. Therefore, the latency gaps visible in the trace tool's report aren't specific to 1.18beta1.
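> For comparison, a minimal sketch of that tick-to-tick latency check (illustrative only; the real code is main_latencycompute.go in the gist):
>
>     package main
>
>     import (
>         "fmt"
>         "time"
>     )
>
>     func main() {
>         tick := time.NewTicker(time.Millisecond)
>         defer tick.Stop()
>         prior := <-tick.C
>         for now := range tick.C {
>             // With a 1ms ticker, a tick-to-tick gap well above 1ms means the
>             // goroutine (or its thread) wasn't run on time.
>             if d := now.Sub(prior); d > 5*time.Millisecond {
>                 fmt.Printf("tick-to-tick gap %v exceeded 5ms\n", d)
>             }
>             prior = now
>         }
>     }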
> As an additional observation, there's a discrepancy between the counter values reported by the "View trace" ("/trace") web page and the ones depicted above, rendered by the "Goroutine analysis" trace for a specific goroutine ("/trace?goid=8"). While the specific-goroutine page above displays a "Threads Running" value of 1 between timeline values 502.400-512.200ms, the corresponding "Threads Running" value reported by "/trace" is 0. Since the raw data produced by tracing aligns with the "View trace" report, and some filtering/summation of this data is performed to produce the report for a specific goroutine, could the specific-goroutine report be incorrectly displaying a running thread when in reality there isn't one?
>
>     Counter     Series     Time                Value
>     Goroutines  GCWaiting  502.32961800000004  0
>     Goroutines  Runnable   502.32961800000004  0
>     Goroutines  Running    502.32961800000004  0
>     Threads     InSyscall  502.33236           0
>     Threads     Running    502.33236           0
>
> If it can be confirmed that the stated trace data values above definitively delineate externally induced latency, it would greatly benefit developers to update the trace report to label the latency gaps as external to the executing program.