I suggest you use the Linux scheduler tracing facilities, e.g. perf sched, to 
diagnose this (see https://www.brendangregg.com/perf.html#SchedulerAnalysis).

But you may also have to do this outside the VM, i.e. trace the scheduling of the 
VM itself on the host.

In general, timing at 1ms intervals with a 2% error is probably all you can 
expect when running in a VM - especially if you don’t know (or can’t manage) the 
load on the bare metal. 2% of 1ms is 20 us, and a scheduler on raw bare metal 
will probably only get you to 5-6 us.

You can simplify your analysis by testing on bare metal first.



> On Feb 11, 2022, at 10:05 PM, Rich Moyse <r...@moyse.us> wrote:
> 
> TLDR
> I’d be grateful if someone could either confirm or dismiss the effect of OS, 
> go runtime, and/or Hypervisor latency on the output of a goroutine trace.  
> The trace report displays empty gaps of 3-14ms where periodic goroutine 
> activity should appear, since a goroutine is triggered by each 1ms timer 
> tick.  Attempts to select items within these gaps return only five “counter” 
> items - nothing else.  Those counter items suggest that nothing is running.  
> Perhaps external preemptive behavior interrupts the running go process, 
> thereby suspending the go runtime and causing these gaps?  If so, can these 
> trace gaps be attributed to OS and/or Hypervisor scheduler preemption?
> 
> 
> A recent (2/4/2022) go-nuts summary mentions a post in the thread Issues when 
> using time.Ticker microsecond duration 
> <https://groups.google.com/g/golang-nuts/c/3GUnjdoWjqU/m/YFHL1ZruAwAJ>, whose 
> posts discuss a range of causes, including OS-induced latency; this thread, 
> however, attempts to understand how external latency sources affect the 
> trace report.
> 
> In addition to the problem description below, the go code and trace screen 
> images are available via this gist 
> <https://gist.github.com/WhisperingChaos/da17b8cb06d66d5d95a5285109b5af93> 
> while the trace file is available here 
> <https://github.com/WhisperingChaos/traceissue/raw/main/dply/trace/trace.1.18beta1.out>.
>   The go program was compiled first using 1.18beta1 and later with 1.17.  
> It’s running in an Ubuntu 16.04 Desktop VM that’s managed by Hyper-V on a 
> physical machine running Windows 10 Pro.  The go program limits itself to the 
> interplay between two goroutines: a single sending goroutine and a single 
> receiving one.
> 
> These goroutines share an unbuffered channel.  The sender is throttled only 
> by the capacity of the receiver to consume its messages, while the receiver 
> is gated by a timer that ticks at 1ms intervals.  The sending code executes 
> in ~600ns, then blocks waiting for the receiver.  The receiver executes in 
> ~8us, starting when the ticker fires and running until it blocks waiting for 
> the timer’s next tick.  Due to their speedy execution, the go runtime 
> scheduler never intervenes to preemptively interrupt either goroutine.  
> Moreover, the garbage collector doesn’t run because only this same pair of 
> goroutines is active and the message communicated between them is an empty 
> structure.
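> 
> In outline, the interplay looks roughly like the following (a simplified 
> sketch for illustration only - the actual source is in the gist linked 
> above):
> 
>     package main
> 
>     import "time"
> 
>     func main() {
>         ch := make(chan struct{}) // unbuffered channel shared by the pair
> 
>         // Sender: completes its work in well under a microsecond, then
>         // blocks until the receiver is ready to consume the next message.
>         go func() {
>             for {
>                 ch <- struct{}{}
>             }
>         }()
> 
>         // Receiver: gated by a 1ms ticker; consumes one message per tick.
>         tick := time.NewTicker(time.Millisecond)
>         defer tick.Stop()
>         for range tick.C {
>             <-ch
>         }
>     }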
> 
> In general the regular rhythm of the receiver’s timer appears in the 
> “Goroutine analysis” view of the trace report as evenly spaced markings at 
> nearly 1ms intervals.  However, these ruler-like markings are sporadically 
> absent, resulting in gaps of 3-14ms.  During these gaps neither the receiving 
> goroutine nor its coupled sender executes.  When selecting the rectangular 
> gap area, only “5 items” are reported as selected.  These 5 items, displayed 
> by the “Counter Samples (5)” tab, indicate that no goroutines are “Running”, 
> nor are there “Runnable” ones.  Here are all the counters reported for the 
> selected gap, which appears at ~502ms into the trace report:
> 
>   Counter      Series      Time                 Value
>   Goroutines   GCWaiting   502.32961800000004   0
>   Goroutines   Runnable    502.32961800000004   0
>   Goroutines   Running     502.32961800000004   0
>   Threads      InSyscall   501.252301           0
>   Threads      Running     501.252301           1
> 
> 
> Since the program was running trace in order to collect the data needed to 
> produce its report, perhaps tracing itself induced a measure of go runtime 
> latency?  To eliminate this potential latency source, the trace package and 
> its associated http server goroutine were removed.  Instead, latency events 
> were detected and reported directly by the receiving goroutine.  This 
> goroutine computed the latency between timer ticks by remembering the prior 
> timer value and subtracting it from the newly provided one.  It printed, via 
> fmt.Print, a message whenever the computed latency exceeded 5ms.  Its source, 
> main_latencycompute.go 
> <https://gist.github.com/WhisperingChaos/da17b8cb06d66d5d95a5285109b5af93#file-main_latencycompute-go>, 
> is provided in the same gist 
> <https://gist.github.com/WhisperingChaos/da17b8cb06d66d5d95a5285109b5af93>.
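> 
> The essence of the detection logic is roughly the following (again a 
> simplified sketch, not the exact code of main_latencycompute.go):
> 
>     package main
> 
>     import (
>         "fmt"
>         "time"
>     )
> 
>     func main() {
>         // Remember the prior tick time, subtract it from the newly provided
>         // one, and print a message whenever the gap exceeds 5ms.
>         tick := time.NewTicker(time.Millisecond)
>         defer tick.Stop()
> 
>         prev := <-tick.C
>         for now := range tick.C {
>             if latency := now.Sub(prev); latency > 5*time.Millisecond {
>                 fmt.Println("tick latency exceeded 5ms:", latency)
>             }
>             prev = now
>         }
>     }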
> 
> Before running the newly altered program, I half expected that removing the 
> overhead of collecting trace data would eliminate the latency gaps depicted 
> in its reporting.  However, after a bit of waiting, the altered program began 
> issuing sporadic latency messages.  Furthermore, opening a new terminal 
> window or moving existing windows with the mouse caused the altered program 
> to generate latency messages more frequently.  After noticing this behavior, 
> it seemed likely to me that the empty gaps depicted by the trace report 
> represent latency introduced by the OS scheduler as it preemptively stops and 
> then restarts the go program.
> 
> After observing these latency results with the 1.18beta1 build of 
> main_latencycompute.go described above, the program was recompiled with the 
> go 1.17 compiler.  The 1.17 version also generated latency messages.  The 
> latency intervals produced by the two versions were then compared to 
> determine how similar their values were.  The comparison process consisted of 
> the following steps:
> 
> 1. The 1.18beta1 and 1.17 versions of main_latencycompute.go were run 
> concurrently for 100 minutes on the same Desktop VM mentioned above.  The 
> Desktop system was left alone - no end-user interaction while the test ran.
> 
> 2. A second concurrent execution of the 1.18beta1 and 1.17 versions of 
> main_latencycompute.go was run for 100 minutes under the same conditions.
> 
> 3. The latency message values were aggregated for each version over the 
> combined 200 minute run time.
> 
> 4. The percentage difference between the versions was computed using this 
> calculator 
> <https://www.calculatorsoup.com/calculators/algebra/percent-difference-calculator.php> 
> and these values: 1.18beta1: 6707233630ns, 1.17: 6876776367ns.
> 
> The percentage difference was ~2.5%, while the difference in total latency 
> between them was ~170ms (over 200min).  These figures suggest, at least to 
> me, that there’s no significant difference in the detected latency between 
> the two go compiler versions.  Therefore, the latency gaps visible in the 
> trace tool report aren’t specific to 1.18beta1.
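> 
> As a sanity check on those figures (assuming the usual percent-difference 
> definition: absolute difference divided by the mean of the two values), the 
> arithmetic can be reproduced with a few lines of go:
> 
>     package main
> 
>     import (
>         "fmt"
>         "math"
>     )
> 
>     func main() {
>         // Total latency reported over the combined 200 minutes, in nanoseconds.
>         a := 6707233630.0 // 1.18beta1
>         b := 6876776367.0 // 1.17
> 
>         diff := math.Abs(a - b) // absolute difference, in ns
>         mean := (a + b) / 2
> 
>         fmt.Printf("difference: %.0f ns (~%.0f ms)\n", diff, diff/1e6)
>         fmt.Printf("percent difference: %.2f%%\n", diff/mean*100)
>     }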
> 
> As an additional observation, there’s a discrepancy between the counter 
> values reported by the “View trace” (“/trace”) web page and those depicted 
> above, rendered by the “Goroutine analysis” view for a specific goroutine 
> (“/trace?goid=8”).  While the specific goroutine page above displays a 
> “Threads Running” value of “1” between timeline values 502.400-512.200ms, the 
> corresponding “Threads Running” value reported by “/trace” is 0.  Since the 
> raw data produced by tracing aligns with the “View trace” report, and some 
> filtering/summation of this data is performed to produce the report for a 
> specific goroutine, perhaps the specific-goroutine report incorrectly 
> displays a running thread when in reality there isn’t one?
> 
>   Counter      Series      Time                 Value
>   Goroutines   GCWaiting   502.32961800000004   0
>   Goroutines   Runnable    502.32961800000004   0
>   Goroutines   Running     502.32961800000004   0
>   Threads      InSyscall   502.33236            0
>   Threads      Running     502.33236            0
> 
> If it can be confirmed that the trace data values shown above definitively 
> indicate externally induced latency, it would greatly benefit developers to 
> update the trace report to label such latency gaps as external to the 
> executing program.
> 
> 
