On 6 May 2018 at 04:20, Eliot Miranda <[email protected]> wrote:
> Hi Ben,
>
> On May 5, 2018, at 7:25 AM, Ben Coman <[email protected]> wrote:
>
> On 5 May 2018 at 22:10, Ben Coman <[email protected]> wrote:
>>
>> One of the key parts of Delay scheduling is setting the resumption time.
>> There are two places this could be done.
>> a. In Delay>>#schedule, running (for example) at userSchedulingPriority=40
>> b. In DelayMicrosecondScheduler>>#handleTimerEvent:, running at
>>    timingPriority=80
>>
>> When we were using the millisecond clock for delay scheduling,
>> it made sense to use (b.) since the clock might(?) roll over
>> between when the resumption time was set and when #handleTimerEvent:
>> expires delays.
>>
>> This should not be a problem now that we are using the microsecond clock,
>> so I wondered what the latency penalty might be between (a.) and (b.).
>> I ran a little experiment that indicates the cost/latency of switching
>> threads, and was curious if anyone can comment on the validity of the
>> experiment and the interpretation of its results.
>>
>> I made the following three changes...
>>
>> DelayScheduler subclass: #DelayMicrosecondScheduler
>>     instanceVariableNames: 'latencyStart countSinceLastZero'
>>     classVariableNames: 'LatencyCounts'
>>     poolDictionaries: ''
>>     category: 'Kernel-Processes'
>>
>>
>> DelayMicrosecondScheduler>>#schedule: aDelay
>>     latencyStart := Time primUTCMicrosecondsClock.  "This is position (a.)"
>>     aDelay schedulerBeingWaitedOn
>>         ifTrue: [^self error: 'This Delay has already been scheduled.'].
>>     accessProtect critical: [
>>         scheduledDelay := aDelay.
>>         timingSemaphore signal.  "#handleTimerEvent: sets scheduledDelay := nil"
>>     ].
>>
>>
>> DelayMicrosecondScheduler>>#handleTimerEvent: microsecondNowTick
>>     | microsecondNextTick |
>>     "Process any schedule requests"  "This is position (b.)"
>>     scheduledDelay ifNotNil: [
>>         | latency |
>>         latency := Time primUTCMicrosecondsClock - latencyStart.
>>         LatencyCounts ifNil: [ LatencyCounts := Bag new ].
>>         LatencyCounts add: latency.
>>         latency = 0
>>             ifTrue: [ countSinceLastZero := 1 + (countSinceLastZero ifNil: [0]) ]
>>             ifFalse: [
>>                 Transcript
>>                     crShow: 'zero latency count ', countSinceLastZero printString;
>>                     show: ' before latency ', latency printString.
>>                 countSinceLastZero := 0 ].
>>         "Schedule the given delay."
>>         scheduledDelay
>>             scheduler: self
>>             resumptionTime: microsecondNowTick + (1000 * scheduledDelay millisecondDelayDuration).
>>         self scheduleDelay: scheduledDelay.
>>         scheduledDelay := nil ].
>>     <snip> rest of method unchanged
>>
>>
>> Then I opened the Transcript and in a Playground evaluated...
>>     Delay delaySchedulerClass: DelayMicrosecondScheduler.
>>
>>
>> The Transcript results are shown below with some comments inserted.
>>
>> Now I guess the latency is affected by garbage collection.
>> But one thing I was curious about is why the latencies were quantised in 1000s.
>>
>> Another interesting thing is that the vast majority of the latencies were zero,
>> which was a nice surprise, but can it be true? Or is it a consequence of
>> the quantisation rounding down?
>>
>> It seems that the idleness of the image affected how often a non-zero
>> latency occurred. After I left the house for a while, the count of zero
>> latencies was very high, but a few non-zero ones still occurred. It would
>> make sense that there was less GC while the image was idle. What is a good
>> snippet of code to stress the GC? I presume the latency might increase.
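(To answer my own question inline: one untested way to stress the GC might be to
fork a low-priority loop that churns lots of short-lived objects while the
experiment runs, along the lines below. This is only a rough sketch; the loop
sizes are arbitrary and I haven't measured its effect yet.)

    "Churn short-lived objects at low priority to provoke frequent GC
     while the Delay latency experiment is running."
    [ 100 timesRepeat: [
        | junk |
        junk := OrderedCollection new.
        1 to: 100000 do: [ :i | junk add: (Array new: 32) ].
        junk := nil ] ] forkAt: Processor userBackgroundPriority.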
>>
>>
>> zero latency count 2273 before latency 1000
>> zero latency count 943 before latency 1000
>> zero latency count 3666 before latency 1000
>> zero latency count 1643 before latency 1000
>> zero latency count 27 before latency 1000
>> "Left house for 20 minutes"
>> zero latency count 12022 before latency 1000
>> zero latency count 15195 before latency 1000
>> zero latency count 41998 before latency 1000
>> "Returned from outing"
>> zero latency count 128 before latency 1000
>> zero latency count 116 before latency 1000
>> zero latency count 555 before latency 1000
>> zero latency count 2377 before latency 1000
>> zero latency count 5423 before latency 1000
>> zero latency count 3178 before latency 1000
>> zero latency count 47 before latency 1000
>> zero latency count 2276 before latency 1000
>> "Left house to go shopping"
>> zero latency count 6708 before latency 3000
>> zero latency count 4896 before latency 1000
>> zero latency count 433 before latency 1000
>> zero latency count 7106 before latency 1000
>> zero latency count 2195 before latency 1000
>> zero latency count 12397 before latency 1000
>> zero latency count 4815 before latency 2000
>> zero latency count 3480 before latency 1000
>> zero latency count 5690 before latency 1000
>> zero latency count 1670 before latency 1000
>> zero latency count 320 before latency 1000
>> zero latency count 1641 before latency 2000
>> zero latency count 290 before latency 1000
>> zero latency count 3835 before latency 1000
>> zero latency count 3818 before latency 1000
>> zero latency count 1132 before latency 1000
>> "Returned from shopping."
>> zero latency count 2045 before latency 1000
>> zero latency count 671 before latency 1000
>> zero latency count 167 before latency 1000
>> zero latency count 147 before latency 1000
>>
>> From the System Browser, inspecting the class variable LatencyCounts showed...
>>     0     171329
>>     3000  1
>>     2000  3
>>     1000  49
>
>
> P.S. after running a few more hours, LatencyCounts shows...
>     0      1102691
>     1000   219
>     2000   12
>     3000   29
>     4000   21
>     5000   5
>     49000  1
>     75000  1
>
>
> It's likely that the vm code is using the cheap microsecond clock which is
> updated by the heartbeat, not the expensive one that interrogates the OS at
> the instant. If not, it's the OS. I'll check soon.

I went looking and found an obvious source of quantisation: the multiplication
by MicrosecondsPerMillisecond in the Win32 currentUTCMicroseconds(), at line 109 here...
https://github.com/OpenSmalltalk/opensmalltalk-vm/blob/e17db79411cfec767e04f3d94d12a642d920a30e/platforms/win32/vm/sqWin32Time.c#L109

Running the same experiment (for 20 mins) on Linux produced a latency
distribution in line with what I was originally expecting...

usec   count
1      135
2      992
3      3250
4      21161
5      17309
6      5341
7      1230
8      313
9      79
10     44
11     46
12     50
13     36
14     58
15     73
16     38
17     12
18     7
19     5
20     1
21     3
23     2
24     4
25     4
26     5
27     4
29     3
30     6
31     5
32     6
33     5
34     2
35     2
36     5
37     4
39     4
40     2
41     2
44     3
46     2
47     1
48     4
49     3
51     2
53     1
54     1
56     2
58     1
59     1
61     1
66     2
70     1
82     1
86     1
184    1
544    1
576    1
1212   1
1402   1
2016   1

self average.                      "4.8 microseconds"
self stdev.                        "12.9 microseconds"
self average + (5 * self stdev).   "69.1 microseconds"

So there are nine outliers outside five sigma.

@Clement, would you expect your incremental GC to have much impact on those
outliers? Or is its target pause ~1000 usecs anyway?

cheers -ben
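P.S. In case it helps anyone reproducing the summary figures: instead of
sending `self average` / `self stdev` in an inspector, an equivalent standalone
expression could flatten the LatencyCounts Bag into a plain collection and use
the same #average and #stdev collection extensions. Untested sketch only:

    | latencies |
    latencies := OrderedCollection new.
    "Expand the Bag so each recorded latency appears once per occurrence."
    LatencyCounts asSet do: [ :usec |
        (LatencyCounts occurrencesOf: usec)
            timesRepeat: [ latencies add: usec ] ].
    latencies average.                            "about 4.8 for the Linux run above"
    latencies stdev.                              "about 12.9"
    latencies average + (5 * latencies stdev).    "about 69.1, the five-sigma cutoff"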
