Thanks for the detailed replies. I updated the Racket code and
Makefile to use gcstats by default instead of debug@GC. In my tests,
this does not change worst-case latencies (I did not expect it to, and
see no reason it would), but the generated summary is nice and easier
to parse for non-experts than the debug@GC output. The old
instrumentation is still available through run-racket-instrumented and
analyze-racket-instrumented. On my machine I get a reliable 145ms
worst-case latency on the updated benchmark -- with reduced memory
load thanks to the bytestring suggestion.

I also experimented with the case where the message (byte)strings have
length 1 instead. In this case, I'm able to get a good compromise,
building on Matthew's observation that the ramp-up is difficult for
the incremental GC: if I perform explicit collections only during the
ramp-up period (when (and (i . < . window-size) (zero? ...))), I get
short pause times (30ms) with little cost in terms of throughput --
especially if I enlarge the total msg-count, which I think is fine.
However, I was unable to reproduce these ramp-up-only improvements
with message length 1024.
(Another thing I found is that the worst-case latency is very
sensitive to the frequency of minor GCs during the ramp-up. I got
nice results with modulo 100, but worse results with modulo 50 or
even modulo 70.)
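To make the trick concrete, here is a self-contained sketch of that ramp-up-only collection strategy. The parameter values and the `final` name are placeholders for illustration, not the benchmark's actual settings; the real loop body does more than this.

```racket
#lang racket
;; Placeholder parameters, much smaller than the real benchmark's.
(define window-size 1000)
(define msg-count 5000)
(define message-size 1) ; the length-1 message case discussed above

(define final
  (for/fold ([h (hash)]) ([i (in-range msg-count)])
    ;; Request explicit incremental + minor collections only while
    ;; the window is filling up (the ramp-up period).
    (when (and (i . < . window-size)
               (zero? (modulo i 100)))
      (collect-garbage 'incremental)
      (collect-garbage 'minor))
    ;; Insert the new message; once the window is full, also evict
    ;; the oldest entry so the table stays at window-size entries.
    (if (i . >= . window-size)
        (hash-remove (hash-set h i (make-bytes message-size))
                     (- i window-size))
        (hash-set h i (make-bytes message-size)))))
(printf "final table size: ~a\n" (hash-count final))
```

After the loop, the table holds exactly window-size entries, since each steady-state iteration adds one key and removes one.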

Matthew: I would like to eventually write a blog post reporting on my
findings for OCaml and Racket, and of course it would be nice if
Racket looked as good as reasonably possible¹. If you were willing to
make publicly available a patchset with your GC tuning on top of
Racket's current master branch (or whatever; or apply them to master
directly, of course), I would be glad to report the numbers using
them.

Hoping to improve on the ramp-up issue, I tried to build the
window-sized chan as an association list, build the immutable hash
table with (make-immutable-hash window), perform a major GC after
that, and only then start the steady-state loop that inserts into and
removes from the hash table. This did not improve things; in
retrospect, it is natural that it makes no difference to the GC
whether the ramp-up allocations are stored in an alist rather than in
an immutable table.
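For reference, a minimal sketch of that pre-built-table variant. Again the sizes and names here (final, message-size) are illustrative placeholders rather than the benchmark's actual code.

```racket
#lang racket
;; Placeholder parameters for illustration only.
(define window-size 1000)
(define msg-count 10000)
(define message-size 1024)

;; Build the first window as an alist of (index . message) pairs,
;; then convert it to an immutable hash table in one step.
(define window
  (for/list ([i (in-range window-size)])
    (cons i (make-bytes message-size))))
(define table (make-immutable-hash window))

;; Explicit major collection before entering the steady state.
(collect-garbage)

;; Steady-state loop: insert the new message, evict the oldest.
(define final
  (for/fold ([h table]) ([i (in-range window-size msg-count)])
    (hash-remove (hash-set h i (make-bytes message-size))
                 (- i window-size))))
(printf "final table size: ~a\n" (hash-count final))
```

As noted above, this produces the same allocation pattern during ramp-up as the original loop, just stored in cons cells instead of hash-table nodes, so the GC behavior ends up the same.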

¹: The criticism of GHC from that benchmark is not that a given piece
of benchmarking code has too-high latencies, but that those latencies
are impossible to optimize away, even with expert tuning. In that
spirit, I'm fine with tweaking the Racket benchmark to do essentially
the same computation, even if it has to do non-trivial things like
calling the GC explicitly. What we are modelling is whether it is
possible to reduce latencies, with enough effort, in a given language.
Of course, large reductions in throughput may not be acceptable, so I
would probably report (as Matthew did above) on worst-case latency
"with reasonable throughput" first, then on potential improvements
that sacrifice throughput.

On Sun, May 15, 2016 at 10:05 PM, Matthew Flatt <[email protected]> wrote:
> At Sat, 14 May 2016 22:08:32 -0400, Gabriel Scherer wrote:
>> Short version:
>>   Racket has relatively poor worst-case GC pause times on a specific
>>   benchmark:
>>     https://gitlab.com/gasche/gc-latency-experiment/blob/master/main.rkt
>>
>>   1) Is this expected? What is the GC algorithm for Racket's old generation?
>
> Racket mostly uses mark-and-sweep for the old generation. It sometimes
> compacts, but not usually. (I tried disabling compaction in the GC, and
> it doesn't seem to affect the benchmark.)
>
> Although Racket's incremental GC mode can help some, as you noted, it
> doesn't help out-of-the-box for the benchmark.
>
>>   2) Can you make it better?
>
> Only a little, in the short term.
>
>
>> Long version:
>
> Beware that Racket strings are Unicode strings stored as UCS-4. So, I
> think the benchmark makes Racket allocate 4 times as much as the OCaml
> and Haskell variants.
>
> From a GC perspective, that maybe shouldn't matter too much, because
> the data is mostly atomic and not traversed. Still, it reduces
> locality. Making matter worse, the GC in Racket v6.5 is more eager than
> intended to return pages to the OS (and the size of those pages is
> proportional to the size of the message); making the GC less eager to
> return pages cuts the peak GC time from 225ms to 160ms on my OS X
> machine. The effect is likely OS-specific.
>
> Changing the benchmark to use a byte string reduces the peak GC time to
> 140ms. The 20ms saved reflects some general locality effects, but maybe
> also some remaining weakness in the GC's handling of newly unused
> pages.
>
>
> A side issue is that inserting consecutive numbers into an immutable
> hash table is a bad case for Racket's implementation in v6.5. (I've
> been meaning to look at this for a while.) It ends up creating
> intermediate nodes that are larger than average, which again reduces
> locality. After adjusting the implementation to avoid trouble with
> sequential hash codes, the peak GC time ends up around 100ms on my
> machine.
>
>
> Meanwhile, incremental GC doesn't help because it doesn't adapt well to
> ramping up from stand-still. The current, first cut at incremental mode
> is mostly useful in a steady state (typical for a video game). If I
> increase the message count by a factor of 10, incremental GC mode
> starts helping a little:
>
>  % env PLT_INCREMENTAL_GC=y racket -W "debug@GC" main.rkt | & grep MAJ
>  GC: 0:MAJ @ 51,218K(+37,101K)...; free 3,087K(-3,087K) 17ms @ 331
>  GC: 0:MAJ @ 107,847K(+29,624K)...; free 9,304K(-19,560K) 42ms @ 483
>  GC: 0:MAJ @ 209,809K(+38,814K)...; free 6,715K(-22,635K) 56ms @ 781
>  GC: 0:MAJ @ 416,454K(+44,697K)...; free 9,512K(-6,920K) 86ms @ 1477
>  GC: 0:MAJ @ 824,407K(+76,520K)...; free 182,074K(-182,074K) 113ms @ 2888
>  GC: 0:MAJ @ 1,294,379K(+98,068K)...; free 417,187K(-417,187K) 114ms @ 5066
>  GC: 0:MAJ @ 1,763,263K(+120,704K)...; free 647,816K(-647,816K) 64ms @ 8067
>  GC: 0:MAJ @ 2,028,264K(+117,847K)...; free 885,256K(-901,640K) 43ms @ 10918
>  GC: 0:MAJ @ 2,035,484K(+127,011K)...; free 912,992K(-912,992K) 47ms @ 13779
>  GC: 0:MAJ @ 2,026,836K(+135,659K)...; free 890,614K(-890,614K) 41ms @ 16503
>  ....
>
> A program can request incremental mode and encourage more incremental
> work through the `collect-garbage` function. Adding
>
>    (when (zero? (modulo i 100))
>      (collect-garbage 'incremental)
>      (collect-garbage 'minor))
>
> to the `for/fold` loop brings peak GC time down to around 30ms (also
> using the GC and HAMT improvements), but the program takes twice as
> long to run. Using 50 instead of 100 in the `modulo` brings the peak
> time to 20ms, but then the program takes three times as long to run.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Developers" group.