Thanks for the detailed replies. I updated the Racket code and Makefile to use gcstats by default instead of debug@GC. In my tests, this does not change worst-case latencies (I did not expect it to, and see no reason it would), but the generated summary is nice and easier to parse for non-experts than the debug@GC output. The old instrumentation is still available with run-racket-instrumented and analyze-racket-instrumented. On my machine I get a reliable 145ms worst-case latency on the updated benchmark -- with reduced memory load thanks to the bytestring suggestion.
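For reference, here is roughly how the two instrumentation modes are invoked (a sketch: the gcstats invocation follows that package's README and assumes it was installed with `raco pkg install gcstats`; `main.rkt` stands for the benchmark file, and the Makefile target names are the ones mentioned above):

```shell
# New default: per-run GC summary via the gcstats package
racket -l gcstats -t main.rkt

# Old instrumentation, still reachable through run-racket-instrumented:
# raw GC log on stderr, one line per collection
racket -W "debug@GC" main.rkt
```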
I experimented with the case where the message (byte)strings have length 1 instead. In this case I can get a good compromise, building on Matthew's observation that the rampup is difficult for the incremental GC: if I perform explicit collections only during the rampup period (when (and (i . < . window-size) (zero? ...))), I get short pause times (30ms) at little cost in throughput -- especially if I enlarge the total msg-count, which I think is fine. However, I was unable to reproduce these rampup-only improvements with length-1024 messages. (Another thing I found is that the worst-case latency is very sensitive to the chosen frequency of minor GCs during the rampup: I got nice results with modulo 100, but worse results with modulo 50 or even modulo 70.)

Matthew: I would eventually like to write a blog post reporting my findings for OCaml and Racket, and of course it would be nice if Racket looked as good as reasonably possible¹. If you were willing to make publicly available a patchset with your GC tuning on top of Racket's current master branch (or whatever; or to apply it to master directly, of course), I would be glad to report the numbers using it.

Hoping to improve on the ramp-up issue, I tried building the window-sized chan as an association list, building the immutable hashtable with (make-immutable-hash window), performing a major GC after that, and only then starting the steady-state loop that inserts into and removes from the hashtable. This did not improve things; in hindsight, it is natural that it makes no difference to the GC whether the ramp-up allocations are stored in an alist rather than in an immutable table.

¹: The criticism of GHC on that benchmark is not that a given piece of benchmark code has too-high latencies, but that these latencies are impossible to optimize away, even with expert tuning.
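Concretely, the rampup-only variant adds something like the following inside the benchmark's main loop (a sketch: `i` and `window-size` are the loop variable and window bound from my benchmark, and the modulo constant 100 is the sensitive parameter mentioned above):

```racket
;; Inside the for/fold loop over message indices i:
;; force cheap minor collections only while the window is still
;; filling up, i.e. during the rampup phase that trips up the
;; incremental GC; do nothing once in the steady state.
(when (and (i . < . window-size)      ; still in the rampup period
           (zero? (modulo i 100)))    ; every 100th iteration (sensitive!)
  (collect-garbage 'minor))
```

With length-1 messages this gives me the 30ms pauses reported above; with 1024-byte messages the same trick did not reproduce the improvement.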
In that spirit, I'm fine with tweaking the Racket benchmark to do essentially the same computation, even if it has to do non-trivial things like calling the GC explicitly. What we are modelling is whether it is possible to reduce latencies, with enough effort, in a given language. Of course, large reductions in throughput may not be acceptable, so I would probably report (as Matthew did above) on worst-case latency "with reasonable throughput" first, then on potential improvements that sacrifice throughput.

On Sun, May 15, 2016 at 10:05 PM, Matthew Flatt <[email protected]> wrote:
> At Sat, 14 May 2016 22:08:32 -0400, Gabriel Scherer wrote:
>> Short version:
>> Racket has relatively poor worst-case GC pause times on a specific
>> benchmark:
>> https://gitlab.com/gasche/gc-latency-experiment/blob/master/main.rkt
>>
>> 1) Is this expected? What is the GC algorithm for Racket's old generation?
>
> Racket mostly uses mark-and-sweep for the old generation. It sometimes
> compacts, but not usually. (I tried disabling compaction in the GC, and
> it doesn't seem to affect the benchmark.)
>
> Although Racket's incremental GC mode can help some, as you noted, it
> doesn't help out-of-the-box for the benchmark.
>
>> 2) Can you make it better?
>
> Only a little, in the short term.
>
>> Long version:
>
> Beware that Racket strings are Unicode strings stored as UCS-4. So, I
> think the benchmark makes Racket allocate 4 times as much as the OCaml
> and Haskell variants.
>
> From a GC perspective, that maybe shouldn't matter too much, because
> the data is mostly atomic and not traversed. Still, it reduces
> locality. Making matters worse, the GC in Racket v6.5 is more eager than
> intended to return pages to the OS (and the size of those pages is
> proportional to the size of the message); making the GC less eager to
> return pages cuts the peak GC time from 225ms to 160ms on my OS X
> machine. The effect is likely OS-specific.
>
> Changing the benchmark to use a byte string reduces the peak GC time to
> 140ms. The 20ms saved reflects some general locality effects, but maybe
> also some remaining weakness in the GC's handling of newly unused
> pages.
>
> A side issue is that inserting consecutive numbers into an immutable
> hash table is a bad case for Racket's implementation in v6.5. (I've
> been meaning to look at this for a while.) It ends up creating
> intermediate nodes that are larger than average, which again reduces
> locality. After adjusting the implementation to avoid trouble with
> sequential hash codes, the peak GC time ends up around 100ms on my
> machine.
>
> Meanwhile, incremental GC doesn't help because it doesn't adapt well to
> ramping up from a standstill. The current, first cut at incremental mode
> is mostly useful in a steady state (typical for a video game). If I
> increase the message count by a factor of 10, incremental GC mode
> starts helping a little:
>
> % env PLT_INCREMENTAL_GC=y racket -W "debug@GC" main.rkt |& grep MAJ
> GC: 0:MAJ @ 51,218K(+37,101K)...; free 3,087K(-3,087K) 17ms @ 331
> GC: 0:MAJ @ 107,847K(+29,624K)...; free 9,304K(-19,560K) 42ms @ 483
> GC: 0:MAJ @ 209,809K(+38,814K)...; free 6,715K(-22,635K) 56ms @ 781
> GC: 0:MAJ @ 416,454K(+44,697K)...; free 9,512K(-6,920K) 86ms @ 1477
> GC: 0:MAJ @ 824,407K(+76,520K)...; free 182,074K(-182,074K) 113ms @ 2888
> GC: 0:MAJ @ 1,294,379K(+98,068K)...; free 417,187K(-417,187K) 114ms @ 5066
> GC: 0:MAJ @ 1,763,263K(+120,704K)...; free 647,816K(-647,816K) 64ms @ 8067
> GC: 0:MAJ @ 2,028,264K(+117,847K)...; free 885,256K(-901,640K) 43ms @ 10918
> GC: 0:MAJ @ 2,035,484K(+127,011K)...; free 912,992K(-912,992K) 47ms @ 13779
> GC: 0:MAJ @ 2,026,836K(+135,659K)...; free 890,614K(-890,614K) 41ms @ 16503
> ....
>
> A program can request incremental mode and encourage more incremental
> work through the `collect-garbage` function. Adding
>
>   (when (zero? (modulo i 100))
>     (collect-garbage 'incremental)
>     (collect-garbage 'minor))
>
> to the `for/fold` loop brings peak GC time down to around 30ms (also
> using the GC and HAMT improvements), but the program takes twice as
> long to run. Using 50 instead of 100 in the `modulo` brings the peak
> time to 20ms, but then the program takes three times as long to run.

--
You received this message because you are subscribed to the Google Groups "Racket Developers" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/racket-dev/CAPFanBHw1mo6z8hzAoEKiomjTcoKBK2CXR5MnjYW-oniC8cSYg%40mail.gmail.com.
