[racket-dev] Racket's worst-case GC latencies

Gabriel Scherer Sat, 14 May 2016 19:09:24 -0700

Hi racket-devel,

Short version:
  Racket has relatively poor worst-case GC pause times on a specific
  benchmark:
    https://gitlab.com/gasche/gc-latency-experiment/blob/master/main.rkt


  1) Is this expected? What is the GC algorithm for Racket's old generation?
  2) Can you make it better?

Long version:

## Context

James Fisher has a blog post on a case where GHC's runtime system
imposed unpleasant latencies/pauses on their Haskell program:

  https://blog.pusher.com/latency-working-set-ghc-gc-pick-two/

The blog post proposes a very simple, synthetic benchmark that exhibits
the issue -- namely, if the old generation of the GC uses a copying
strategy, then those copies can incur long pauses when many large
objects are live in the old generation.

I ported this synthetic benchmark to OCaml, and could check that the
OCaml GC suffers from no such issue, as its old generation uses
a mark&sweep strategy that does not copy old memory. The Haskell
benchmark has worst-case latencies around 50ms, which James Fisher
finds excessive. The OCaml benchmark has worst-case latencies around
3ms.

Max New did a port of the benchmark to Racket, which I later modified;
the results I see on my machine are relatively bad: the worst-case
pause time is between 120ms and 220ms on my tests.

I think that the results are mostly unrelated to the specific edge
case that this benchmark was designed to exercise (copies of large
objects in the old generation): if I change the inserted strings to be
of size 1 instead of 1024, I also observe fairly high latencies --
such as 120ms. So I'm mostly observing high latencies by inserting and
removing things from an immutable hash in a loop.

## Reproducing

The benchmark fills an (immutable) associative structure with strings
of length 1024 (the idea is to have relatively high memory usage per
pointer, to see large copy times), keeping at most 200,000 strings in
the working set. In total, it inserts 1,000,000 strings (and thus
removes 800,000, one after each insertion after the first 200,000). We
measure latencies rather than throughput, so the performance details
of the associative map structure do not matter.
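To make the insertion/removal schedule concrete, here is a minimal sketch of the sliding-window workload in Python. It is not the code from the repository: the names `run_window` and `payload_size` are made up for illustration, and it uses a plain mutable dict where the real benchmark uses an immutable map, so it only checks the bookkeeping (how many insertions and removals happen, and how large the working set gets), not any GC behaviour.

```python
def run_window(msg_count, window_size, payload_size):
    """Replay the insert/remove schedule and report what it did.

    Returns (insertions, removals, peak_live), where peak_live is the largest
    number of entries held once a step (insert plus optional removal) is done.
    """
    live = {}  # plain mutable dict; the real benchmark uses an immutable map
    insertions = removals = peak_live = 0
    for i in range(msg_count):
        # One fresh payload per message, so each entry owns its own memory.
        live[i] = bytes([i % 256]) * payload_size
        insertions += 1
        # Once the window is full, every insertion evicts the oldest entry.
        if i >= window_size:
            del live[i - window_size]
            removals += 1
        peak_live = max(peak_live, len(live))
    return insertions, removals, peak_live


if __name__ == "__main__":
    # Scaled down to 1/1000 of the sizes quoted above so it runs instantly.
    print(run_window(msg_count=1_000, window_size=200, payload_size=1024))
```

With the full sizes (1,000,000 messages, window of 200,000) the same schedule gives 800,000 removals, matching the figures above.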

My benchmark code in Haskell, OCaml and Racket can be found here:
  https://gitlab.com/gasche/gc-latency-experiment.git
  https://gitlab.com/gasche/gc-latency-experiment/tree/master
The Makefile contains my scripts to compile, run and analyze each
language's version.

To run the Racket benchmark with instrumentation:

    PLTSTDERR=debug@GC racket main.rkt 2> racket-gc-log

To extract the pause times from the resulting log file (in the format
produced by Racket 6.5), I do:

   cat racket-gc-log | grep -v total | cut -d' ' -f7 | sort -n

Piping `| uniq --count` after that produces a poor man's histogram of
latencies. I get the following result on my machine:

      1 0ms
      2 1ms
      1 2ms
      1 3ms
      2 4ms
      1 5ms
      1 6ms
      3 8ms
     12 9ms
      1 11ms
      2 12ms
     38 13ms
    126 14ms
     43 15ms
     13 16ms
     19 17ms
      4 18ms
      1 19ms
      1 21ms
      1 48ms
      1 68ms
      1 70ms
      1 133ms
      1 165ms
      1 220ms
      1 227ms
      1 228ms
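For readers who prefer a script to the shell pipeline, the extraction step above can be sketched in Python. This only mirrors the pipeline as written (drop lines containing "total", take the seventh single-space-separated field, sort numerically, count duplicates); it assumes nothing else about the log format. The function names are hypothetical, and skipping lines with fewer than seven fields is my own choice, since `cut` would print such lines differently.

```python
import re
import sys
from collections import Counter


def leading_int(field):
    """Numeric prefix of a field such as '14ms', as `sort -n` would read it."""
    match = re.match(r"\d+", field)
    return int(match.group()) if match else 0


def pause_histogram(lines):
    """Return (field, count) pairs for the seventh field, sorted numerically."""
    pauses = []
    for line in lines:
        if "total" in line:                    # grep -v total
            continue
        fields = line.rstrip("\n").split(" ")  # cut -d' ' (single spaces)
        if len(fields) < 7:
            continue  # assumption: ignore short lines instead of printing them
        pauses.append(fields[6])               # -f7
    counts = Counter(pauses)                   # uniq --count
    return sorted(counts.items(),
                  key=lambda item: (leading_int(item[0]), item[0]))  # sort -n


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python pause_histogram.py racket-gc-log
    with open(sys.argv[1]) as log:
        for value, count in pause_histogram(log):
            print(f"{count:7d} {value}")
```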


## Non-incremental vs. incremental GC

We experimented with PLT_INCREMENTAL_GC=1; on my machine, this does
not decrease the worst-case pause time; on Asumu Takikawa's beefier
machine, I think the pause times decreased a bit -- but still well
above 50ms. Because the incremental GC consumes noticeably more memory,
I am unable to test with both PLT_INCREMENTAL_GC and
PLTSTDERR=debug@GC enabled -- my system runs out of memory.

If I halve the benchmark sizes (half the working-set size, half the
number of iterations), I can run the incremental GC with debugging
enabled. On this half-size instance, I observe the following results:

for the *non-incremental* GC:

      2 1ms
      1 2ms
      2 3ms
      2 4ms
      1 5ms
      1 6ms
      9 8ms
      2 9ms
      1 10ms
     38 13ms
     43 14ms
     13 15ms
      8 16ms
      5 17ms
      6 18ms
      1 44ms
      1 66ms
      1 75ms
      2 126ms
      1 136ms
      1 142ms

for the *incremental* GC:

      2 1ms
      1 2ms
      2 3ms
      3 4ms
      1 5ms
     38 6ms
    155 7ms
    136 8ms
     78 9ms
     56 10ms
     28 11ms
     16 12ms
      2 14ms
      1 15ms
      1 16ms
      2 20ms
      1 32ms
      1 41ms
      1 61ms
      1 101ms
      1 148ms

As you can see, the incremental GC helps, as the distribution of
pauses moves to shorter pause times: it performs more, but shorter,
pauses. However, the worst-case pauses do not improve -- in fact they
are even a bit worse.
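To put numbers on that comparison, the two half-size histograms can be summarized with a few lines of Python. This is only a sketch: the (count, pause) pairs are copied from the tables above, the `summarize` helper is hypothetical, and the 50ms threshold is simply the figure quoted for the Haskell benchmark, used here as an arbitrary cut-off.

```python
# (count, pause in ms) pairs copied from the two half-size histograms above.
NON_INCREMENTAL = [
    (2, 1), (1, 2), (2, 3), (2, 4), (1, 5), (1, 6), (9, 8), (2, 9), (1, 10),
    (38, 13), (43, 14), (13, 15), (8, 16), (5, 17), (6, 18), (1, 44), (1, 66),
    (1, 75), (2, 126), (1, 136), (1, 142),
]
INCREMENTAL = [
    (2, 1), (1, 2), (2, 3), (3, 4), (1, 5), (38, 6), (155, 7), (136, 8),
    (78, 9), (56, 10), (28, 11), (16, 12), (2, 14), (1, 15), (1, 16), (2, 20),
    (1, 32), (1, 41), (1, 61), (1, 101), (1, 148),
]


def summarize(histogram, threshold_ms=50):
    """Total pauses, (lower) median, maximum, and pauses above a threshold."""
    total = sum(count for count, _ in histogram)
    middle = (total + 1) // 2  # rank of the lower median
    seen = 0
    median = None
    for count, pause in sorted(histogram, key=lambda pair: pair[1]):
        seen += count
        if seen >= middle:
            median = pause
            break
    return {
        "pauses": total,
        "median_ms": median,
        "max_ms": max(pause for _, pause in histogram),
        "over_threshold": sum(c for c, p in histogram if p > threshold_ms),
    }


if __name__ == "__main__":
    print("non-incremental:", summarize(NON_INCREMENTAL))
    print("incremental:    ", summarize(INCREMENTAL))
```

On the data above this reports 141 pauses (median 14ms, maximum 142ms, 6 pauses above 50ms) for the non-incremental run, and 527 pauses (median 8ms, maximum 148ms, 3 pauses above 50ms) for the incremental run.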
