Thanks everyone for the pointers; I'm back with more logs and traces. Here's the context, captured in a tmux session after running rcctl start, with ktrace attached in another window. SSH dropped halfway through.
> $ while true; do sleep 1; date; done & tail -f /var/log/daemon
> Sat Mar 25 23:42:16 UTC 2017
> Sat Mar 25 23:42:17 UTC 2017
> Mar 25 23:42:18 7472837 camlistored[7531]: 2017/03/25 23:42:18 Reading encryption metadata...
> Sat Mar 25 23:42:18 UTC 2017
> Sat Mar 25 23:42:19 UTC 2017
> Sat Mar 25 23:42:20 UTC 2017
> Sat Mar 25 23:42:21 UTC 2017
> [... intermittent gaps ...]
> Sat Mar 25 23:45:49 UTC 2017
> Sat Mar 25 23:46:05 UTC 2017
> Sat Mar 25 23:46:08 UTC 2017
> Sat Mar 25 23:46:09 UTC 2017
> Mar 25 23:46:08 7472837 Tor[40093]: Your system clock just jumped 175 seconds forward; assuming established circuits no longer work.
> Sat Mar 25 23:46:10 UTC 2017
> Sat Mar 25 23:46:11 UTC 2017
> Sat Mar 25 23:46:13 UTC 2017
> Sat Mar 25 23:46:14 UTC 2017
> [... intermittent gaps ...]
> Sat Mar 25 23:49:24 UTC 2017
> Sat Mar 25 23:49:25 UTC 2017
> Sat Mar 25 23:49:26 UTC 2017
> Sat Mar 25 23:49:41 UTC 2017
> Mar 25 23:49:43 7472837 ntpd[64718]: IMSG_CONSTRAINT_KILL for invalid id 2
> Sat Mar 25 23:49:44 UTC 2017
> Sat Mar 25 23:49:46 UTC 2017
> Sat Mar 25 23:49:56 UTC 2017
> Sat Mar 25 23:49:57 UTC 2017
> [... intermittent gaps ...]
> Sat Mar 25 23:54:40 UTC 2017
> Sat Mar 25 23:54:41 UTC 2017
> Sat Mar 25 23:54:42 UTC 2017
> Sat Mar 25 23:54:51 UTC 2017
> Sat Mar 25 23:54:56 UTC 2017
> Sat Mar 25 23:54:57 UTC 2017
> Sat Mar 25 23:54:58 UTC 2017
> Sat Mar 25 23:54:59 UTC 2017
> Sat Mar 25 23:55:22 UTC 2017
> Mar 25 23:55:23 7472837 camlistored[7531]: 2017/03/25 23:55:23 Caught panic installer handlers: error instantiating storage for prefix "/enc-b2/", type "encrypt": error scanning metadata on start-up: error with meta blob sha1-xxx: fetch failed: Get https://f001.backblazeb2.com/file/xxx/meta/sha1-xxx: net/http: TLS handshake timeout
> Mar 25 23:55:23 7472837 Tor[40093]: Your system clock just jumped 178 seconds forward; assuming established circuits no longer work.
> Sat Mar 25 23:55:27 UTC 2017
> Mar 25 23:55:27 7472837 Tor[40093]: Tor has successfully opened a circuit. Looks like client functionality is working.
> Sat Mar 25 23:55:28 UTC 2017
> Sat Mar 25 23:55:29 UTC 2017
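By the way, the date(1) loop is a pretty coarse probe. For a finer-grained view, a small standalone Go program along these lines (just a throwaway sketch on my part, with arbitrary thresholds, nothing camlistore-specific) would log exactly how late each wakeup arrives, which should mirror the gaps above:

    package main

    import (
        "log"
        "time"
    )

    // Throwaway pause probe: ask for a 100ms sleep in a loop and
    // report any wakeup that comes back much later than requested.
    // On a healthy machine the overshoot stays in the low single-digit
    // milliseconds; during the freezes above it should track the gaps
    // visible in the date(1) output.
    func main() {
        const interval = 100 * time.Millisecond
        for {
            start := time.Now()
            time.Sleep(interval)
            if late := time.Since(start) - interval; late > 50*time.Millisecond {
                log.Printf("woke up %v late", late)
            }
        }
    }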
And here's the ktrace -di -p output, which is... essentially empty. This is an excerpt from deep into the freeze:

> 7531 camlistored 1490486112.921212 STRU struct timespec { 459373.989343598 }
> 7531 camlistored 1490486112.921215 RET clock_gettime 0
> 7531 camlistored 1490486112.921217 CALL clock_gettime(CLOCK_REALTIME,0xc420031ef8)
> 7531 camlistored 1490486112.921220 STRU struct timespec { 1490486112<"Mar 25 23:55:12 2017">.921219705 }
> 7531 camlistored 1490486112.921222 RET clock_gettime 0
> 7531 camlistored 1490486112.921225 CALL kevent(5,0,0,0xc420031730,64,0xc420031718)
> 7531 camlistored 1490486112.921228 STRU struct timespec { 0 }
> 7531 camlistored 1490486112.921231 RET kevent 0
> 7531 camlistored 1490486112.921234 CALL nanosleep(0xc420031f20,0)
> 7531 camlistored 1490486112.921237 STRU struct timespec { 0.010000000 }
> 7531 camlistored 1490486112.941202 RET nanosleep 0
> 7531 camlistored 1490486112.941211 CALL clock_gettime(CLOCK_MONOTONIC,0xc420031f20)
> 7531 camlistored 1490486112.941216 STRU struct timespec { 459374.009347523 }
> 7531 camlistored 1490486112.941219 RET clock_gettime 0
> 7531 camlistored 1490486112.941221 CALL clock_gettime(CLOCK_REALTIME,0xc420031ef8)
> 7531 camlistored 1490486112.941224 STRU struct timespec { 1490486112<"Mar 25 23:55:12 2017">.941223979 }
> 7531 camlistored 1490486112.941226 RET clock_gettime 0
> 7531 camlistored 1490486112.941229 CALL kevent(5,0,0,0xc420031730,64,0xc420031718)
> 7531 camlistored 1490486112.941232 STRU struct timespec { 0 }
> 7531 camlistored 1490486112.941235 RET kevent 0
> 7531 camlistored 1490486112.941238 CALL nanosleep(0xc420031f20,0)
> 7531 camlistored 1490486112.941240 STRU struct timespec { 0.010000000 }
> 7531 camlistored 1490486112.961205 RET nanosleep 0
> 7531 camlistored 1490486112.961214 CALL clock_gettime(CLOCK_MONOTONIC,0xc420031f20)
> 7531 camlistored 1490486112.961220 STRU struct timespec { 459374.029350819 }
> 7531 camlistored 1490486112.961222 RET clock_gettime 0
> 7531 camlistored 1490486112.961224 CALL clock_gettime(CLOCK_REALTIME,0xc420031ef8)
> 7531 camlistored 1490486112.961227 STRU struct timespec { 1490486112<"Mar 25 23:55:12 2017">.961227136 }
> 7531 camlistored 1490486112.961229 RET clock_gettime 0
> 7531 camlistored 1490486112.961232 CALL kevent(5,0,0,0xc420031730,64,0xc420031718)
> 7531 camlistored 1490486112.961235 STRU struct timespec { 0 }
> 7531 camlistored 1490486112.961238 RET kevent 0
> 7531 camlistored 1490486112.961241 CALL nanosleep(0xc420031f20,0)
> 7531 camlistored 1490486112.961244 STRU struct timespec { 0.010000000 }
> 7531 camlistored 1490486112.981207 RET nanosleep 0
> 7531 camlistored 1490486112.981215 CALL clock_gettime(CLOCK_MONOTONIC,0xc420031f20)
> 7531 camlistored 1490486112.981221 STRU struct timespec { 459374.049352090 }
> 7531 camlistored 1490486112.981223 RET clock_gettime 0
> 7531 camlistored 1490486112.981225 CALL clock_gettime(CLOCK_REALTIME,0xc420031ef8)
> 7531 camlistored 1490486112.981229 STRU struct timespec { 1490486112<"Mar 25 23:55:12 2017">.981228616 }
> 7531 camlistored 1490486112.981231 RET clock_gettime 0
> 7531 camlistored 1490486112.981234 CALL kevent(5,0,0,0xc420031730,64,0xc420031718)
> 7531 camlistored 1490486112.981237 STRU struct timespec { 0 }
> 7531 camlistored 1490486112.981239 RET kevent 0
> 7531 camlistored 1490486112.981242 CALL nanosleep(0xc420031f20,0)
> 7531 camlistored 1490486112.981245 STRU struct timespec { 0.010000000 }
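For what it's worth, that repeating clock_gettime / kevent / nanosleep cycle looks to me like the runtime's own idle housekeeping (timer checks plus a zero-timeout network poll every 10ms) rather than anything camlistored itself is doing. Purely as an illustration of the pattern (this is not the actual runtime source):

    package main

    import "time"

    // Illustration of the idle pattern in the trace above (NOT the
    // real runtime code): read the clocks, poll for ready events with
    // a zero timeout (the kevent calls returning 0), sleep 10ms, and
    // repeat. In other words, the process wakes up every 10ms only to
    // find it has nothing to do.
    func main() {
        for {
            _ = time.Now() // clock_gettime(CLOCK_MONOTONIC/CLOCK_REALTIME)
            // kevent(5, ..., timeout={ 0 }) would happen here and
            // keep coming back with no events.
            time.Sleep(10 * time.Millisecond) // nanosleep({ 0.010000000 })
        }
    }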
The only two other events that happened in the surrounding 10 seconds are these:

> 7531 camlistored 1490486111.961230 STRU struct kevent { ident=11, filter=EVFILT_READ, flags=0x8021<EV_ADD|EV_CLEAR|EV_EOF>, fflags=0<>, data=4175, udata=0x2280f96c0 }
> 7531 camlistored 1490486111.961232 RET kevent 1
> 7531 camlistored 1490486111.981238 STRU struct kevent { ident=85, filter=EVFILT_READ, flags=0x8021<EV_ADD|EV_CLEAR|EV_EOF>, fflags=0<>, data=4175, udata=0x2280f9480 }
> 7531 camlistored 1490486111.981241 RET kevent 1

Something I noticed: generating I/O during that step, either to syslog or to stderr (by increasing verbosity), significantly increases the chances of the process making it through, and it seems to reduce the pauses in other processes.

Drilling a little more into memory utilization with a gctrace (GODEBUG=gctrace=1) suggests that Go does obtain a bigger allocation from the OS than there is resident memory, but never really uses all of it, so I'd expect most hot pages to fit in memory.

> gc 1 @0.511s 3%: 18+7.6+0.049 ms clock, 37+4.0/0.23/5.3+0.098 ms cpu, 4->4->1 MB, 5 MB goal, 2 P
> gc 2 @0.551s 3%: 0.013+4.1+0.046 ms clock, 0.026+0/0.010/4.0+0.092 ms cpu, 4->4->2 MB, 5 MB goal, 2 P
> gc 3 @0.574s 3%: 0.013+1.6+0.051 ms clock, 0.027+0.24/0.10/1.4+0.10 ms cpu, 5->6->4 MB, 6 MB goal, 2 P
> gc 4 @2.518s 0%: 0.015+5.4+0.045 ms clock, 0.030+0.39/0.67/5.4+0.090 ms cpu, 11->11->8 MB, 12 MB goal, 2 P
> gc 5 @2.531s 1%: 8.9+2.6+0.026 ms clock, 17+0.18/0.65/2.1+0.052 ms cpu, 12->12->8 MB, 17 MB goal, 2 P
> gc 6 @2.663s 1%: 0.015+2.9+0.038 ms clock, 0.031+0.23/1.0/2.1+0.077 ms cpu, 17->17->12 MB, 18 MB goal, 2 P
> gc 7 @3.257s 0%: 0.017+3.0+0.048 ms clock, 0.035+0.26/1.2/1.7+0.096 ms cpu, 1038->1038->1037 MB, 1039 MB goal, 2 P
> gc 8 @10.316s 0%: 0.011+2.0+0.040 ms clock, 0.023+0/1.9/1.9+0.080 ms cpu, 1038->1038->13 MB, 2074 MB goal, 2 P
> gc 9 @10.619s 0%: 10+5.8+0.037 ms clock, 20+0.41/5.7/0+0.074 ms cpu, 20->21->17 MB, 27 MB goal, 2 P
> gc 10 @10.732s 0%: 7.4+3.6+0.040 ms clock, 14+0.11/3.6/0+0.080 ms cpu, 31->31->19 MB, 35 MB goal, 2 P
> gc 11 @10.811s 0%: 0.014+4.2+0.28 ms clock, 0.028+0.34/2.0/1.7+0.56 ms cpu, 35->35->19 MB, 38 MB goal, 2 P
> gc 12 @10.978s 0%: 0.015+4.4+0.042 ms clock, 0.031+0.28/3.6/0.49+0.084 ms cpu, 37->37->23 MB, 38 MB goal, 2 P
> gc 13 @11.211s 0%: 0.019+5.6+0.040 ms clock, 0.038+0.28/4.8/0.53+0.081 ms cpu, 46->46->29 MB, 47 MB goal, 2 P
> gc 14 @11.482s 0%: 0.015+4.8+0.038 ms clock, 0.031+0.85/4.0/0.38+0.077 ms cpu, 56->57->29 MB, 58 MB goal, 2 P
> gc 15 @11.752s 0%: 0.19+4.2+0.17 ms clock, 0.39+1.7/2.6/0+0.34 ms cpu, 59->59->30 MB, 60 MB goal, 2 P
> gc 16 @12.002s 0%: 0.020+5.2+0.045 ms clock, 0.041+0.28/0.53/4.6+0.091 ms cpu, 60->60->36 MB, 61 MB goal, 2 P
> gc 17 @12.525s 0%: 0.018+4.0+0.053 ms clock, 0.036+0.37/0.19/3.4+0.10 ms cpu, 71->72->35 MB, 73 MB goal, 2 P
> gc 18 @12.965s 0%: 0.016+8.8+1.2 ms clock, 0.032+0.28/7.6/0.89+2.4 ms cpu, 70->70->54 MB, 71 MB goal, 2 P
> gc 19 @13.527s 0%: 0.017+9.6+0.055 ms clock, 0.034+0.69/9.1/0.31+0.11 ms cpu, 112->112->63 MB, 113 MB goal, 2 P
> gc 20 @14.193s 0%: 0.017+11+0.044 ms clock, 0.035+0.79/10/0.010+0.088 ms cpu, 136->137->74 MB, 137 MB goal, 2 P
> gc 21 @15.072s 0%: 0.015+12+0.047 ms clock, 0.031+0.74/12/0.42+0.095 ms cpu, 166->166->89 MB, 167 MB goal, 2 P
> gc 22 @17.645s 0%: 0.019+3.5+0.61 ms clock, 0.039+0.32/1.2/3.0+1.2 ms cpu, 150->155->72 MB, 179 MB goal, 2 P
> gc 23 @19.914s 2%: 353+5.0+0.037 ms clock, 706+0.53/4.9/0+0.074 ms cpu, 1131->1131->1112 MB, 1132 MB goal, 2 P
> gc 24 @21.064s 2%: 0.016+4.4+0.034 ms clock, 0.032+0.40/3.4/0.93+0.068 ms cpu, 1113->1113->1112 MB, 2225 MB goal, 2 P
> gc 25 @111.711s 12%: 2881+28410+4.7 ms clock, 5763+11089/18939/638+9.5 ms cpu, 1688->1704->188 MB, 2225 MB goal, 2 P
> scvg0: inuse: 1706, idle: 362, sys: 2069, released: 0, consumed: 2069 (MB)
> gc 26 @237.001s 22%: 92+17694+40738 ms clock, 185+1294/15152/0+81477 ms cpu, 339->343->215 MB, 377 MB goal, 2 P
> scvg1: inuse: 230, idle: 1838, sys: 2069, released: 0, consumed: 2069 (MB)
> GC forced
> gc 27 @426.241s 15%: 0.89+7.9+0.049 ms clock, 1.7+0.93/6.0/0.046+0.099 ms cpu, 241->241->165 MB, 430 MB goal, 2 P

(The gctrace format is documented at https://github.com/golang/go/blob/f8ed4539e/src/runtime/extern.go#L74.)

The interesting GC rounds are 7 and 8, which go smoothly and see the heap grow to ~1 GB and shrink back down, and rounds 23+, which seem to handle something similar but go spectacularly badly, with STW phases of up to 40 seconds (!!) of wall clock. The peak Go heap size there is 1704 MB, which is very close to my free resident memory. (Note that these are all Go-internal numbers; the scvg rounds tell us that Go asked the OS for 2 GB and never released any of it.)

Admittedly this task seems to require (barely) more heap than I have free memory (I am indeed going to need a bigger boat!), so the scheduler getting stuck deep in paging routines seems likely, with the GC mark phase making everything worse by bringing every page back in to follow the pointers. I guess that would also explain why generating I/O gives it some breathing space. However, such a small over-use of resources maybe shouldn't be this catastrophic, and it probably shouldn't freeze the rest of userspace; I'll happily defer to more expert opinions. Thanks again.
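If it helps to cross-check the scvg numbers from inside the process, the same counters are exposed by runtime.ReadMemStats. A minimal sketch, assuming one can bolt a debug loop like this onto the daemon (and noting that debug.FreeOSMemory forces a full collection, so it's a diagnostic, not a fix):

    package main

    import (
        "log"
        "runtime"
        "runtime/debug"
        "time"
    )

    // Periodically dump the counters behind the gctrace/scvg lines:
    // HeapSys is what Go has obtained from the OS, HeapInuse what it
    // is actually using, and HeapReleased what it has handed back.
    // On this Go version (1.8) the scavenger only returns spans that
    // have sat idle for about five minutes, which would be consistent
    // with the "released: 0" readings above; debug.FreeOSMemory()
    // forces a GC followed by an immediate release attempt.
    func main() {
        for range time.Tick(30 * time.Second) {
            var m runtime.MemStats
            runtime.ReadMemStats(&m)
            log.Printf("sys=%dMB inuse=%dMB idle=%dMB released=%dMB",
                m.HeapSys>>20, m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20)
            if m.HeapIdle-m.HeapReleased > 1<<30 { // >1GB idle but not returned
                debug.FreeOSMemory()
            }
        }
    }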

