Thanks everyone for the pointers, I'm back with more logs and traces.
This is the context, as captured in a tmux session after running rcctl
start, with ktrace attached in another window. SSH dropped halfway through.
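
For reference, this is roughly what the two panes were doing; the rc.d
service name is my assumption, and 7531 is the daemon's pid from the
traces below.

    # pane 1: (re)start the daemon via its rc.d script (service name assumed)
    rcctl start camlistored
    # pane 2: trace the already-running daemon
    ktrace -di -p 7531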

> $ while true; do sleep 1; date; done & tail -f /var/log/daemon
> Sat Mar 25 23:42:16 UTC 2017
> Sat Mar 25 23:42:17 UTC 2017
> Mar 25 23:42:18 7472837 camlistored[7531]: 2017/03/25 23:42:18 Reading
encryption metadata...
> Sat Mar 25 23:42:18 UTC 2017
> Sat Mar 25 23:42:19 UTC 2017
> Sat Mar 25 23:42:20 UTC 2017
> Sat Mar 25 23:42:21 UTC 2017
> [... intermittent gaps ...]
> Sat Mar 25 23:45:49 UTC 2017
> Sat Mar 25 23:46:05 UTC 2017
> Sat Mar 25 23:46:08 UTC 2017
> Sat Mar 25 23:46:09 UTC 2017
> Mar 25 23:46:08 7472837 Tor[40093]: Your system clock just jumped 175
seconds forward; assuming established circuits no longer work.
> Sat Mar 25 23:46:10 UTC 2017
> Sat Mar 25 23:46:11 UTC 2017
> Sat Mar 25 23:46:13 UTC 2017
> Sat Mar 25 23:46:14 UTC 2017
> [... intermittent gaps ...]
> Sat Mar 25 23:49:24 UTC 2017
> Sat Mar 25 23:49:25 UTC 2017
> Sat Mar 25 23:49:26 UTC 2017
> Sat Mar 25 23:49:41 UTC 2017
> Mar 25 23:49:43 7472837 ntpd[64718]: IMSG_CONSTRAINT_KILL for invalid id 2
> Sat Mar 25 23:49:44 UTC 2017
> Sat Mar 25 23:49:46 UTC 2017
> Sat Mar 25 23:49:56 UTC 2017
> Sat Mar 25 23:49:57 UTC 2017
> [... intermittent gaps ...]
> Sat Mar 25 23:54:40 UTC 2017
> Sat Mar 25 23:54:41 UTC 2017
> Sat Mar 25 23:54:42 UTC 2017
> Sat Mar 25 23:54:51 UTC 2017
> Sat Mar 25 23:54:56 UTC 2017
> Sat Mar 25 23:54:57 UTC 2017
> Sat Mar 25 23:54:58 UTC 2017
> Sat Mar 25 23:54:59 UTC 2017
> Sat Mar 25 23:55:22 UTC 2017
> Mar 25 23:55:23 7472837 camlistored[7531]: 2017/03/25 23:55:23 Caught panic
installer handlers: error instantiating storage for prefix "/enc-b2/", type
"encrypt": error scanning metadata on start-up: error with meta blob sha1-xxx:
fetch failed: Get https://f001.backblazeb2.com/file/xxx/meta/sha1-xxx:
net/http: TLS handshake timeout
> Mar 25 23:55:23 7472837 Tor[40093]: Your system clock just jumped 178
seconds forward; assuming established circuits no longer work.
> Sat Mar 25 23:55:27 UTC 2017
> Mar 25 23:55:27 7472837 Tor[40093]: Tor has successfully opened a circuit.
Looks like client functionality is working.
> Sat Mar 25 23:55:28 UTC 2017
> Sat Mar 25 23:55:29 UTC 2017

And here's the ktrace -di -p output, which is... essentially empty. This
is an excerpt from deep into the freeze.
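
(The dump was read back with kdump, roughly as follows; -T for absolute
timestamps, if I remember the flag right.)

    # decode ktrace.out with absolute timestamps
    kdump -T -f ktrace.out | less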

>   7531 camlistored 1490486112.921212 STRU  struct timespec {
459373.989343598 }
>   7531 camlistored 1490486112.921215 RET   clock_gettime 0
>   7531 camlistored 1490486112.921217 CALL 
clock_gettime(CLOCK_REALTIME,0xc420031ef8)
>   7531 camlistored 1490486112.921220 STRU  struct timespec {
1490486112<"Mar 25 23:55:12 2017">.921219705 }
>   7531 camlistored 1490486112.921222 RET   clock_gettime 0
>   7531 camlistored 1490486112.921225 CALL 
kevent(5,0,0,0xc420031730,64,0xc420031718)
>   7531 camlistored 1490486112.921228 STRU  struct timespec { 0 }
>   7531 camlistored 1490486112.921231 RET   kevent 0
>   7531 camlistored 1490486112.921234 CALL  nanosleep(0xc420031f20,0)
>   7531 camlistored 1490486112.921237 STRU  struct timespec { 0.010000000
}
>   7531 camlistored 1490486112.941202 RET   nanosleep 0
>   7531 camlistored 1490486112.941211 CALL 
clock_gettime(CLOCK_MONOTONIC,0xc420031f20)
>   7531 camlistored 1490486112.941216 STRU  struct timespec {
459374.009347523 }
>   7531 camlistored 1490486112.941219 RET   clock_gettime 0
>   7531 camlistored 1490486112.941221 CALL 
clock_gettime(CLOCK_REALTIME,0xc420031ef8)
>   7531 camlistored 1490486112.941224 STRU  struct timespec {
1490486112<"Mar 25 23:55:12 2017">.941223979 }
>   7531 camlistored 1490486112.941226 RET   clock_gettime 0
>   7531 camlistored 1490486112.941229 CALL 
kevent(5,0,0,0xc420031730,64,0xc420031718)
>   7531 camlistored 1490486112.941232 STRU  struct timespec { 0 }
>   7531 camlistored 1490486112.941235 RET   kevent 0
>   7531 camlistored 1490486112.941238 CALL  nanosleep(0xc420031f20,0)
>   7531 camlistored 1490486112.941240 STRU  struct timespec { 0.010000000
}
>   7531 camlistored 1490486112.961205 RET   nanosleep 0
>   7531 camlistored 1490486112.961214 CALL 
clock_gettime(CLOCK_MONOTONIC,0xc420031f20)
>   7531 camlistored 1490486112.961220 STRU  struct timespec {
459374.029350819 }
>   7531 camlistored 1490486112.961222 RET   clock_gettime 0
>   7531 camlistored 1490486112.961224 CALL 
clock_gettime(CLOCK_REALTIME,0xc420031ef8)
>   7531 camlistored 1490486112.961227 STRU  struct timespec {
1490486112<"Mar 25 23:55:12 2017">.961227136 }
>   7531 camlistored 1490486112.961229 RET   clock_gettime 0
>   7531 camlistored 1490486112.961232 CALL 
kevent(5,0,0,0xc420031730,64,0xc420031718)
>   7531 camlistored 1490486112.961235 STRU  struct timespec { 0 }
>   7531 camlistored 1490486112.961238 RET   kevent 0
>   7531 camlistored 1490486112.961241 CALL  nanosleep(0xc420031f20,0)
>   7531 camlistored 1490486112.961244 STRU  struct timespec { 0.010000000
}
>   7531 camlistored 1490486112.981207 RET   nanosleep 0
>   7531 camlistored 1490486112.981215 CALL 
clock_gettime(CLOCK_MONOTONIC,0xc420031f20)
>   7531 camlistored 1490486112.981221 STRU  struct timespec {
459374.049352090 }
>   7531 camlistored 1490486112.981223 RET   clock_gettime 0
>   7531 camlistored 1490486112.981225 CALL 
clock_gettime(CLOCK_REALTIME,0xc420031ef8)
>   7531 camlistored 1490486112.981229 STRU  struct timespec {
1490486112<"Mar 25 23:55:12 2017">.981228616 }
>   7531 camlistored 1490486112.981231 RET   clock_gettime 0
>   7531 camlistored 1490486112.981234 CALL 
kevent(5,0,0,0xc420031730,64,0xc420031718)
>   7531 camlistored 1490486112.981237 STRU  struct timespec { 0 }
>   7531 camlistored 1490486112.981239 RET   kevent 0
>   7531 camlistored 1490486112.981242 CALL  nanosleep(0xc420031f20,0)
>   7531 camlistored 1490486112.981245 STRU  struct timespec { 0.010000000
}

The only two other events that happened in the surrounding ten seconds
are these.

>   7531 camlistored 1490486111.961230 STRU  struct kevent { ident=11,
filter=EVFILT_READ, flags=0x8021<EV_ADD|EV_CLEAR|EV_EOF>, fflags=0<>,
data=4175, udata=0x2280f96c0 }
>   7531 camlistored 1490486111.961232 RET   kevent 1
>   7531 camlistored 1490486111.981238 STRU  struct kevent { ident=85,
filter=EVFILT_READ, flags=0x8021<EV_ADD|EV_CLEAR|EV_EOF>, fflags=0<>,
data=4175, udata=0x2280f9480 }
>   7531 camlistored 1490486111.981241 RET   kevent 1

Something I noticed is that generating I/O during that step, either to
syslog or to stderr (by increasing verbosity), significantly increases
the chances of the process making it through and seems to reduce the
pauses in other processes.

Drilling a little further into memory utilization with a gctrace
suggests that Go does obtain a bigger allocation from the OS than there
is free resident memory, but never really uses all of it, so I'd expect
most hot pages to fit in memory.
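
(For completeness, the trace comes from the Go runtime's standard GODEBUG
knob, set along these lines in the daemon's environment; the log path is
just an example.)

    # print a GC summary line at each collection, plus periodic scavenger
    # stats, on stderr
    GODEBUG=gctrace=1 camlistored 2>>/var/log/camlistored.gctrace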

> gc 1 @0.511s 3%: 18+7.6+0.049 ms clock, 37+4.0/0.23/5.3+0.098 ms cpu,
4->4->1 MB, 5 MB goal, 2 P
> gc 2 @0.551s 3%: 0.013+4.1+0.046 ms clock, 0.026+0/0.010/4.0+0.092 ms cpu,
4->4->2 MB, 5 MB goal, 2 P
> gc 3 @0.574s 3%: 0.013+1.6+0.051 ms clock, 0.027+0.24/0.10/1.4+0.10 ms cpu,
5->6->4 MB, 6 MB goal, 2 P
> gc 4 @2.518s 0%: 0.015+5.4+0.045 ms clock, 0.030+0.39/0.67/5.4+0.090 ms cpu,
11->11->8 MB, 12 MB goal, 2 P
> gc 5 @2.531s 1%: 8.9+2.6+0.026 ms clock, 17+0.18/0.65/2.1+0.052 ms cpu,
12->12->8 MB, 17 MB goal, 2 P
> gc 6 @2.663s 1%: 0.015+2.9+0.038 ms clock, 0.031+0.23/1.0/2.1+0.077 ms cpu,
17->17->12 MB, 18 MB goal, 2 P
> gc 7 @3.257s 0%: 0.017+3.0+0.048 ms clock, 0.035+0.26/1.2/1.7+0.096 ms cpu,
1038->1038->1037 MB, 1039 MB goal, 2 P
> gc 8 @10.316s 0%: 0.011+2.0+0.040 ms clock, 0.023+0/1.9/1.9+0.080 ms cpu,
1038->1038->13 MB, 2074 MB goal, 2 P
> gc 9 @10.619s 0%: 10+5.8+0.037 ms clock, 20+0.41/5.7/0+0.074 ms cpu,
20->21->17 MB, 27 MB goal, 2 P
> gc 10 @10.732s 0%: 7.4+3.6+0.040 ms clock, 14+0.11/3.6/0+0.080 ms cpu,
31->31->19 MB, 35 MB goal, 2 P
> gc 11 @10.811s 0%: 0.014+4.2+0.28 ms clock, 0.028+0.34/2.0/1.7+0.56 ms cpu,
35->35->19 MB, 38 MB goal, 2 P
> gc 12 @10.978s 0%: 0.015+4.4+0.042 ms clock, 0.031+0.28/3.6/0.49+0.084 ms
cpu, 37->37->23 MB, 38 MB goal, 2 P
> gc 13 @11.211s 0%: 0.019+5.6+0.040 ms clock, 0.038+0.28/4.8/0.53+0.081 ms
cpu, 46->46->29 MB, 47 MB goal, 2 P
> gc 14 @11.482s 0%: 0.015+4.8+0.038 ms clock, 0.031+0.85/4.0/0.38+0.077 ms
cpu, 56->57->29 MB, 58 MB goal, 2 P
> gc 15 @11.752s 0%: 0.19+4.2+0.17 ms clock, 0.39+1.7/2.6/0+0.34 ms cpu,
59->59->30 MB, 60 MB goal, 2 P
> gc 16 @12.002s 0%: 0.020+5.2+0.045 ms clock, 0.041+0.28/0.53/4.6+0.091 ms
cpu, 60->60->36 MB, 61 MB goal, 2 P
> gc 17 @12.525s 0%: 0.018+4.0+0.053 ms clock, 0.036+0.37/0.19/3.4+0.10 ms
cpu, 71->72->35 MB, 73 MB goal, 2 P
> gc 18 @12.965s 0%: 0.016+8.8+1.2 ms clock, 0.032+0.28/7.6/0.89+2.4 ms cpu,
70->70->54 MB, 71 MB goal, 2 P
> gc 19 @13.527s 0%: 0.017+9.6+0.055 ms clock, 0.034+0.69/9.1/0.31+0.11 ms
cpu, 112->112->63 MB, 113 MB goal, 2 P
> gc 20 @14.193s 0%: 0.017+11+0.044 ms clock, 0.035+0.79/10/0.010+0.088 ms
cpu, 136->137->74 MB, 137 MB goal, 2 P
> gc 21 @15.072s 0%: 0.015+12+0.047 ms clock, 0.031+0.74/12/0.42+0.095 ms cpu,
166->166->89 MB, 167 MB goal, 2 P
> gc 22 @17.645s 0%: 0.019+3.5+0.61 ms clock, 0.039+0.32/1.2/3.0+1.2 ms cpu,
150->155->72 MB, 179 MB goal, 2 P
> gc 23 @19.914s 2%: 353+5.0+0.037 ms clock, 706+0.53/4.9/0+0.074 ms cpu,
1131->1131->1112 MB, 1132 MB goal, 2 P
> gc 24 @21.064s 2%: 0.016+4.4+0.034 ms clock, 0.032+0.40/3.4/0.93+0.068 ms
cpu, 1113->1113->1112 MB, 2225 MB goal, 2 P
> gc 25 @111.711s 12%: 2881+28410+4.7 ms clock, 5763+11089/18939/638+9.5 ms
cpu, 1688->1704->188 MB, 2225 MB goal, 2 P
> scvg0: inuse: 1706, idle: 362, sys: 2069, released: 0, consumed: 2069 (MB)
> gc 26 @237.001s 22%: 92+17694+40738 ms clock, 185+1294/15152/0+81477 ms cpu,
339->343->215 MB, 377 MB goal, 2 P
> scvg1: inuse: 230, idle: 1838, sys: 2069, released: 0, consumed: 2069 (MB)
> GC forced
> gc 27 @426.241s 15%: 0.89+7.9+0.049 ms clock, 1.7+0.93/6.0/0.046+0.099 ms
cpu, 241->241->165 MB, 430 MB goal, 2 P

(https://github.com/golang/go/blob/f8ed4539e/src/runtime/extern.go#L74)

The interesting GC rounds are 7 and 8, which go smoothly and see the
heap grow to 1 GB and shrink back down, and rounds 23 and later, which
seem to handle something similar but go spectacularly badly, with 80 s
(!!) STW phases. The peak Go heap size there is 1704 MB, which is very
close to my free resident memory. (Note that these are all Go-internal
numbers. The scvg lines tell us that Go asked the OS for 2 GB and never
released any of it.)
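
If I'm reading the format documented at the link above correctly (the
three clock figures are STW sweep termination, concurrent mark/scan, and
STW mark termination), gc 26 breaks down like this:

    # gc 26 @237.001s 22%: 92+17694+40738 ms clock,
    #       185+1294/15152/0+81477 ms cpu, 339->343->215 MB, 377 MB goal, 2 P
    #
    #   92 ms     STW sweep termination (wall clock)
    #   17694 ms  concurrent mark and scan (wall clock)
    #   40738 ms  STW mark termination (wall clock) -- the killer pause
    #   cpu: 185 + 1294/15152/0 (assist/background/idle mark) + 81477
    #   heap: 339 MB at start -> 343 MB at end -> 215 MB live; goal 377 MB; 2 P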

Admittedly this task seems to require (barely) more heap than I have
free memory (I am indeed going to need a bigger boat!), so the scheduler
getting stuck deep in paging routines seems likely, with the GC making
everything worse by paging everything back in as it follows pointers;
that would also explain why extra I/O gives it some breathing room.
However, such a small overshoot of resources maybe shouldn't be this
catastrophic, and probably shouldn't freeze the rest of userspace, but
I'd welcome more expert opinions.
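
If someone wants to check the paging theory, I suppose watching the pager
during one of these freezes would settle it; something like this in yet
another pane (flags and column names from memory):

    # report VM statistics once a second; sustained pi/po (page-in/page-out)
    # activity during the pauses would point at thrashing
    vmstat -w 1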

Thanks again.
