Re: [go-nuts] OOM occurring with a small heap

2019-07-03 Thread 'Yunchi Luo' via golang-nuts
I actually just managed to trace down the root cause of the bug, and it's
quite surprising. It's not a heap overflow, it's a stack overflow, due to a
bug in the stdlib! Specifically, adding the same http.ClientTrace twice
onto a request context causes a stack overflow.

https://github.com/mightyguava/tracereflectbug/blob/480217947e3729a3fa833f6ece8bcf0d8012aaa3/main.go#L20-L22
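For anyone who doesn't want to click through, here's a minimal sketch of what I
believe the failure boils down to (my reconstruction, not the linked file
verbatim; the hook body and address are placeholders):

package main

import (
  "context"
  "net/http/httptrace"
)

func main() {
  trace := &httptrace.ClientTrace{
    ConnectStart: func(_, _ string) {},
  }
  ctx := httptrace.WithClientTrace(context.Background(), trace)

  // Adding the *same* trace again makes compose() wrap ConnectStart in a
  // reflect-made function that calls "the old hook" -- but old and new are
  // the same struct here, so the old hook is now the wrapper itself.
  ctx = httptrace.WithClientTrace(ctx, trace)

  // Any dial that fires the hook (or calling it directly, as here) now
  // recurses until the goroutine stack hits Go's 1 GB limit.
  trace.ConnectStart("tcp", "127.0.0.1:0")
  _ = ctx
}

In our case the second WithClientTrace call came from the retry path rather
than two back-to-back calls, but the effect is the same.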


This explains everything we've been seeing in the memory/gc logs.

The story goes as follows:

There was a bug in my code for adding an http trace (copied earlier in this
thread). I added the trace to the request in the AWS SDK Client's Send
handler, which is triggered on all retries, not just the first request. On
rare occasions (every few hours, apparently), a network issue or DynamoDB
backend issue causes retries in multiple client goroutines to happen at the
same time. Since multiple goroutines are experiencing runaway stacks at the
same time, the service gets killed for OOM before any single stack reaches
Go's 1 GB limit, so we never see a stack overflow error.

This explains the relatively small increase in heap and the much larger
increase in RSS. I bet if I had actually logged StackInuse and StackSys from
runtime.MemStats, they would have shown an increase much closer to that of
RSS.
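Something like this sketch (the interval and log format are arbitrary, and in
the real service it would run in its own goroutine) would have made the
runaway stacks obvious next to RSS:

package main

import (
  "log"
  "runtime"
  "time"
)

// logMemStats periodically logs the stack- and heap-related MemStats fields
// so a runaway goroutine stack shows up alongside RSS growth.
func logMemStats(every time.Duration) {
  for range time.Tick(every) {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    log.Printf("[Memory] Sys=%dMB HeapSys=%dMB HeapIdle=%dMB StackSys=%dMB StackInuse=%dMB",
      m.Sys>>20, m.HeapSys>>20, m.HeapIdle>>20, m.StackSys>>20, m.StackInuse>>20)
  }
}

func main() {
  logMemStats(2 * time.Second)
}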

Bug filed: https://github.com/golang/go/issues/32925. Second stdlib bug
building my little CRUD http server...

Thanks all for helping with the debugging!

On Wed, Jul 3, 2019 at 11:34 AM Tom Mitchell  wrote:

>
> On Mon, Jul 1, 2019 at 12:42 PM 'Yunchi Luo' via golang-nuts <
> golang-nuts@googlegroups.com> wrote:
>
>> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>>
>> I'm trying to debug OOM on a service we are running in k8s. The service
>> is just a CRUD server hitting a database (DynamoDB). Each replica serves
>> about 300 qps of traffic. There are no memory leaks. On occasion (seemingly
>> correlated to small latency spikes on the backend), the service would OOM.
>> This is surprising because it has a circuit breaker that drops requests
>> after 200 concurrent connections, which never trips, and goroutine profiles
>> confirm that there are nowhere near 200 active goroutines.
>>
>
> Just curious about the network connections.
> Is there a chance that the network connections are not getting closed and
> cleaned up for some reason?
> It was common for sockets to hang around in the thousands because a user
> was killing a slow tab or the browser and the full socket close never
> completed. The solution was to allow reliable connections to time out and
> finish closing, freeing up the memory. The application has closed the
> socket but the protocol has yet to get the last packet to complete the
> handshake. The shell equivalent would be zombie processes that still need
> to return the exit status but no process waits on the status. Debugging
> can be interesting in the shell case because of implied waits done by ps.
>
> How many connections does the system kernel think there are, and what
> state are they in?
> Look both locally and on the DB machine.
> The latency spikes can be a cause or a symptom.
> Look at the connections being made to the CRUD server and make sure they
> are being set up with short enough timers that they clean themselves up
> quickly. Is the CRUD server at risk of a denial of service or a random
> curious probe burst from an nmap script? Even firewall drops, near or far,
> can leave connections hanging in an incomplete state when an invalid
> connection is detected and blocked and long-timer reliable network
> connections are involved.
>

-- 
Yunchi



Re: [go-nuts] OOM occurring with a small heap

2019-07-03 Thread Tom Mitchell
On Mon, Jul 1, 2019 at 12:42 PM 'Yunchi Luo' via golang-nuts <
golang-nuts@googlegroups.com> wrote:

> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>
> I'm trying to debug OOM on a service we are running in k8s. The service is
> just a CRUD server hitting a database (DynamoDB). Each replica serves about
> 300 qps of traffic. There are no memory leaks. On occasion (seemingly
> correlated to small latency spikes on the backend), the service would OOM.
> This is surprising because it has a circuit breaker that drops requests
> after 200 concurrent connections, which never trips, and goroutine profiles
> confirm that there are nowhere near 200 active goroutines.
>

Just curious about the network connections.
Is there a chance that the network connections are not getting closed and
cleaned up for some reason?
It was common for sockets to hang around in the thousands because a user
was killing a slow tab or the browser and the full socket close never
completed. The solution was to allow reliable connections to time out and
finish closing, freeing up the memory. The application has closed the
socket but the protocol has yet to get the last packet to complete the
handshake. The shell equivalent would be zombie processes that still need
to return the exit status but no process waits on the status. Debugging
can be interesting in the shell case because of implied waits done by ps.

How many connections does the system kernel think there are, and what
state are they in?
Look both locally and on the DB machine.
The latency spikes can be a cause or a symptom.
Look at the connections being made to the CRUD server and make sure they
are being set up with short enough timers that they clean themselves up
quickly. Is the CRUD server at risk of a denial of service or a random
curious probe burst from an nmap script? Even firewall drops, near or far,
can leave connections hanging in an incomplete state when an invalid
connection is detected and blocked and long-timer reliable network
connections are involved.



Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread 'Yunchi Luo' via golang-nuts
Yeah, I've been looking at the goroutine profiles and there are some
strange stacks like the below.

 1   reflect.flag.mustBeExported /go/src/reflect/value.go:213
 reflect.Value.call /go/src/reflect/value.go:424
 reflect.Value.Call /go/src/reflect/value.go:308
 net/http/httptrace.(*ClientTrace).compose.func1
/go/src/net/http/httptrace/trace.go:203
 reflect.callReflect /go/src/reflect/value.go:536
 reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
 reflect.Value.call /go/src/reflect/value.go:447
 reflect.Value.Call /go/src/reflect/value.go:308
 net/http/httptrace.(*ClientTrace).compose.func1
/go/src/net/http/httptrace/trace.go:204
 reflect.callReflect /go/src/reflect/value.go:536
 reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
 reflect.Value.call /go/src/reflect/value.go:447
 reflect.Value.Call /go/src/reflect/value.go:308
 net/http/httptrace.(*ClientTrace).compose.func1
/go/src/net/http/httptrace/trace.go:204
 reflect.callReflect /go/src/reflect/value.go:536
 reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
 reflect.Value.call /go/src/reflect/value.go:447
 reflect.Value.Call /go/src/reflect/value.go:308
 net/http/httptrace.(*ClientTrace).compose.func1
/go/src/net/http/httptrace/trace.go:204
 reflect.callReflect /go/src/reflect/value.go:536
 reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
 reflect.Value.call /go/src/reflect/value.go:447
 reflect.Value.Call /go/src/reflect/value.go:308
 net/http/httptrace.(*ClientTrace).compose.func1
/go/src/net/http/httptrace/trace.go:204
 reflect.callReflect /go/src/reflect/value.go:536
 reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
 reflect.Value.call /go/src/reflect/value.go:447
 reflect.Value.Call /go/src/reflect/value.go:308
 net/http/httptrace.(*ClientTrace).compose.func1
/go/src/net/http/httptrace/trace.go:204
 reflect.callReflect /go/src/reflect/value.go:536
 reflect.makeFuncStub /go/src/reflect/asm_amd64.s:20
 reflect.Value.call /go/src/reflect/value.go:447

The reflect.MakeFunc is invoked by the httptrace package. From what I could
gather, when you call httptrace.WithClientTrace(context.Context,
*httptrace.ClientTrace), the package will use reflection to compose any
existing httptrace.ClientTrace objects on the context with the trace you
provide. I only provide 1 trace, so it's strange there are 6 traces being
composed.

The code for adding the trace is pretty straightforward:

// This is called exactly once on service startup to insert a hook before
// every request to DynamoDB (from the Go AWS SDK).
func instrument(dynamo *dynamodb.DynamoDB) {
  dynamo.Client.Handlers.Send.PushFront(metricHandler())
}

func metricHandler() func(r *request.Request) {
  // There should be exactly 1 instance of this trace object, since the
  // metricHandler() function is invoked once.
  tracer := &httptrace.ClientTrace{
    ConnectStart: func(_, _ string) {
      // Incrementing a prometheus counter
      newConns.Inc()
    },
  }
  return func(r *request.Request) {
    // This should get applied once per request
    r.SetContext(httptrace.WithClientTrace(r.Context(), tracer))
  }
}
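
One sanity check I may add (a rough sketch only -- metricHandlerChecked is a
hypothetical name, and it assumes the same imports as above plus "log";
httptrace.ContextClientTrace returns whatever trace is already attached to
the context):

func metricHandlerChecked() func(r *request.Request) {
  tracer := &httptrace.ClientTrace{
    ConnectStart: func(_, _ string) { newConns.Inc() },
  }
  return func(r *request.Request) {
    // If this ever logs, the Send handler is re-wrapping a trace that is
    // already on the request context (e.g. if the handler runs again on
    // retries), which would explain multiple composed traces.
    if existing := httptrace.ContextClientTrace(r.Context()); existing != nil {
      log.Printf("request context already carries a ClientTrace: %p", existing)
      return
    }
    r.SetContext(httptrace.WithClientTrace(r.Context(), tracer))
  }
}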

I've done a little digging along this vector, but there isn't much to look
at. A value from httptrace is applied to a hook in the AWS Go SDK.

On Tue, Jul 2, 2019 at 3:46 PM andrey mirtchovski 
wrote:

> ok, this is interesting:
>
> reflect.MakeFunc: i've never done this before. what are the allocation
> patterns for creating functions with reflect? i see a few crashes
> related to these functions but no mention of severe memory
> consumption.
>
> in my opinion, trying to capture MakeFunc patterns from your original
> program in a small example will be helpful.
>
> On Tue, Jul 2, 2019 at 1:28 PM Bakul Shah  wrote:
> >
> > Switching Go version seems like a stab in the dark. If the OOM symptom
> does show up, you have simply wasted time. If it doesn't show up, you still
> don't know if the bug exists and is simply hiding. Even if you think the
> bug is in Go code generation (or GC) and not in your code, there is nothing
> the Go developers can do without a concrete test. So then you have to keep
> changing things until the symptom disappears for a long time and cross
> fingers the bug went away
> >
> > If the symptom disappears when you take out httptrace, now you have
> narrowed the hypothesis to httptrace but you haven't proven it. You still
> don't know if removing httptrace changed the timing behavior such that the
> bug is hiding somewhere and if it will show up in future.
> >
> > This is why one should change as little as possible outside of what is
> needed for testing a specific hypothesis and why you should have a test
> that 

Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread andrey mirtchovski
ok, this is interesting:

reflect.MakeFunc: i've never done this before. what are the allocation
patterns for creating functions with reflect? i see a few crashes
related to these functions but no mention of severe memory
consumption.

in my opinion, trying to capture MakeFunc patterns from your original
program in a small example will be helpful.

On Tue, Jul 2, 2019 at 1:28 PM Bakul Shah  wrote:
>
> Switching Go version seems like a stab in the dark. If the OOM symptom does 
> show up, you have simply wasted time. If it doesn't show up, you still don't 
> know if the bug exists and is simply hiding. Even if you think the bug is in Go 
> code generation (or GC) and not in your code, there is nothing the Go 
> developers can do without a concrete test. So then you have to keep changing 
> things until the symptom disappears for a long time and cross fingers the bug 
> went away
>
> If the symptom disappears when you take out httptrace, now you have narrowed 
> the hypothesis to httptrace but you haven't proven it. You still don't know 
> if removing httptrace changed the timing behavior such that the bug is hiding 
> somewhere and if it will show up in future.
>
> This is why one should change as little as possible outside of what is needed 
> for testing a specific hypothesis and why you should have a test that tickles 
> the bug. By "proving" I mean you must find a specific assertion that is 
> violated.
>
> I'd instrument the code under question and somehow capture the history of 
> last N seconds just before OOM. Capturing heap profile is just one way to 
> look at what your code does. You can find other ways to look at what your 
> code does and create a log. For instance, the extra memory use is also a 
> symptom just like OOM; what you may want to check is whether your data 
> structures are consistent. The actual inconsistency may have occurred long 
> before the OOM crash.
>
> Note that the GC logs allow you to see some aspect of the GC behavior better 
> but if you don't understand it well enough, it may seem mysterious (compared 
> to your mental model) and later, when you run out of other hypotheses, it may 
> even seem suspicious, so I'd be careful about looking at such things :-)
>
> On Jul 2, 2019, at 11:16 AM, 'Yunchi Luo' via golang-nuts 
>  wrote:
>
> I removed the httptrace call yesterday and there have been no OOMs yet. Going 
> to let it bake for another day. If OOMs show up again, I'll try reverting to 
> an older Go version tomorrow. Otherwise I'll point my finger at httptrace I 
> guess.
>
> On Tue, Jul 2, 2019 at 2:15 PM Yunchi Luo  wrote:
>>
>> I did try to do that! I have 3 heap profiles captured from the ~3 seconds 
>> before crash. The only thing particularly suspicious is the httptrace call I 
>> mentioned earlier in the thread.
>>
>> Diffing 1 to 2
>> (pprof) cum
>> (pprof) top 50
>> Showing nodes accounting for 4604.15kB, 81.69% of 5636.17kB total
>>   flat  flat%   sum%cum   cum%
>>  0 0% 0%  5120.16kB 90.84%  
>> net/http/httptrace.(*ClientTrace).compose.func1
>>  0 0% 0%  5120.16kB 90.84%  reflect.Value.Call
>>  0 0% 0%  5120.16kB 90.84%  reflect.Value.call
>>  0 0% 0%  5120.16kB 90.84%  reflect.callReflect
>>  5120.16kB 90.84% 90.84%  5120.16kB 90.84%  reflect.funcLayout.func1
>>  0 0% 90.84%  5120.16kB 90.84%  reflect.makeFuncStub
>>  0 0% 90.84%  4604.15kB 81.69%  sync.(*Pool).Get
>>  0 0% 90.84%  -516.01kB  9.16%  io.Copy
>>  0 0% 90.84%  -516.01kB  9.16%  io.copyBuffer
>>  0 0% 90.84%  -516.01kB  9.16%  io/ioutil.devNull.ReadFrom
>>  -516.01kB  9.16% 81.69%  -516.01kB  9.16%  io/ioutil.glob..func1
>>
>> Diff 2 to 3
>> (pprof) top 50
>> Showing nodes accounting for 7680.44kB, 100% of 7680.44kB total
>>   flat  flat%   sum%cum   cum%
>>  0 0% 0%  6144.18kB 80.00%  
>> net/http/httptrace.(*ClientTrace).compose.func1
>>  0 0% 0%  6144.18kB 80.00%  reflect.Value.Call
>>  0 0% 0%  6144.18kB 80.00%  reflect.Value.call
>>   512.01kB  6.67%  6.67%  6144.18kB 80.00%  reflect.callReflect
>>  0 0%  6.67%  6144.18kB 80.00%  reflect.makeFuncStub
>>  5632.17kB 73.33% 80.00%  5632.17kB 73.33%  reflect.funcLayout.func1
>>  0 0% 80.00%  5632.17kB 73.33%  sync.(*Pool).Get
>>  0 0% 80.00%  1024.23kB 13.34%  
>> github.com/aws/aws-sdk-go/aws/request.(*HandlerList).Run
>>  0 0% 80.00%  1024.23kB 13.34%  
>> github.com/aws/aws-sdk-go/aws/request.(*Request).Send
>>  0 0% 80.00%  1024.23kB 13.34%  
>> github.com/aws/aws-sdk-go/aws/request.(*Request).Sign
>>  0 0% 80.00%  1024.23kB 13.34%  
>> github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).build
>>  0 0% 80.00%  1024.23kB 13.34%  
>> github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequest
>>  0 0% 80.00%  1024.23kB 13.34%  
>> 

Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread Bakul Shah
Switching Go version seems like a stab in the dark. If the OOM symptom does 
show up, you have simply wasted time. If it doesn't show up, you still don't 
know if the bug exists and is simply hiding. Even if you think the bug is in Go 
code generation (or GC) and not in your code, there is nothing the Go 
developers can do without a concrete test. So then you have to keep changing 
things until the symptom disappears for a long time and cross your fingers 
that the bug went away.

If the symptom disappears when you take out httptrace, now you have narrowed 
the hypothesis to httptrace but you haven't proven it. You still don't know if 
removing httptrace changed the timing behavior such that the bug is hiding 
somewhere and if it will show up in future.

This is why one should change as little as possible outside of what is needed 
for testing a specific hypothesis and why you should have a test that tickles 
the bug. By "proving" I mean you must find a specific assertion that is 
violated.

I'd instrument the code under question and somehow capture the history of last 
N seconds just before OOM. Capturing heap profile is just one way to look at 
what your code does. You can find other ways to look at what your code does and 
create a log. For instance, the extra memory use is also a symptom just like 
OOM; what you may want to check is whether your data structures are consistent. 
The actual inconsistency may have occurred long before the OOM crash.

Note that the GC logs allow you to see some aspect of the GC behavior better 
but if you don't understand it well enough, it may seem mysterious (compared to 
your mental model) and later, when you run out of other hypotheses, it may even 
seem suspicious, so I'd be careful about looking at such things :-)

> On Jul 2, 2019, at 11:16 AM, 'Yunchi Luo' via golang-nuts 
>  wrote:
> 
> I removed the httptrace call yesterday and there have been no OOMs yet. Going 
> to let it bake for another day. If OOMs show up again, I'll try reverting to 
> an older Go version tomorrow. Otherwise I'll point my finger at httptrace I 
> guess.
> 
> On Tue, Jul 2, 2019 at 2:15 PM Yunchi Luo  > wrote:
> I did try to do that! I have 3 heap profiles captured from the ~3 seconds 
> before crash. The only thing particularly suspicious is the httptrace call I 
> mentioned earlier in the thread.
> 
> Diffing 1 to 2
> (pprof) cum
> (pprof) top 50
> Showing nodes accounting for 4604.15kB, 81.69% of 5636.17kB total
>   flat  flat%   sum%cum   cum%
>  0 0% 0%  5120.16kB 90.84%  
> net/http/httptrace.(*ClientTrace).compose.func1
>  0 0% 0%  5120.16kB 90.84%  reflect.Value.Call
>  0 0% 0%  5120.16kB 90.84%  reflect.Value.call
>  0 0% 0%  5120.16kB 90.84%  reflect.callReflect
>  5120.16kB 90.84% 90.84%  5120.16kB 90.84%  reflect.funcLayout.func1
>  0 0% 90.84%  5120.16kB 90.84%  reflect.makeFuncStub
>  0 0% 90.84%  4604.15kB 81.69%  sync.(*Pool).Get
>  0 0% 90.84%  -516.01kB  9.16%  io.Copy
>  0 0% 90.84%  -516.01kB  9.16%  io.copyBuffer
>  0 0% 90.84%  -516.01kB  9.16%  io/ioutil.devNull.ReadFrom
>  -516.01kB  9.16% 81.69%  -516.01kB  9.16%  io/ioutil.glob..func1
> 
> Diff 2 to 3
> (pprof) top 50
> Showing nodes accounting for 7680.44kB, 100% of 7680.44kB total
>   flat  flat%   sum%cum   cum%
>  0 0% 0%  6144.18kB 80.00%  
> net/http/httptrace.(*ClientTrace).compose.func1
>  0 0% 0%  6144.18kB 80.00%  reflect.Value.Call
>  0 0% 0%  6144.18kB 80.00%  reflect.Value.call
>   512.01kB  6.67%  6.67%  6144.18kB 80.00%  reflect.callReflect
>  0 0%  6.67%  6144.18kB 80.00%  reflect.makeFuncStub
>  5632.17kB 73.33% 80.00%  5632.17kB 73.33%  reflect.funcLayout.func1
>  0 0% 80.00%  5632.17kB 73.33%  sync.(*Pool).Get
>  0 0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*HandlerList).Run
>  0 0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*Request).Send
>  0 0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*Request).Sign
>  0 0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).build
>  0 0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequest
>  0 0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequestWithCurrentTime

Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread 'Yunchi Luo' via golang-nuts
I removed the httptrace call yesterday and there have been no OOMs yet.
Going to let it bake for another day. If OOMs show up again, I'll try
reverting to an older Go version tomorrow. Otherwise I'll point my finger
at httptrace I guess.

On Tue, Jul 2, 2019 at 2:15 PM Yunchi Luo  wrote:

> I did try to do that! I have 3 heap profiles captured from the ~3 seconds
> before crash. The only thing particularly suspicious is the httptrace call
> I mentioned earlier in the thread.
>
> Diffing 1 to 2
> (pprof) cum
> (pprof) top 50
> Showing nodes accounting for 4604.15kB, 81.69% of 5636.17kB total
>   flat  flat%   sum%cum   cum%
>  0 0% 0%  5120.16kB 90.84%
>  net/http/httptrace.(*ClientTrace).compose.func1
>  0 0% 0%  5120.16kB 90.84%  reflect.Value.Call
>  0 0% 0%  5120.16kB 90.84%  reflect.Value.call
>  0 0% 0%  5120.16kB 90.84%  reflect.callReflect
>  5120.16kB 90.84% 90.84%  5120.16kB 90.84%  reflect.funcLayout.func1
>  0 0% 90.84%  5120.16kB 90.84%  reflect.makeFuncStub
>  0 0% 90.84%  4604.15kB 81.69%  sync.(*Pool).Get
>  0 0% 90.84%  -516.01kB  9.16%  io.Copy
>  0 0% 90.84%  -516.01kB  9.16%  io.copyBuffer
>  0 0% 90.84%  -516.01kB  9.16%  io/ioutil.devNull.ReadFrom
>  -516.01kB  9.16% 81.69%  -516.01kB  9.16%  io/ioutil.glob..func1
>
> Diff 2 to 3
> (pprof) top 50
> Showing nodes accounting for 7680.44kB, 100% of 7680.44kB total
>   flat  flat%   sum%cum   cum%
>  0 0% 0%  6144.18kB 80.00%
>  net/http/httptrace.(*ClientTrace).compose.func1
>  0 0% 0%  6144.18kB 80.00%  reflect.Value.Call
>  0 0% 0%  6144.18kB 80.00%  reflect.Value.call
>   512.01kB  6.67%  6.67%  6144.18kB 80.00%  reflect.callReflect
>  0 0%  6.67%  6144.18kB 80.00%  reflect.makeFuncStub
>  5632.17kB 73.33% 80.00%  5632.17kB 73.33%  reflect.funcLayout.func1
>  0 0% 80.00%  5632.17kB 73.33%  sync.(*Pool).Get
>  0 0% 80.00%  1024.23kB 13.34%
> github.com/aws/aws-sdk-go/aws/request.(*HandlerList).Run
>  0 0% 80.00%  1024.23kB 13.34%
> github.com/aws/aws-sdk-go/aws/request.(*Request).Send
>  0 0% 80.00%  1024.23kB 13.34%
> github.com/aws/aws-sdk-go/aws/request.(*Request).Sign
>  0 0% 80.00%  1024.23kB 13.34%
> github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).build
>  0 0% 80.00%  1024.23kB 13.34%
> github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequest
>  0 0% 80.00%  1024.23kB 13.34%
> github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequestWithCurrentTime
>  0 0% 80.00%  1024.23kB 13.34%
> github.com/aws/aws-sdk-go/aws/signer/v4.Signer.signWithBody
>  0 0% 80.00%  1024.23kB 13.34%
> github.com/aws/aws-sdk-go/service/dynamodb.(*DynamoDB).GetItemWithContext
>
> On Tue, Jul 2, 2019 at 2:08 PM andrey mirtchovski 
> wrote:
>
>> What I have found useful in the past is pprof's ability to diff profiles.
>> That means that if you capture heap profiles at regular intervals you can
>> see a much smaller subset of changes and compare allocation patterns.
>>
>> On Tue, Jul 2, 2019, 10:53 AM 'Yunchi Luo' via golang-nuts <
>> golang-nuts@googlegroups.com> wrote:
>>
>>> I'm not so much pointing my finger at GC as I am hoping GC logs could
>>> help tell the story, and that someone with a strong understanding of GC in
>>> Go could weigh in here. In the last 4 seconds before OOM, "TotalAlloc"
>>> increased by only 80M, yet "HeapIdle" increased to 240M from 50M, RSS
>>> increased by 810M. The numbers don't add up for me. A running sum of 80M of
>>> heap objects were allocated in the time RSS increased by 10X that. If GC
>>> was completely off, I still wouldn't expect this to happen, which makes me
>>> want to rule out GC being blocked as a problem. Maybe there was runaway
>>> heap reservation because something in my code caused a ton of
>>> fragmentation? Is that sane? The heap profile lacking clues is also strange.
>>>
>>> Regarding the possibility of a race, I forgot I do have goroutine
>>> profiles captured along with the heap profiles at the time memory exploded.
>>> There are only 10 goroutines running on the serving path, which rules out
>>> too many concurrent requests being served (please correct me if I'm wrong).
>>> Those fan out to 13 goroutines talking to the db, all of which are in
>>> http.Transport.roundTrip (which is blocked on runtime.gopark so I assume
>>> they are waiting on the TCP connection). All other goroutines that don't
>>> originate in the stdlib are also blocked on `select` or sleeping. Our CI
>>> does run with go test -race, but I'll try doing some load testing with a
>>> race detector enabled binary.
>>>
>>> Bakul, that is sound advice. I've actually been debugging this on and
>>> off for a couple months now, with the help of several people, a few of
>>> which have peer reviewed the code. I agree it's most likely 

Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread 'Yunchi Luo' via golang-nuts
I did try to do that! I have 3 heap profiles captured from the ~3 seconds
before crash. The only thing particularly suspicious is the httptrace call
I mentioned earlier in the thread.

Diffing 1 to 2
(pprof) cum
(pprof) top 50
Showing nodes accounting for 4604.15kB, 81.69% of 5636.17kB total
  flat  flat%   sum%cum   cum%
 0 0% 0%  5120.16kB 90.84%
 net/http/httptrace.(*ClientTrace).compose.func1
 0 0% 0%  5120.16kB 90.84%  reflect.Value.Call
 0 0% 0%  5120.16kB 90.84%  reflect.Value.call
 0 0% 0%  5120.16kB 90.84%  reflect.callReflect
 5120.16kB 90.84% 90.84%  5120.16kB 90.84%  reflect.funcLayout.func1
 0 0% 90.84%  5120.16kB 90.84%  reflect.makeFuncStub
 0 0% 90.84%  4604.15kB 81.69%  sync.(*Pool).Get
 0 0% 90.84%  -516.01kB  9.16%  io.Copy
 0 0% 90.84%  -516.01kB  9.16%  io.copyBuffer
 0 0% 90.84%  -516.01kB  9.16%  io/ioutil.devNull.ReadFrom
 -516.01kB  9.16% 81.69%  -516.01kB  9.16%  io/ioutil.glob..func1

Diff 2 to 3
(pprof) top 50
Showing nodes accounting for 7680.44kB, 100% of 7680.44kB total
  flat  flat%   sum%cum   cum%
 0 0% 0%  6144.18kB 80.00%
 net/http/httptrace.(*ClientTrace).compose.func1
 0 0% 0%  6144.18kB 80.00%  reflect.Value.Call
 0 0% 0%  6144.18kB 80.00%  reflect.Value.call
  512.01kB  6.67%  6.67%  6144.18kB 80.00%  reflect.callReflect
 0 0%  6.67%  6144.18kB 80.00%  reflect.makeFuncStub
 5632.17kB 73.33% 80.00%  5632.17kB 73.33%  reflect.funcLayout.func1
 0 0% 80.00%  5632.17kB 73.33%  sync.(*Pool).Get
 0 0% 80.00%  1024.23kB 13.34%
github.com/aws/aws-sdk-go/aws/request.(*HandlerList).Run
 0 0% 80.00%  1024.23kB 13.34%
github.com/aws/aws-sdk-go/aws/request.(*Request).Send
 0 0% 80.00%  1024.23kB 13.34%
github.com/aws/aws-sdk-go/aws/request.(*Request).Sign
 0 0% 80.00%  1024.23kB 13.34%
github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).build
 0 0% 80.00%  1024.23kB 13.34%
github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequest
 0 0% 80.00%  1024.23kB 13.34%
github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequestWithCurrentTime
 0 0% 80.00%  1024.23kB 13.34%
github.com/aws/aws-sdk-go/aws/signer/v4.Signer.signWithBody
 0 0% 80.00%  1024.23kB 13.34%
github.com/aws/aws-sdk-go/service/dynamodb.(*DynamoDB).GetItemWithContext

On Tue, Jul 2, 2019 at 2:08 PM andrey mirtchovski 
wrote:

> What I have found useful in the past is pprof's ability to diff profiles.
> That means that if you capture heap profiles at regular intervals you can
> see a much smaller subset of changes and compare allocation patterns.
>
> On Tue, Jul 2, 2019, 10:53 AM 'Yunchi Luo' via golang-nuts <
> golang-nuts@googlegroups.com> wrote:
>
>> I'm not so much pointing my finger at GC as I am hoping GC logs could
>> help tell the story, and that someone with a strong understanding of GC in
>> Go could weigh in here. In the last 4 seconds before OOM, "TotalAlloc"
>> increased by only 80M, yet "HeapIdle" increased to 240M from 50M, RSS
>> increased by 810M. The numbers don't add up for me. A running sum of 80M of
>> heap objects were allocated in the time RSS increased by 10X that. If GC
>> was completely off, I still wouldn't expect this to happen, which makes me
>> want to rule out GC being blocked as a problem. Maybe there was runaway
>> heap reservation because something in my code caused a ton of
>> fragmentation? Is that sane? The heap profile lacking clues is also strange.
>>
>> Regarding the possibility of a race, I forgot I do have goroutine
>> profiles captured along with the heap profiles at the time memory exploded.
>> There are only 10 goroutines running on the serving path, which rules out
>> too many concurrent requests being served (please correct me if I'm wrong).
>> Those fan out to 13 goroutines talking to the db, all of which are in
>> http.Transport.roundTrip (which is blocked on runtime.gopark so I assume
>> they are waiting on the TCP connection). All other goroutines that don't
>> originate in the stdlib are also blocked on `select` or sleeping. Our CI
>> does run with go test -race, but I'll try doing some load testing with a
>> race detector enabled binary.
>>
>> Bakul, that is sound advice. I've actually been debugging this on and off
>> for a couple months now, with the help of several people, a few of which
>> have peer reviewed the code. I agree it's most likely to be some runaway
>> code that I caused in my logic, but we haven't been able to pin-point the
>>> cause and I've run out of hypotheses to test at the moment. This is why
>> I've started asking on go-nuts@. The circuit breaker code was one of the
>> first things I checked, has been unit tested and verified working with load
>> tests. Now that you mention it, I actually did uncover a Go stdlib bug in
>> http2 while doing the 

Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread Robert Engels
Did you try running on an older release of Go, like 1.10?

> On Jul 2, 2019, at 11:53 AM, 'Yunchi Luo' via golang-nuts 
>  wrote:
> 
> I'm not so much pointing my finger at GC as I am hoping GC logs could help 
> tell the story, and that someone with a strong understanding of GC in Go 
> could weigh in here. In the last 4 seconds before OOM, "TotalAlloc" increased 
> by only 80M, yet "HeapIdle" increased to 240M from 50M, RSS increased by 
> 810M. The numbers don't add up for me. A running sum of 80M of heap objects 
> were allocated in the time RSS increased by 10X that. If GC was completely 
> off, I still wouldn't expect this to happen, which makes me want to rule out 
> GC being blocked as a problem. Maybe there was runaway heap reservation 
> because something in my code caused a ton of fragmentation? Is that sane? The 
> heap profile lacking clues is also strange.
> 
> Regarding the possibility of a race, I forgot I do have goroutine profiles 
> captured along with the heap profiles at the time memory exploded. There are 
> only 10 goroutines running on the serving path, which rules out too many 
> concurrent requests being served (please correct me if I'm wrong). Those fan 
> out to 13 goroutines talking to the db, all of which are in 
> http.Transport.roundTrip (which is blocked on runtime.gopark so I assume they 
> are waiting on the TCP connection). All other goroutines that don't originate 
> in the stdlib are also blocked on `select` or sleeping. Our CI does run with 
> go test -race, but I'll try doing some load testing with a race detector 
> enabled binary.
> 
> Bakul, that is sound advice. I've actually been debugging this on and off for 
> a couple months now, with the help of several people, a few of which have 
> peer reviewed the code. I agree it's most likely to be some runaway code that 
> I caused in my logic, but we haven't been able to pin-point the cause and 
> I've run out of hypotheses to test at the moment. This is why I've started 
> asking on go-nuts@. The circuit breaker code was one of the first things I 
> checked, has been unit tested and verified working with load tests. Now that 
> you mention it, I actually did uncover a Go stdlib bug in http2 while doing 
> the load tests... but that's unrelated.
> 
> 
>> On Tue, Jul 2, 2019 at 2:24 AM Bakul Shah  wrote:
>> Before assuming it is the GC or something system related, you may wish to 
>> verify it is *not your own logic*. Larger RSS could also be due to your own 
>> logic touching more and more memory due to some runaway effect. The 
>> probability this has to do with GC is very low given the very widespread use 
>> of Go and the probability of a bug in new code is very high given it is used 
>> on a much much smaller scale. 
>> 
>> This has the "smell" of a concurrency bug. If I were you I'd test the code 
>> for any races, I'd review the code thoroughly with someone who doesn't know 
>> the code so that I'm forced to explain it, and I'd add plenty of assertions. 
>> I'd probably first look at the circuit breaker code -- things like how does 
>> it know how many concurrent connections exist?
>> 
>> In general, any hypothesis you come up with, you should have a test that 
>> *catches* the bug given the hypothesis. Elusive bugs tend to become more 
>> elusive as you are on the hunt and as you may fix other problems you 
>> discover on the way.
>> 
>> I even suspect you're looking at GC logs a bit too early. Instrument your 
>> own code and look at what patterns emerge. [Not to mention any time you 
>> spend on understanding your code will help improve your service; but better 
>> understanding of and debugging the GC won't necessarily help you!]
>> 
>>> On Jul 1, 2019, at 12:14 PM, 'Yunchi Luo' via golang-nuts 
>>>  wrote:
>>> 
>>> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>>> 
>>> I'm trying to debug OOM on a service we are running in k8s. The service is 
>>> just a CRUD server hitting a database (DynamoDB). Each replica serves about 
>>> 300 qps of traffic. There are no memory leaks. On occasion (seemingly 
>>> correlated to small latency spikes on the backend), the service would OOM. 
>>> This is surprising because it has a circuit breaker that drops requests 
>>> after 200 concurrent connections, which never trips, and goroutine profiles 
>>> confirm that there are nowhere near 200 active goroutines.
>>> 
>>> GC logs are pasted below. It's interlaced with dumps of runtime.Memstats 
>>> (the RSS number is coming from /proc//stats). Go version is 1.12.5, 
>>> running an Alpine 3.10 container in an Amazon kernel 
>>> 4.14.123-111.109.amzn2.x86_64.
>>> 
>>> The service happily serves requests using ~50MB of RSS for hours, until the 
>>> last 2 seconds, where GC mark time starts to increase 2-4X per cycle 
>>> (43+489/158/0.60+0.021 ms cpu => 43+489/158/0.60+0.021 ms cpu), and RSS and 
>>> Sys blow up. It’s also interesting that in the last log line: `Sys=995MB 
>>> RSS=861MB HeapSys=199MB`. If I’m 

Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread andrey mirtchovski
What I have found useful in the past is pprof's ability to diff profiles.
That means that if you capture heap profiles at regular intervals you can
see a much smaller subset of changes and compare allocation patterns.
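
for example, something like this sketch (the directory, file naming, and
interval are arbitrary; error handling and rotation are omitted) dumps a
numbered heap profile every few seconds, and then any two snapshots can be
diffed with pprof's -base flag:

package main

import (
  "fmt"
  "os"
  "runtime/pprof"
  "time"
)

// dumpHeapProfiles writes a numbered heap profile on every tick so the
// snapshots can be compared after a crash.
func dumpHeapProfiles(dir string, every time.Duration) {
  for i := 0; ; i++ {
    f, err := os.Create(fmt.Sprintf("%s/heap-%03d.pb.gz", dir, i))
    if err == nil {
      pprof.Lookup("heap").WriteTo(f, 0)
      f.Close()
    }
    time.Sleep(every)
  }
}

func main() {
  go dumpHeapProfiles("/tmp", 5*time.Second)
  select {} // stand-in for the real server's main loop
}

then: go tool pprof -base heap-001.pb.gz heap-002.pb.gz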

On Tue, Jul 2, 2019, 10:53 AM 'Yunchi Luo' via golang-nuts <
golang-nuts@googlegroups.com> wrote:

> I'm not so much pointing my finger at GC as I am hoping GC logs could help
> tell the story, and that someone with a strong understanding of GC in Go
> could weigh in here. In the last 4 seconds before OOM, "TotalAlloc"
> increased by only 80M, yet "HeapIdle" increased to 240M from 50M, RSS
> increased by 810M. The numbers don't add up for me. A running sum of 80M of
> heap objects were allocated in the time RSS increased by 10X that. If GC
> was completely off, I still wouldn't expect this to happen, which makes me
> want to rule out GC being blocked as a problem. Maybe there was runaway
> heap reservation because something in my code caused a ton of
> fragmentation? Is that sane? The heap profile lacking clues is also strange.
>
> Regarding the possibility of a race, I forgot I do have goroutine profiles
> captured along with the heap profiles at the time memory exploded. There
> are only 10 goroutines running on the serving path, which rules out too
> many concurrent requests being served (please correct me if I'm wrong).
> Those fan out to 13 goroutines talking to the db, all of which are in
> http.Transport.roundTrip (which is blocked on runtime.gopark so I assume
> they are waiting on the TCP connection). All other goroutines that don't
> originate in the stdlib are also blocked on `select` or sleeping. Our CI
> does run with go test -race, but I'll try doing some load testing with a
> race detector enabled binary.
>
> Bakul, that is sound advice. I've actually been debugging this on and off
> for a couple months now, with the help of several people, a few of which
> have peer reviewed the code. I agree it's most likely to be some runaway
> code that I caused in my logic, but we haven't been able to pin-point the
> cause and I've run out of hypotheses to test at the moment. This is why
> I've started asking on go-nuts@. The circuit breaker code was one of the
> first things I checked, has been unit tested and verified working with load
> tests. Now that you mention it, I actually did uncover a Go stdlib bug in
> http2 while doing the load tests... but that's unrelated.
>
>
> On Tue, Jul 2, 2019 at 2:24 AM Bakul Shah  wrote:
>
>> Before assuming it is the GC or something system related, you may wish to
>> verify it is *not your own logic*. Larger RSS could also be due to your own
>> logic touching more and more memory due to some runaway effect. The
>> probability this has to do with GC is very low given the very widespread
>> use of Go and the probability of a bug in new code is very high given it is
>> used on a much much smaller scale.
>>
>> This has the "smell" of a concurrency bug. If I were you I'd test the
>> code for any races, I'd review the code thoroughly with someone who doesn't
>> know the code so that I'm forced to explain it, and I'd add plenty of
>> assertions. I'd probably first look at the circuit breaker code -- things
>> like how does it know how many concurrent connections exist?
>>
>> In general, any hypothesis you come up with, you should have a test that
>> *catches* the bug given the hypothesis. Elusive bugs tend to become more
>> elusive as you are on the hunt and as you may fix other problems you
>> discover on the way.
>>
>> I even suspect you're looking at GC logs a bit too early. Instrument your
>> own code and look at what patterns emerge. [Not to mention any time you
>> spend on understanding your code will help improve your service; but better
>> understanding of and debugging the GC won't necessarily help you!]
>>
>> On Jul 1, 2019, at 12:14 PM, 'Yunchi Luo' via golang-nuts <
>> golang-nuts@googlegroups.com> wrote:
>>
>> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>>
>> I'm trying to debug OOM on a service we are running in k8s. The service
>> is just a CRUD server hitting a database (DynamoDB). Each replica serves
>> about 300 qps of traffic. There are no memory leaks. On occasion (seemingly
>> correlated to small latency spikes on the backend), the service would OOM.
>> This is surprising because it has a circuit breaker that drops requests
>> after 200 concurrent connections, which never trips, and goroutine profiles
>> confirm that there are nowhere near 200 active goroutines.
>>
>> GC logs are pasted below. It's interlaced with dumps of runtime.Memstats
>> (the RSS number is coming from /proc//stats). Go version is 1.12.5,
>> running an Alpine 3.10 container in an Amazon kernel
>> 4.14.123-111.109.amzn2.x86_64.
>>
>> The service happily serves requests using ~50MB of RSS for hours, until
>> the last 2 seconds, where GC mark time starts to increase 2-4X per cycle 
>> (43+489/158/0.60+0.021
>> ms cpu => 43+489/158/0.60+0.021 ms 

Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread 'Yunchi Luo' via golang-nuts
I'm not so much pointing my finger at GC as I am hoping GC logs could help
tell the story, and that someone with a strong understanding of GC in Go
could weigh in here. In the last 4 seconds before OOM, "TotalAlloc"
increased by only 80M, yet "HeapIdle" increased to 240M from 50M, RSS
increased by 810M. The numbers don't add up for me. A running sum of 80M of
heap objects were allocated in the time RSS increased by 10X that. If GC
was completely off, I still wouldn't expect this to happen, which makes me
want to rule out GC being blocked as a problem. Maybe there was runaway
heap reservation because something in my code caused a ton of
fragmentation? Is that sane? The heap profile lacking clues is also strange.

Regarding the possibility of a race, I forgot I do have goroutine profiles
captured along with the heap profiles at the time memory exploded. There
are only 10 goroutines running on the serving path, which rules out too
many concurrent requests being served (please correct me if I'm wrong).
Those fan out to 13 goroutines talking to the db, all of which are in
http.Transport.roundTrip (which is blocked on runtime.gopark so I assume
they are waiting on the TCP connection). All other goroutines that don't
originate in the stdlib are also blocked on `select` or sleeping. Our CI
does run with go test -race, but I'll try doing some load testing with a
race detector enabled binary.

Bakul, that is sound advice. I've actually been debugging this on and off
for a couple months now, with the help of several people, a few of which
have peer reviewed the code. I agree it's most likely to be some runaway
code that I caused in my logic, but we haven't been able to pin-point the
cause and I've run out of hypotheses to test at the moment. This is why
I've started asking on go-nuts@. The circuit breaker code was one of the
first things I checked, has been unit tested and verified working with load
tests. Now that you mention it, I actually did uncover a Go stdlib bug in
http2 while doing the load tests... but that's unrelated.


On Tue, Jul 2, 2019 at 2:24 AM Bakul Shah  wrote:

> Before assuming it is the GC or something system related, you may wish to
> verify it is *not your own logic*. Larger RSS could also be due to your own
> logic touching more and more memory due to some runaway effect. The
> probability this has to do with GC is very low given the very widespread
> use of Go and the probability of a bug in new code is very high given it is
> used on a much much smaller scale.
>
> This has the "smell" of a concurrency bug. If I were you I'd test the code
> for any races, I'd review the code thoroughly with someone who doesn't know
> the code so that I'm forced to explain it, and I'd add plenty of
> assertions. I'd probably first look at the circuit breaker code -- things
> like how does it know how many concurrent connections exist?
>
> In general, any hypothesis you come up with, you should have a test that
> *catches* the bug given the hypothesis. Elusive bugs tend to become more
> elusive as you are on the hunt and as you may fix other problems you
> discover on the way.
>
> I even suspect you're looking at GC logs a bit too early. Instrument your
> own code and look at what patterns emerge. [Not to mention any time you
> spend on understanding your code will help improve your service; but better
> understanding of and debugging the GC won't necessarily help you!]
>
> On Jul 1, 2019, at 12:14 PM, 'Yunchi Luo' via golang-nuts <
> golang-nuts@googlegroups.com> wrote:
>
> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>
> I'm trying to debug OOM on a service we are running in k8s. The service is
> just a CRUD server hitting a database (DynamoDB). Each replica serves about
> 300 qps of traffic. There are no memory leaks. On occasion (seemingly
> correlated to small latency spikes on the backend), the service would OOM.
> This is surprising because it has a circuit breaker that drops requests
> after 200 concurrent connections, which never trips, and goroutine profiles
> confirm that there are nowhere near 200 active goroutines.
>
> GC logs are pasted below. It's interlaced with dumps of runtime.Memstats
> (the RSS number is coming from /proc//stats). Go version is 1.12.5,
> running an Alpine 3.10 container in an Amazon kernel
> 4.14.123-111.109.amzn2.x86_64.
>
> The service happily serves requests using ~50MB of RSS for hours, until
> the last 2 seconds, where GC mark time starts to increase 2-4X per cycle 
> (43+489/158/0.60+0.021
> ms cpu => 43+489/158/0.60+0.021 ms cpu), and RSS and Sys blow up. It’s
> also interesting that in the last log line: `Sys=995MB RSS=861MB
> HeapSys=199MB`. If I’m reading this correctly, there’s at least `662MB` of
> memory in RSS that is not assigned to the heap. Though this might be due to
> the change in 1.12 to use MADV_FREE, so the pages are freeable but not yet
> reclaimed by the kernel.
>
> I don’t understand how heap can be so small across gc cycles 

Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread Mighty Guava
I don't understand. What would adding runtime.Gosched() do here? I don't 
have any explicit loops in this service.
On Jul 1, 2019, 9:11 PM -0400, Michael Jones , wrote:
> Does adding runtime.Gosched() calls make any difference?
>
> > On Mon, Jul 1, 2019 at 5:37 PM 'Yunchi Luo' via golang-nuts 
> >  wrote:
> > > Following that logic, a leak of TCP connections should manifest as a file 
> > > descriptor leak. We have the process_open_fds metric from Prometheus, 
> > > that is the number of open file descriptors as found in /proc//fd. 
> > > The number of descriptors overtime correlates well with the amount of 
> > > traffic, pretty cyclic. There doesn't appear to be a leak.
> > >
> > > We don't do our own memory management and the binary is compiled with 
> > > CGO_ENABLED=0.
> > >
> > > I still think the issue I'm seeing should be GC (or heap) related, given 
> > > the explosion in mark & sweep time, HeapSys, HeapIdle, and HeapReleased 
> > > just before the process dies. But I'm lacking ideas on how to track down 
> > > the cause of the increase.
> > >
> > > > On Mon, Jul 1, 2019 at 5:39 PM Robert Engels  
> > > > wrote:
> > > > > I think don't think you are going to find it in the 'heap', rather it 
> > > > > would be in native memory.
> > > > >
> > > > > I would use the monitor the /proc/[pid] for the process, and pay 
> > > > > attention to the 'fd','net' and 'statm' - if my theory is correct you 
> > > > > will see growth here long before the process is killed. Since you are 
> > > > > running under k8s and cgroups, you will need to do this along side 
> > > > > the Go process (unless you have root access to the server).
> > > > >
> > > > > I 'think' depending on kernel version, that kernel memory used goes 
> > > > > against the process for OOM purposes, so this is a likely candidate 
> > > > > if pprof is showing nothing.
> > > > >
> > > > > Do you by chance do any of your own memory management (via 
> > > > > malloc/CGO)? If so, this is not going to show in pprof either.
> > > > > > -Original Message-
> > > > > > From: 'Yunchi Luo' via golang-nuts
> > > > > > Sent: Jul 1, 2019 4:26 PM
> > > > > > To: Robert Engels
> > > > > > Cc: golang-nuts@googlegroups.com, Alec Thomas
> > > > > > Subject: Re: [go-nuts] OOM occurring with a small heap
> > > > > >
> > > > > > I actually have a heap profile (pasted at the bottom) from about 1 
> > > > > > second before the service died (the goroutine that is logging 
> > > > > > "[Memory]" triggers heap profiles once RSS > 100MB). I don't see 
> > > > > > TCP connections there. Maybe it's too few to be sampled. How would 
> > > > > > I verify your theory? That the service dies within 2 seconds after 
> > > > > > several hours makes it very hard to debug.
> > > > > >
> > > > > > The top thing in the heap profile is from the reflect package. I 
> > > > > > initially found that suspect, but it turns out this comes from a 
> > > > > > use of httptrace.ClientTrace I had for counting new connections to 
> > > > > > DynamoDB.
> > > > > >
> > > > > >  tracer := &httptrace.ClientTrace{
> > > > > >    ConnectStart: func(_, _ string) {
> > > > > >      newConns.Inc()
> > > > > >    },
> > > > > > }
> > > > > >
> > > > > > `newConns` is just a prometheus counter. The `tracer` object itself 
> > > > > > is created once and re-used with every client request context. On 
> > > > > > request, `httptrace.WithClientTrace(ctx, tracer)` uses reflection 
> > > > > > to compose the trace functions under-the-hood and uses reflection 
> > > > > > to invoke it, hence the reflect.funcLayout and reflect.Value.call. 
> > > > > > Objects in `reflect` account for about 50% of heap in terms of 
> > > > > > size, and does seem to grow as the service is running out of 
> > > > > > memory, but that's only 12MB so I thought it was a red herring.
> > > > > >
> > > > > > Heap profile:
> > > > > > Type: inuse_space
> > > > > > Time: Jun 30, 2019 at 4:46a

Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread Mighty Guava
I don't think GC is being blocked. GC ran multiple times during the time the 
service was exploding in memory.

I'm not using mmap in the service. It's just a straightforward CRUD webserver 
backed by DynamoDB. The only thing non-standard I can think of is that it 
connects to its backing database over HTTP/1.1 (using AWS Go SDK) instead of a 
custom database protocol. It's serving a fair amount of requests (500 qps per 
replica), so I initially assumed it was under-provisioned to handle latency 
spikes. But I tripled the memory and it's still OOMing on occasion, even though 
RSS is 9% of allocated memory just seconds before.

Yunchi
On Jul 2, 2019, 12:00 AM -0400, Robert Engels , wrote:
> Does your process use mmap? Maybe you are leaking there, as this counts 
> against process memory size.
>
> On Jul 1, 2019, at 9:11 PM, Mighty Guava  wrote:
>
> > I don't understand. What would adding runtime.Goscheduled() do here? I 
> > don't have any explicit loops in this service.
> > On Jul 1, 2019, 9:11 PM -0400, Michael Jones , 
> > wrote:
> > > Does adding runtime.GoSched() calls make any difference?
> > >
> > > > On Mon, Jul 1, 2019 at 5:37 PM 'Yunchi Luo' via golang-nuts 
> > > >  wrote:
> > > > > Following that logic, a leak of TCP connections should manifest as a 
> > > > > file descriptor leak. We have the process_open_fds metric from 
> > > > > Prometheus, that is the number of open file descriptors as found in 
> > > > > /proc//fd. The number of descriptors overtime correlates well 
> > > > > with the amount of traffic, pretty cyclic. There doesn't appear to be 
> > > > > a leak.
> > > > >
> > > > > We don't do our own memory management and the binary is compiled with 
> > > > > CGO_ENABLED=0.
> > > > >
> > > > > I still think the issue I'm seeing should be GC (or heap) related, 
> > > > > given the explosion in mark & sweep time, HeapSys, HeapIdle, and 
> > > > > HeapReleased just before the process dies. But I'm lacking ideas on 
> > > > > how to track down the cause of the increase.
> > > > >
> > > > > > On Mon, Jul 1, 2019 at 5:39 PM Robert Engels 
> > > > > >  wrote:
> > > > > > > I think don't think you are going to find it in the 'heap', 
> > > > > > > rather it would be in native memory.
> > > > > > >
> > > > > > > I would use the monitor the /proc/[pid] for the process, and pay 
> > > > > > > attention to the 'fd','net' and 'statm' - if my theory is correct 
> > > > > > > you will see growth here long before the process is killed. Since 
> > > > > > > you are running under k8s and cgroups, you will need to do this 
> > > > > > > along side the Go process (unless you have root access to the 
> > > > > > > server).
> > > > > > >
> > > > > > > I 'think' depending on kernel version, that kernel memory used 
> > > > > > > goes against the process for OOM purposes, so this is a likely 
> > > > > > > candidate if pprof is showing nothing.
> > > > > > >
> > > > > > > Do you by chance do any of your own memory management (via 
> > > > > > > malloc/CGO)? If so, this is not going to show in pprof either.
> > > > > > > > -Original Message-
> > > > > > > > From: 'Yunchi Luo' via golang-nuts
> > > > > > > > Sent: Jul 1, 2019 4:26 PM
> > > > > > > > To: Robert Engels
> > > > > > > > Cc: golang-nuts@googlegroups.com, Alec Thomas
> > > > > > > > Subject: Re: [go-nuts] OOM occurring with a small heap
> > > > > > > >
> > > > > > > > I actually have a heap profile (pasted at the bottom) from 
> > > > > > > > about 1 second before the service died (the goroutine that is 
> > > > > > > > logging "[Memory]" triggers heap profiles once RSS > 100MB). I 
> > > > > > > > don't see TCP connections there. Maybe it's too few to be 
> > > > > > > > sampled. How would I verify your theory? That the service dies 
> > > > > > > > within 2 seconds after several hours makes it very hard to 
> > > > > > > > debug.
> > > > > > > >
> > > > > > > > The top thing in th

Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread Mighty Guava
Oh now that you mention it, I'm using grpc-gateway with the stdlib http server. 
Though all it really does is convert json <=> protobuf and is pretty mature so 
it's unlikely to be the culprit.

Running with race detector is a good idea! I'll try that tomorrow.

Thanks,
Yunchi Luo
On Jul 2, 2019, 12:25 AM -0400, robert engels , wrote:
> Are you using anything like fast-http, or such? Have you tried running the 
> server with the race-detector enabled? I could see a race condition causing a 
> rare failure that causes the code to go into a ‘spinning loop’ preventing GC 
> to work properly. It’s a guess but I would try that to rule out the 
> possibility.
>
> > On Jul 1, 2019, at 11:06 PM, Mighty Guava  wrote:
> >
> > I don't think GC is being blocked. GC ran multiple times during the time 
> > the service was exploding in memory.
> >
> > I'm not using mmap in the service. It's just a straightforward CRUD 
> > webserver backed by DynamoDB. The only thing non-standard I can think of is 
> > that it connects to its backing database over HTTP/1.1 (using AWS Go SDK) 
> > instead of a custom database protocol. It's serving a fair amount of 
> > requests (500 qps per replica), so I initially assumed it was 
> > under-provisioned to handle latency spikes. But I tripled the memory and 
> > it's still OOMing on occasion, even though RSS is 9% of allocated memory 
> > just seconds before.
> >
> > Yunchi
> > On Jul 2, 2019, 12:00 AM -0400, Robert Engels , 
> > wrote:
> > > Does your process use mmap? Maybe you are leaking there, as this counts 
> > > against process memory size.
> > >
> > > On Jul 1, 2019, at 9:11 PM, Mighty Guava  wrote:
> > >
> > > > I don't understand. What would adding runtime.Goscheduled() do here? I 
> > > > don't have any explicit loops in this service.
> > > > On Jul 1, 2019, 9:11 PM -0400, Michael Jones , 
> > > > wrote:
> > > > > Does adding runtime.GoSched() calls make any difference?
> > > > >
> > > > > > On Mon, Jul 1, 2019 at 5:37 PM 'Yunchi Luo' via golang-nuts 
> > > > > >  wrote:
> > > > > > > Following that logic, a leak of TCP connections should manifest 
> > > > > > > as a file descriptor leak. We have the process_open_fds metric 
> > > > > > > from Prometheus, that is the number of open file descriptors as 
> > > > > > > found in /proc//fd. The number of descriptors overtime 
> > > > > > > correlates well with the amount of traffic, pretty cyclic. There 
> > > > > > > doesn't appear to be a leak.
> > > > > > >
> > > > > > > We don't do our own memory management and the binary is compiled 
> > > > > > > with CGO_ENABLED=0.
> > > > > > >
> > > > > > > I still think the issue I'm seeing should be GC (or heap) 
> > > > > > > related, given the explosion in mark & sweep time, HeapSys, 
> > > > > > > HeapIdle, and HeapReleased just before the process dies. But I'm 
> > > > > > > lacking ideas on how to track down the cause of the increase.
> > > > > > >
> > > > > > > > On Mon, Jul 1, 2019 at 5:39 PM Robert Engels 
> > > > > > > >  wrote:
> > > > > > > > > I think don't think you are going to find it in the 'heap', 
> > > > > > > > > rather it would be in native memory.
> > > > > > > > >
> > > > > > > > > I would use the monitor the /proc/[pid] for the process, and 
> > > > > > > > > pay attention to the 'fd','net' and 'statm' - if my theory is 
> > > > > > > > > correct you will see growth here long before the process is 
> > > > > > > > > killed. Since you are running under k8s and cgroups, you will 
> > > > > > > > > need to do this along side the Go process (unless you have 
> > > > > > > > > root access to the server).
> > > > > > > > >
> > > > > > > > > I 'think' depending on kernel version, that kernel memory 
> > > > > > > > > used goes against the process for OOM purposes, so this is a 
> > > > > > > > > likely candidate if pprof is showing nothing.
> > > > > > > > >
> > > > > > > > > Do you by chance do any of your own memor

Re: [go-nuts] OOM occurring with a small heap

2019-07-02 Thread Michael Jones
(I don't suspect GC either; it was a rule-it-out suggestion)

On Mon, Jul 1, 2019 at 10:20 PM robert engels  wrote:

> One other thing to think about - 500 qps is a lot for a single server if
> things start blocking and you have a low memory cap. Imagine the scenario
> where the DynamoDB “locks up / heavy delays due to IO contention /
> something”, depending on how your server is written/configured it may still
> keep accepting inbound connections, each potentially causing a thread to be
> created (if they are doing blocking IO). If there is not a low cap on the
> number of inbound connections you could quickly run out of memory. You
> might want to see if there is a configuration option on grpc-gateway to
> limit the maximum number of simultaneous inbound connections, set that to a
> low number, and see if it still runs out of memory.
>
> This scenario is unlikely if you didn’t see the TCPConn instances in the
> heap dump.
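
For reference, a minimal sketch of capping simultaneous inbound connections at the listener, before the HTTP server ever sees them (assumes golang.org/x/net/netutil; the port and the 250 cap are placeholders, not the service's real configuration):

package main

import (
	"log"
	"net"
	"net/http"

	"golang.org/x/net/netutil"
)

func main() {
	ln, err := net.Listen("tcp", ":8080") // placeholder port
	if err != nil {
		log.Fatal(err)
	}
	// At most 250 connections are accepted at a time; further Accepts
	// block until one of them closes, so a stalled backend cannot fan
	// out into an unbounded number of in-flight requests.
	ln = netutil.LimitListener(ln, 250)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.Serve(ln, nil))
}

Connections beyond the cap simply wait in the kernel accept queue instead of each holding buffers and a handler goroutine.
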
>
> On Jul 1, 2019, at 11:28 PM, Mighty Guava  wrote:
>
> Oh now that you mention it, I'm using grpc-gateway with the stdlib http
> server. Though all it really does is convert json <=> protobuf and is
> pretty mature so it's unlikely to be the culprit.
>
> Running with race detector is a good idea! I'll try that tomorrow.
>
> Thanks,
> Yunchi Luo
> On Jul 2, 2019, 12:25 AM -0400, robert engels ,
> wrote:
>
> Are you using anything like fast-http, or such? Have you tried running the
> server with the race-detector enabled? I could see a race condition causing
> a rare failure that causes the code to go into a ‘spinning loop’ preventing
> GC to work properly. It’s a guess but I would try that to rule out the
> possibility.
>
> On Jul 1, 2019, at 11:06 PM, Mighty Guava  wrote:
>
> I don't think GC is being blocked. GC ran multiple times during the time
> the service was exploding in memory.
>
> I'm not using mmap in the service. It's just a straightforward CRUD
> webserver backed by DynamoDB. The only thing non-standard I can think of is
> that it connects to its backing database over HTTP/1.1 (using AWS Go SDK)
> instead of a custom database protocol. It's serving a fair amount of
> requests (500 qps per replica), so I initially assumed it was
> under-provisioned to handle latency spikes. But I tripled the memory and
> it's still OOMing on occasion, even though RSS is 9% of allocated memory
> just seconds before.
>
> Yunchi
> On Jul 2, 2019, 12:00 AM -0400, Robert Engels ,
> wrote:
>
> Does your process use mmap? Maybe you are leaking there, as this counts
> against process memory size.
>
> On Jul 1, 2019, at 9:11 PM, Mighty Guava  wrote:
>
> I don't understand. What would adding runtime.Gosched() do here? I
> don't have any explicit loops in this service.
> On Jul 1, 2019, 9:11 PM -0400, Michael Jones ,
> wrote:
>
> Does adding runtime.Gosched() calls make any difference?
>
> On Mon, Jul 1, 2019 at 5:37 PM 'Yunchi Luo' via golang-nuts <
> golang-nuts@googlegroups.com> wrote:
>
> Following that logic, a leak of TCP connections should manifest as a file
> descriptor leak. We have the process_open_fds metric from Prometheus, that
> is the number of open file descriptors as found in /proc//fd. The
> number of descriptors over time correlates well with the amount of traffic,
> pretty cyclic. There doesn't appear to be a leak.
>
> We don't do our own memory management and the binary is compiled with
> CGO_ENABLED=0.
>
> I still think the issue I'm seeing should be GC (or heap) related, given
> the explosion in mark & sweep time, HeapSys, HeapIdle, and HeapReleased
> just before the process dies. But I'm lacking ideas on how to track down
> the cause of the increase.
>
> On Mon, Jul 1, 2019 at 5:39 PM Robert Engels 
> wrote:
>
> I don't think you are going to find it in the 'heap', rather it
> would be in native memory.
>
> I would monitor the /proc/[pid] for the process, and pay attention
> to the 'fd','net' and 'statm' - if my theory is correct you will see growth
> here long before the process is killed. Since you are running under k8s and
> cgroups, you will need to do this along side the Go process (unless you
> have root access to the server).
>
> I 'think' depending on kernel version, that kernel memory used goes
> against the process for OOM purposes, so this is a likely candidate if
> pprof is showing nothing.
>
> Do you by chance do any of your own memory management (via malloc/CGO)? If
> so, this is not going to show in pprof either.
>
> -Original Message-
> From: 'Yunchi Luo' via golang-nuts
> Sent: Jul 1, 2019 4:26 PM
> To: Robert Engels
> Cc: golang-nuts@googlegroups.com, Alec Thomas
> Subject: Re: [go-nuts] OOM occurring wit

Re: [go-nuts] OOM occurring with a small heap

2019-07-01 Thread robert engels
nd pay 
>>>>>> attention to the 'fd','net' and 'statm' - if my theory is correct you 
>>>>>> will see growth here long before the process is killed. Since you are 
>>>>>> running under k8s and cgroups, you will need to do this along side the 
>>>>>> Go process (unless you have root access to the server).
>>>>>> 
>>>>>> I 'think' depending on kernel version, that kernel memory used goes 
>>>>>> against the process for OOM purposes, so this is a likely candidate if 
>>>>>> pprof is showing nothing.
>>>>>> 
>>>>>> Do you by chance do any of your own memory management (via malloc/CGO)? 
>>>>>> If so, this is not going to show in pprof either.
>>>>>> -Original Message-
>>>>>> From: 'Yunchi Luo' via golang-nuts 
>>>>>> Sent: Jul 1, 2019 4:26 PM
>>>>>> To: Robert Engels 
>>>>>> Cc: golang-nuts@googlegroups.com <mailto:golang-nuts@googlegroups.com>, 
>>>>>> Alec Thomas 
>>>>>> Subject: Re: [go-nuts] OOM occurring with a small heap
>>>>>> 
>>>>>> I actually have a heap profile (pasted at the bottom) from about 1 
>>>>>> second before the service died (the goroutine that is logging "[Memory]" 
>>>>>> triggers heap profiles once RSS > 100MB). I don't see TCP connections 
>>>>>> there. Maybe it's too few to be sampled. How would I verify your theory? 
>>>>>> That the service dies within 2 seconds after several hours makes it very 
>>>>>> hard to debug.
>>>>>> 
>>>>>> The top thing in the heap profile is from the reflect package. I 
>>>>>> initially found that suspect, but it turns out this comes from a use of 
>>>>>> httptrace.ClientTrace I had for counting new connections to DynamoDB.
>>>>>> 
>>>>>>  tracer := &httptrace.ClientTrace{
>>>>>>ConnectStart: func(_, _ string) {
>>>>>>  newConns.Inc()
>>>>>>},
>>>>>> }
>>>>>> 
>>>>>> `newConns` is just a prometheus counter. The `tracer` object itself is 
>>>>>> created once and re-used with every client request context. On request, 
>>>>>> `httptrace.WithClientTrace(ctx, tracer)` uses reflection to compose the 
>>>>>> trace functions under-the-hood and uses reflection to invoke it, hence 
>>>>>> the reflect.funcLayout and reflect.Value.call. Objects in `reflect` 
>>>>>> account for about 50% of heap in terms of size, and does seem to grow as 
>>>>>> the service is running out of memory, but that's only 12MB so I thought 
>>>>>> it was a red herring.
>>>>>> 
>>>>>> Heap profile:
>>>>>> Type: inuse_space
>>>>>> Time: Jun 30, 2019 at 4:46am (EDT)
>>>>>> Entering interactive mode (type "help" for commands, "o" for options)
>>>>>> (pprof) inuse_objects
>>>>>> (pprof) top
>>>>>> Showing nodes accounting for 414485, 100% of 414485 total
>>>>>> Showing top 10 nodes out of 81
>>>>>>   flat  flat%   sum%cum   cum%
>>>>>> 344074 83.01% 83.01% 344074 83.01%  reflect.funcLayout.func1
>>>>>>  32768  7.91% 90.92% 376842 90.92%  reflect.callReflect
>>>>>>  16384  3.95% 94.87%  16384  3.95%  
>>>>>> github.com/json-iterator/go.processTags 
>>>>>>  10923  2.64% 97.51%  10923  2.64%  context.WithValue
>>>>>>   8192  1.98% 99.48%   8192  1.98%  crypto/hmac.New
>>>>>>   1260   0.3% 99.79%   1260   0.3%  
>>>>>> github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).buildCanonicalHeaders
>>>>>>820   0.2%   100%820   0.2%  
>>>>>> github.com/stripe/veneur/tdigest.NewMerging 
>>>>>> 64 0.015%   100% 64 0.015%  reflect.addReflectOff
>>>>>>  0 0%   100%820   0.2%  
>>>&g

Re: [go-nuts] OOM occurring with a small heap

2019-07-01 Thread robert engels
Are you using anything like fast-http, or such? Have you tried running the 
server with the race-detector enabled? I could see a race condition causing a 
rare failure that causes the code to go into a ‘spinning loop’ preventing GC to 
work properly. It’s a guess but I would try that to rule out the possibility.
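
For reference, enabling the detector is only a build flag; a sketch, assuming the service builds from its module root (the binary name is a placeholder):

go build -race -o server-race .
./server-race

The instrumented binary uses noticeably more CPU and memory, so it is usually run on a single replica or in staging rather than across the whole fleet.
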

> On Jul 1, 2019, at 11:06 PM, Mighty Guava  wrote:
> 
> I don't think GC is being blocked. GC ran multiple times during the time the 
> service was exploding in memory. 
> 
> I'm not using mmap in the service. It's just a straightforward CRUD webserver 
> backed by DynamoDB. The only thing non-standard I can think of is that it 
> connects to its backing database over HTTP/1.1 (using AWS Go SDK) instead of 
> a custom database protocol. It's serving a fair amount of requests (500 qps 
> per replica), so I initially assumed it was under-provisioned to handle 
> latency spikes. But I tripled the memory and it's still OOMing on occasion, 
> even though RSS is 9% of allocated memory just seconds before. 
> 
> Yunchi
> On Jul 2, 2019, 12:00 AM -0400, Robert Engels , wrote:
>> Does your process use mmap? Maybe you are leaking there, as this counts 
>> against process memory size. 
>> 
>> On Jul 1, 2019, at 9:11 PM, Mighty Guava > <mailto:mightygu...@gmail.com>> wrote:
>> 
>>> I don't understand. What would adding runtime.Gosched() do here? I 
>>> don't have any explicit loops in this service.
>>> On Jul 1, 2019, 9:11 PM -0400, Michael Jones >> <mailto:michael.jo...@gmail.com>>, wrote:
>>>> Does adding runtime.Gosched() calls make any difference?
>>>> 
>>>> On Mon, Jul 1, 2019 at 5:37 PM 'Yunchi Luo' via golang-nuts 
>>>> mailto:golang-nuts@googlegroups.com>> wrote:
>>>> Following that logic, a leak of TCP connections should manifest as a file 
>>>> descriptor leak. We have the process_open_fds metric from Prometheus, that 
>>>> is the number of open file descriptors as found in /proc//fd. The 
>>>> number of descriptors over time correlates well with the amount of traffic, 
>>>> pretty cyclic. There doesn't appear to be a leak.
>>>> 
>>>> We don't do our own memory management and the binary is compiled with 
>>>> CGO_ENABLED=0.
>>>> 
>>>> I still think the issue I'm seeing should be GC (or heap) related, given 
>>>> the explosion in mark & sweep time, HeapSys, HeapIdle, and HeapReleased 
>>>> just before the process dies. But I'm lacking ideas on how to track down 
>>>> the cause of the increase.
>>>> 
>>>> On Mon, Jul 1, 2019 at 5:39 PM Robert Engels >>> <mailto:reng...@ix.netcom.com>> wrote:
>>>> I don't think you are going to find it in the 'heap', rather it 
>>>> would be in native memory.
>>>> 
>>>> I would monitor the /proc/[pid] for the process, and pay attention 
>>>> to the 'fd','net' and 'statm' - if my theory is correct you will see 
>>>> growth here long before the process is killed. Since you are running under 
>>>> k8s and cgroups, you will need to do this along side the Go process 
>>>> (unless you have root access to the server).
>>>> 
>>>> I 'think' depending on kernel version, that kernel memory used goes 
>>>> against the process for OOM purposes, so this is a likely candidate if 
>>>> pprof is showing nothing.
>>>> 
>>>> Do you by chance do any of your own memory management (via malloc/CGO)? If 
>>>> so, this is not going to show in pprof either.
>>>> -Original Message-
>>>> From: 'Yunchi Luo' via golang-nuts 
>>>> Sent: Jul 1, 2019 4:26 PM
>>>> To: Robert Engels 
>>>> Cc: golang-nuts@googlegroups.com <mailto:golang-nuts@googlegroups.com>, 
>>>> Alec Thomas 
>>>> Subject: Re: [go-nuts] OOM occurring with a small heap
>>>> 
>>>> I actually have a heap profile (pasted at the bottom) from about 1 second 
>>>> before the service died (the goroutine that is logging "[Memory]" triggers 
>>>> heap profiles once RSS > 100MB). I don't see TCP connections there. Maybe 
>>>> it's too few to be sampled. How would I verify your theory? That the 
>>>> service dies within 2 seconds after several hours makes it very hard to 
>>>> debug.
>>>> 
>>>> The top thing in the heap profile is from the reflect package. I initially 
>>>> found that suspect, but it turns out this comes from a use of 
>

Re: [go-nuts] OOM occurring with a small heap

2019-07-01 Thread Robert Engels
Does your process use mmap? Maybe you are leaking there, as this counts against 
process memory size. 

> On Jul 1, 2019, at 9:11 PM, Mighty Guava  wrote:
> 
> I don't understand. What would adding runtime.Gosched() do here? I don't 
> have any explicit loops in this service.
>> On Jul 1, 2019, 9:11 PM -0400, Michael Jones , 
>> wrote:
>> Does adding runtime.Gosched() calls make any difference?
>> 
>> On Mon, Jul 1, 2019 at 5:37 PM 'Yunchi Luo' via golang-nuts 
>>  wrote:
>> Following that logic, a leak of TCP connections should manifest as a file 
>> descriptor leak. We have the process_open_fds metric from Prometheus, that 
>> is the number of open file descriptors as found in /proc//fd. The 
>> number of descriptors over time correlates well with the amount of traffic, 
>> pretty cyclic. There doesn't appear to be a leak.
>> 
>> We don't do our own memory management and the binary is compiled with 
>> CGO_ENABLED=0.
>> 
>> I still think the issue I'm seeing should be GC (or heap) related, given the 
>> explosion in mark & sweep time, HeapSys, HeapIdle, and HeapReleased just 
>> before the process dies. But I'm lacking ideas on how to track down the 
>> cause of the increase.
>> 
>> On Mon, Jul 1, 2019 at 5:39 PM Robert Engels  wrote:
>> I don't think you are going to find it in the 'heap', rather it would 
>> be in native memory.
>> 
>> I would monitor the /proc/[pid] for the process, and pay attention 
>> to the 'fd','net' and 'statm' - if my theory is correct you will see growth 
>> here long before the process is killed. Since you are running under k8s and 
>> cgroups, you will need to do this along side the Go process (unless you have 
>> root access to the server).
>> 
>> I 'think' depending on kernel version, that kernel memory used goes against 
>> the process for OOM purposes, so this is a likely candidate if pprof is 
>> showing nothing.
>> 
>> Do you by chance do any of your own memory management (via malloc/CGO)? If 
>> so, this is not going to show in pprof either.
>> -Original Message-
>> From: 'Yunchi Luo' via golang-nuts 
>> Sent: Jul 1, 2019 4:26 PM
>> To: Robert Engels 
>> Cc: golang-nuts@googlegroups.com, Alec Thomas 
>> Subject: Re: [go-nuts] OOM occurring with a small heap
>> 
>> I actually have a heap profile (pasted at the bottom) from about 1 second 
>> before the service died (the goroutine that is logging "[Memory]" triggers 
>> heap profiles once RSS > 100MB). I don't see TCP connections there. Maybe 
>> it's too few to be sampled. How would I verify your theory? That the service 
>> dies within 2 seconds after several hours makes it very hard to debug.
>> 
>> The top thing in the heap profile is from the reflect package. I initially 
>> found that suspect, but it turns out this comes from a use of 
>> httptrace.ClientTrace I had for counting new connections to DynamoDB.
>> 
>>  tracer := &httptrace.ClientTrace{
>>ConnectStart: func(_, _ string) {
>>  newConns.Inc()
>>},
>> }
>> 
>> `newConns` is just a prometheus counter. The `tracer` object itself is 
>> created once and re-used with every client request context. On request, 
>> `httptrace.WithClientTrace(ctx, tracer)` uses reflection to compose the 
>> trace functions under-the-hood and uses reflection to invoke it, hence the 
>> reflect.funcLayout and reflect.Value.call. Objects in `reflect` account for 
>> about 50% of heap in terms of size, and does seem to grow as the service is 
>> running out of memory, but that's only 12MB so I thought it was a red 
>> herring.
>> 
>> Heap profile:
>> Type: inuse_space
>> Time: Jun 30, 2019 at 4:46am (EDT)
>> Entering interactive mode (type "help" for commands, "o" for options)
>> (pprof) inuse_objects
>> (pprof) top
>> Showing nodes accounting for 414485, 100% of 414485 total
>> Showing top 10 nodes out of 81
>>   flat  flat%   sum%cum   cum%
>> 344074 83.01% 83.01% 344074 83.01%  reflect.funcLayout.func1
>>  32768  7.91% 90.92% 376842 90.92%  reflect.callReflect
>>  16384  3.95% 94.87%  16384  3.95%  
>> github.com/json-iterator/go.processTags
>>  10923  2.64% 97.51%  10923  2.64%  context.WithValue
>>   8192  1.98% 99.48%   8192  1.98%  crypto/hmac.New
>>   1260   0.3% 99.79%   1260   0.3%  
>> github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).buildCanonicalHeaders
>>820   0.2%   100%820   0.2%  
>> github.com/stripe/v

Re: [go-nuts] OOM occurring with a small heap

2019-07-01 Thread Michael Jones
One theory you advanced is of a desperate garbage collector. If your code were
preventing it from running at all, then maybe that could be possible. Pursuing
that line of reasoning, I thought you could improve its chances with
runtime.Gosched() calls in your code.
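
For what it's worth, the only shape of code where that tends to matter is a long CPU-bound loop that never calls into the runtime; a sketch of the idea (the loop body here is hypothetical, not taken from the service):

package main

import (
	"fmt"
	"runtime"
)

// crunch stands in for a CPU-bound loop. On Go 1.12 a loop with no
// function calls cannot be preempted, so it can delay a GC that needs
// to stop the world; the periodic runtime.Gosched() gives the scheduler
// (and the GC) a chance to run.
func crunch(data []int) int {
	sum := 0
	for i, v := range data {
		sum += v * v
		if i%4096 == 0 {
			runtime.Gosched()
		}
	}
	return sum
}

func main() {
	data := make([]int, 1<<20)
	fmt.Println(crunch(data))
}
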

On Mon, Jul 1, 2019 at 7:17 PM Mighty Guava  wrote:

> I don't understand. What would adding runtime.Gosched() do here? I
> don't have any explicit loops in this service.
> On Jul 1, 2019, 9:11 PM -0400, Michael Jones ,
> wrote:
>
> Does adding runtime.Gosched() calls make any difference?
>
> On Mon, Jul 1, 2019 at 5:37 PM 'Yunchi Luo' via golang-nuts <
> golang-nuts@googlegroups.com> wrote:
>
> Following that logic, a leak of TCP connections should manifest as a file
> descriptor leak. We have the process_open_fds metric from Prometheus, that
> is the number of open file descriptors as found in /proc//fd. The
> number of descriptors over time correlates well with the amount of traffic,
> pretty cyclic. There doesn't appear to be a leak.
>
> We don't do our own memory management and the binary is compiled with
> CGO_ENABLED=0.
>
> I still think the issue I'm seeing should be GC (or heap) related, given
> the explosion in mark & sweep time, HeapSys, HeapIdle, and HeapReleased
> just before the process dies. But I'm lacking ideas on how to track down
> the cause of the increase.
>
> On Mon, Jul 1, 2019 at 5:39 PM Robert Engels 
> wrote:
>
> I don't think you are going to find it in the 'heap', rather it
> would be in native memory.
>
> I would monitor the /proc/[pid] for the process, and pay attention
> to the 'fd','net' and 'statm' - if my theory is correct you will see growth
> here long before the process is killed. Since you are running under k8s and
> cgroups, you will need to do this along side the Go process (unless you
> have root access to the server).
>
> I 'think' depending on kernel version, that kernel memory used goes
> against the process for OOM purposes, so this is a likely candidate if
> pprof is showing nothing.
>
> Do you by chance do any of your own memory management (via malloc/CGO)? If
> so, this is not going to show in pprof either.
>
> -Original Message-
> From: 'Yunchi Luo' via golang-nuts
> Sent: Jul 1, 2019 4:26 PM
> To: Robert Engels
> Cc: golang-nuts@googlegroups.com, Alec Thomas
> Subject: Re: [go-nuts] OOM occurring with a small heap
>
> I actually have a heap profile (pasted at the bottom) from about 1 second
> before the service died (the goroutine that is logging "[Memory]" triggers
> heap profiles once RSS > 100MB). I don't see TCP connections there. Maybe
> it's too few to be sampled. How would I verify your theory? That the
> service dies within 2 seconds after several hours makes it very hard to
> debug.
>
> The top thing in the heap profile is from the reflect package. I initially
> found that suspect, but it turns out this comes from a use of
> httptrace.ClientTrace I had for counting new connections to DynamoDB.
>
>  tracer := &httptrace.ClientTrace{
>ConnectStart: func(_, _ string) {
>  newConns.Inc()
>},
> }
>
> `newConns` is just a prometheus counter. The `tracer` object itself is
> created once and re-used with every client request context. On request,
> `httptrace.WithClientTrace(ctx, tracer)` uses reflection to compose the
> trace functions under-the-hood and uses reflection to invoke it, hence the
> reflect.funcLayout and reflect.Value.call. Objects in `reflect` account for
> about 50% of heap in terms of size, and does seem to grow as the service is
> running out of memory, but that's only 12MB so I thought it was a red
> herring.
>
> Heap profile:
> Type: inuse_space
> Time: Jun 30, 2019 at 4:46am (EDT)
> Entering interactive mode (type "help" for commands, "o" for options)
> (pprof) inuse_objects
> (pprof) top
> Showing nodes accounting for 414485, 100% of 414485 total
> Showing top 10 nodes out of 81
>   flat  flat%   sum%cum   cum%
> 344074 83.01% 83.01% 344074 83.01%  reflect.funcLayout.func1
>  32768  7.91% 90.92% 376842 90.92%  reflect.callReflect
>  16384  3.95% 94.87%  16384  3.95%
> github.com/json-iterator/go.processTags
>  10923  2.64% 97.51%  10923  2.64%  context.WithValue
>   8192  1.98% 99.48%   8192  1.98%  crypto/hmac.New
>   1260   0.3% 99.79%   1260   0.3%
> github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).buildCanonicalHeaders
>820   0.2%   100%820   0.2%
> github.com/stripe/veneur/tdigest.NewMerging
> 64 0.015%   100% 64 0.015%  reflect.addReflectOff
>  0 0%   100%820   0.2%
> git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindow

Re: [go-nuts] OOM occurring with a small heap

2019-07-01 Thread Michael Jones
Does adding runtime.Gosched() calls make any difference?

On Mon, Jul 1, 2019 at 5:37 PM 'Yunchi Luo' via golang-nuts <
golang-nuts@googlegroups.com> wrote:

> Following that logic, a leak of TCP connections should manifest as a file
> descriptor leak. We have the process_open_fds metric from Prometheus, that
> is the number of open file descriptors as found in /proc//fd. The
> number of descriptors over time correlates well with the amount of traffic,
> pretty cyclic. There doesn't appear to be a leak.
>
> We don't do our own memory management and the binary is compiled with
> CGO_ENABLED=0.
>
> I still think the issue I'm seeing should be GC (or heap) related, given
> the explosion in mark & sweep time, HeapSys, HeapIdle, and HeapReleased
> just before the process dies. But I'm lacking ideas on how to track down
> the cause of the increase.
>
> On Mon, Jul 1, 2019 at 5:39 PM Robert Engels 
> wrote:
>
>> I don't think you are going to find it in the 'heap', rather it
>> would be in native memory.
>>
>> I would monitor the /proc/[pid] for the process, and pay
>> attention to the 'fd','net' and 'statm' - if my theory is correct you will
>> see growth here long before the process is killed. Since you are running
>> under k8s and cgroups, you will need to do this along side the Go process
>> (unless you have root access to the server).
>>
>> I 'think' depending on kernel version, that kernel memory used goes
>> against the process for OOM purposes, so this is a likely candidate if
>> pprof is showing nothing.
>>
>> Do you by chance do any of your own memory management (via malloc/CGO)?
>> If so, this is not going to show in pprof either.
>>
>> -Original Message-
>> From: 'Yunchi Luo' via golang-nuts
>> Sent: Jul 1, 2019 4:26 PM
>> To: Robert Engels
>> Cc: golang-nuts@googlegroups.com, Alec Thomas
>> Subject: Re: [go-nuts] OOM occurring with a small heap
>>
>> I actually have a heap profile (pasted at the bottom) from about 1 second
>> before the service died (the goroutine that is logging "[Memory]" triggers
>> heap profiles once RSS > 100MB). I don't see TCP connections there. Maybe
>> it's too few to be sampled. How would I verify your theory? That the
>> service dies within 2 seconds after several hours makes it very hard to
>> debug.
>>
>> The top thing in the heap profile is from the reflect package. I
>> initially found that suspect, but it turns out this comes from a use of
>> httptrace.ClientTrace I had for counting new connections to DynamoDB.
>>
>>  tracer := &httptrace.ClientTrace{
>>ConnectStart: func(_, _ string) {
>>  newConns.Inc()
>>},
>> }
>>
>> `newConns` is just a prometheus counter. The `tracer` object itself is
>> created once and re-used with every client request context. On request,
>> `httptrace.WithClientTrace(ctx, tracer)` uses reflection to compose the
>> trace functions under-the-hood and uses reflection to invoke it, hence the
>> reflect.funcLayout and reflect.Value.call. Objects in `reflect` account for
>> about 50% of heap in terms of size, and does seem to grow as the service is
>> running out of memory, but that's only 12MB so I thought it was a red
>> herring.
>>
>> Heap profile:
>> Type: inuse_space
>> Time: Jun 30, 2019 at 4:46am (EDT)
>> Entering interactive mode (type "help" for commands, "o" for options)
>> (pprof) inuse_objects
>> (pprof) top
>> Showing nodes accounting for 414485, 100% of 414485 total
>> Showing top 10 nodes out of 81
>>   flat  flat%   sum%cum   cum%
>> 344074 83.01% 83.01% 344074 83.01%  reflect.funcLayout.func1
>>  32768  7.91% 90.92% 376842 90.92%  reflect.callReflect
>>  16384  3.95% 94.87%  16384  3.95%
>> github.com/json-iterator/go.processTags
>>  10923  2.64% 97.51%  10923  2.64%  context.WithValue
>>   8192  1.98% 99.48%   8192  1.98%  crypto/hmac.New
>>   1260   0.3% 99.79%   1260   0.3%
>> github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).buildCanonicalHeaders
>>820   0.2%   100%820   0.2%
>> github.com/stripe/veneur/tdigest.NewMerging
>> 64 0.015%   100% 64 0.015%  reflect.addReflectOff
>>  0 0%   100%820   0.2%
>> git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).Observe
>>  0 0%   100%820   0.2%
>> git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).openDigests
>> (pprof) cum
>> (pprof) top
>>

Re: [go-nuts] OOM occurring with a small heap

2019-07-01 Thread 'Yunchi Luo' via golang-nuts
Following that logic, a leak of TCP connections should manifest as a file
descriptor leak. We have the process_open_fds metric from Prometheus, that
is the number of open file descriptors as found in /proc//fd. The
number of descriptors over time correlates well with the amount of traffic,
pretty cyclic. There doesn't appear to be a leak.
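
The same count can also be sampled from inside the process as a quick sanity check; a rough sketch (Linux only, not the actual service code):

package main

import (
	"fmt"
	"io/ioutil"
)

// openFDs counts the entries in /proc/self/fd, which is what the
// process_open_fds metric reports.
func openFDs() (int, error) {
	entries, err := ioutil.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	n, err := openFDs()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("open file descriptors:", n)
}
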

We don't do our own memory management and the binary is compiled with
CGO_ENABLED=0.

I still think the issue I'm seeing should be GC (or heap) related, given
the explosion in mark & sweep time, HeapSys, HeapIdle, and HeapReleased
just before the process dies. But I'm lacking ideas on how to track down
the cause of the increase.

On Mon, Jul 1, 2019 at 5:39 PM Robert Engels  wrote:

> I don't think you are going to find it in the 'heap', rather it
> would be in native memory.
>
> I would monitor the /proc/[pid] for the process, and pay attention
> to the 'fd','net' and 'statm' - if my theory is correct you will see growth
> here long before the process is killed. Since you are running under k8s and
> cgroups, you will need to do this along side the Go process (unless you
> have root access to the server).
>
> I 'think' depending on kernel version, that kernel memory used goes
> against the process for OOM purposes, so this is a likely candidate if
> pprof is showing nothing.
>
> Do you by chance do any of your own memory management (via malloc/CGO)? If
> so, this is not going to show in pprof either.
>
> -Original Message-
> From: 'Yunchi Luo' via golang-nuts
> Sent: Jul 1, 2019 4:26 PM
> To: Robert Engels
> Cc: golang-nuts@googlegroups.com, Alec Thomas
> Subject: Re: [go-nuts] OOM occurring with a small heap
>
> I actually have a heap profile (pasted at the bottom) from about 1 second
> before the service died (the goroutine that is logging "[Memory]" triggers
> heap profiles once RSS > 100MB). I don't see TCP connections there. Maybe
> it's too few to be sampled. How would I verify your theory? That the
> service dies within 2 seconds after several hours makes it very hard to
> debug.
>
> The top thing in the heap profile is from the reflect package. I initially
> found that suspect, but it turns out this comes from a use of
> httptrace.ClientTrace I had for counting new connections to DynamoDB.
>
>  tracer := &httptrace.ClientTrace{
>ConnectStart: func(_, _ string) {
>  newConns.Inc()
>},
> }
>
> `newConns` is just a prometheus counter. The `tracer` object itself is
> created once and re-used with every client request context. On request,
> `httptrace.WithClientTrace(ctx, tracer)` uses reflection to compose the
> trace functions under-the-hood and uses reflection to invoke it, hence the
> reflect.funcLayout and reflect.Value.call. Objects in `reflect` account for
> about 50% of heap in terms of size, and does seem to grow as the service is
> running out of memory, but that's only 12MB so I thought it was a red
> herring.
>
> Heap profile:
> Type: inuse_space
> Time: Jun 30, 2019 at 4:46am (EDT)
> Entering interactive mode (type "help" for commands, "o" for options)
> (pprof) inuse_objects
> (pprof) top
> Showing nodes accounting for 414485, 100% of 414485 total
> Showing top 10 nodes out of 81
>   flat  flat%   sum%cum   cum%
> 344074 83.01% 83.01% 344074 83.01%  reflect.funcLayout.func1
>  32768  7.91% 90.92% 376842 90.92%  reflect.callReflect
>  16384  3.95% 94.87%  16384  3.95%
> github.com/json-iterator/go.processTags
>  10923  2.64% 97.51%  10923  2.64%  context.WithValue
>   8192  1.98% 99.48%   8192  1.98%  crypto/hmac.New
>   1260   0.3% 99.79%   1260   0.3%
> github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).buildCanonicalHeaders
>820   0.2%   100%820   0.2%
> github.com/stripe/veneur/tdigest.NewMerging
> 64 0.015%   100% 64 0.015%  reflect.addReflectOff
>  0 0%   100%820   0.2%
> git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).Observe
>  0 0%   100%820   0.2%
> git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).openDigests
> (pprof) cum
> (pprof) top
> Showing nodes accounting for 376842, 90.92% of 414485 total
> Showing top 10 nodes out of 81
>   flat  flat%   sum%cum   cum%
>  0 0% 0% 376842 90.92%
>  net/http/httptrace.(*ClientTrace).compose.func1
>  0 0% 0% 376842 90.92%  reflect.Value.Call
>  0 0% 0% 376842 90.92%  reflect.Value.call
>  32768  7.91%  7.91% 376842 90.92%  reflect.callReflect
>  0 0%  7.91% 376842 90.92%  reflect.makeFuncStub
> 344074 83.01% 90.92% 344074 83.01% 

Re: [go-nuts] OOM occurring with a small heap

2019-07-01 Thread Robert Engels
I don't think you are going to find it in the 'heap', rather it would be in native memory.

I would monitor the /proc/[pid] for the process, and pay attention to the 'fd', 'net' and 'statm' - if my theory is correct you will see growth here long before the process is killed. Since you are running under k8s and cgroups, you will need to do this alongside the Go process (unless you have root access to the server).

I 'think' depending on kernel version, that kernel memory used goes against the process for OOM purposes, so this is a likely candidate if pprof is showing nothing.

Do you by chance do any of your own memory management (via malloc/CGO)? If so, this is not going to show in pprof either.

-Original Message-
From: 'Yunchi Luo' via golang-nuts 
Sent: Jul 1, 2019 4:26 PM
To: Robert Engels 
Cc: golang-nuts@googlegroups.com, Alec Thomas 
Subject: Re: [go-nuts] OOM occurring with a small heap

I actually have a heap profile (pasted at the bottom) from about 1 second before the service died (the goroutine that is logging "[Memory]" triggers heap profiles once RSS > 100MB). I don't see TCP connections there. Maybe it's too few to be sampled. How would I verify your theory? That the service dies within 2 seconds after several hours makes it very hard to debug.

The top thing in the heap profile is from the reflect package. I initially found that suspect, but it turns out this comes from a use of httptrace.ClientTrace I had for counting new connections to DynamoDB.

tracer := &httptrace.ClientTrace{
  ConnectStart: func(_, _ string) {
    newConns.Inc()
  },
}

`newConns` is just a prometheus counter. The `tracer` object itself is created once and re-used with every client request context. On request, `httptrace.WithClientTrace(ctx, tracer)` uses reflection to compose the trace functions under-the-hood and uses reflection to invoke it, hence the reflect.funcLayout and reflect.Value.call. Objects in `reflect` account for about 50% of heap in terms of size, and does seem to grow as the service is running out of memory, but that's only 12MB so I thought it was a red herring.

Heap profile:
Type: inuse_space
Time: Jun 30, 2019 at 4:46am (EDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) inuse_objects
(pprof) top
Showing nodes accounting for 414485, 100% of 414485 total
Showing top 10 nodes out of 81
      flat  flat%   sum%        cum   cum%
    344074 83.01% 83.01%     344074 83.01%  reflect.funcLayout.func1
     32768  7.91% 90.92%     376842 90.92%  reflect.callReflect
     16384  3.95% 94.87%      16384  3.95%  github.com/json-iterator/go.processTags
     10923  2.64% 97.51%      10923  2.64%  context.WithValue
      8192  1.98% 99.48%       8192  1.98%  crypto/hmac.New
      1260   0.3% 99.79%       1260   0.3%  github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).buildCanonicalHeaders
       820   0.2%   100%        820   0.2%  github.com/stripe/veneur/tdigest.NewMerging
        64 0.015%   100%         64 0.015%  reflect.addReflectOff
         0     0%   100%        820   0.2%  git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).Observe
         0     0%   100%        820   0.2%  git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).openDigests
(pprof) cum
(pprof) top
Showing nodes accounting for 376842, 90.92% of 414485 total
Showing top 10 nodes out of 81
      flat  flat%   sum%        cum   cum%
         0     0%     0%     376842 90.92%  net/http/httptrace.(*ClientTrace).compose.func1
         0     0%     0%     376842 90.92%  reflect.Value.Call
         0     0%     0%     376842 90.92%  reflect.Value.call
     32768  7.91%  7.91%     376842 90.92%  reflect.callReflect
         0     0%  7.91%     376842 90.92%  reflect.makeFuncStub
    344074 83.01% 90.92%     344074 83.01%  reflect.funcLayout.func1
         0     0% 90.92%     344074 83.01%  sync.(*Pool).Get
         0     0% 90.92%      16448  3.97%  github.com/json-iterator/go._createDecoderOfType
         0     0% 90.92%      16448  3.97%  github.com/json-iterator/go.createDecoderOfType
         0     0% 90.92%      16448  3.97%  github.com/json-iterator/go.decoderOfStruct

On Mon, Jul 1, 2019 at 4:32 PM Robert Engels <reng...@ix.netcom.com> wrote:

A leak of the TCP connections (maybe not properly closed)? Each TCP connection will use kernel memory and process memory (local buffers), that won't be on the heap (the reference to the TCP connection will be in the Go heap, but is probably much smaller than the buffer allocation).

That would be my guess - but just a guess.

-Original Message-
From: 'Yunchi Luo' via golang-nuts 
Sent: Jul 1, 2019 2:14 PM
To: golang-nuts@googlegroups.com
Cc: Alec Thomas 
Subject: [go-nuts] OOM occurring with a small heap

Hello, I'd like to solicit some help with a weird GC issue we are seeing.

I'm trying to debug OOM on a service we are running in k8s. The service is just a CRUD server hitting a database (DynamoDB). Each replica serves about 300 qp

Re: [go-nuts] OOM occurring with a small heap

2019-07-01 Thread 'Yunchi Luo' via golang-nuts
I actually have a heap profile (pasted at the bottom) from about 1 second
before the service died (the goroutine that is logging "[Memory]" triggers
heap profiles once RSS > 100MB). I don't see TCP connections there. Maybe
it's too few to be sampled. How would I verify your theory? That the
service dies within 2 seconds after several hours makes it very hard to
debug.
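
For context, a sketch of roughly how such a trigger can be wired (RSS here is read from /proc/self/statm; the 100MB threshold and output path are placeholders, not the real service code):

package main

import (
	"io/ioutil"
	"os"
	"runtime/pprof"
	"strconv"
	"strings"
)

// rssBytes returns the resident set size from /proc/self/statm (Linux).
func rssBytes() int64 {
	b, err := ioutil.ReadFile("/proc/self/statm")
	if err != nil {
		return 0
	}
	f := strings.Fields(string(b))
	if len(f) < 2 {
		return 0
	}
	pages, _ := strconv.ParseInt(f[1], 10, 64)
	return pages * int64(os.Getpagesize())
}

// maybeDumpHeap writes a heap profile once RSS crosses the threshold,
// so there is something to look at if the process is about to be killed.
func maybeDumpHeap(threshold int64, path string) error {
	if rssBytes() < threshold {
		return nil
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return pprof.WriteHeapProfile(f)
}

func main() {
	_ = maybeDumpHeap(100<<20, "/tmp/heap.pprof") // placeholder threshold and path
}
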

The top thing in the heap profile is from the reflect package. I initially
found that suspect, but it turns out this comes from a use of
httptrace.ClientTrace I had for counting new connections to DynamoDB.

tracer := &httptrace.ClientTrace{
   ConnectStart: func(_, _ string) {
 newConns.Inc()
   },
}

`newConns` is just a prometheus counter. The `tracer` object itself is
created once and re-used with every client request context. On request,
`httptrace.WithClientTrace(ctx, tracer)` uses reflection to compose the
trace functions under-the-hood and uses reflection to invoke it, hence the
reflect.funcLayout and reflect.Value.call. Objects in `reflect` account for
about 50% of heap in terms of size, and does seem to grow as the service is
running out of memory, but that's only 12MB so I thought it was a red
herring.
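
For completeness, the wiring looks roughly like the following sketch (the metric name, client, and request are stand-ins, not the real service code):

package main

import (
	"net/http"
	"net/http/httptrace"

	"github.com/prometheus/client_golang/prometheus"
)

// newConns and its metric name are placeholders for the real counter.
var newConns = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "backend_new_connections_total",
	Help: "New TCP connections opened to the backend.",
})

// tracer is created once and shared, as described above.
var tracer = &httptrace.ClientTrace{
	ConnectStart: func(_, _ string) { newConns.Inc() },
}

// doTraced attaches the shared trace to a single outgoing request.
func doTraced(client *http.Client, req *http.Request) (*http.Response, error) {
	ctx := httptrace.WithClientTrace(req.Context(), tracer)
	return client.Do(req.WithContext(ctx))
}

func main() {
	req, _ := http.NewRequest("GET", "https://example.com", nil)
	if resp, err := doTraced(http.DefaultClient, req); err == nil {
		resp.Body.Close()
	}
}
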

Heap profile:
Type: inuse_space
Time: Jun 30, 2019 at 4:46am (EDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) inuse_objects
(pprof) top
Showing nodes accounting for 414485, 100% of 414485 total
Showing top 10 nodes out of 81
  flat  flat%   sum%cum   cum%
344074 83.01% 83.01% 344074 83.01%  reflect.funcLayout.func1
 32768  7.91% 90.92% 376842 90.92%  reflect.callReflect
 16384  3.95% 94.87%  16384  3.95%
github.com/json-iterator/go.processTags
 10923  2.64% 97.51%  10923  2.64%  context.WithValue
  8192  1.98% 99.48%   8192  1.98%  crypto/hmac.New
  1260   0.3% 99.79%   1260   0.3%
github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).buildCanonicalHeaders
   820   0.2%   100%820   0.2%
github.com/stripe/veneur/tdigest.NewMerging
64 0.015%   100% 64 0.015%  reflect.addReflectOff
 0 0%   100%820   0.2%
git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).Observe
 0 0%   100%820   0.2%
git.sqcorp.co/cash/digester/lib/pkg/histogram.(*SlidingWindowDigest).openDigests
(pprof) cum
(pprof) top
Showing nodes accounting for 376842, 90.92% of 414485 total
Showing top 10 nodes out of 81
  flat  flat%   sum%cum   cum%
 0 0% 0% 376842 90.92%
 net/http/httptrace.(*ClientTrace).compose.func1
 0 0% 0% 376842 90.92%  reflect.Value.Call
 0 0% 0% 376842 90.92%  reflect.Value.call
 32768  7.91%  7.91% 376842 90.92%  reflect.callReflect
 0 0%  7.91% 376842 90.92%  reflect.makeFuncStub
344074 83.01% 90.92% 344074 83.01%  reflect.funcLayout.func1
 0 0% 90.92% 344074 83.01%  sync.(*Pool).Get
 0 0% 90.92%  16448  3.97%
github.com/json-iterator/go._createDecoderOfType
 0 0% 90.92%  16448  3.97%
github.com/json-iterator/go.createDecoderOfType
 0 0% 90.92%  16448  3.97%
github.com/json-iterator/go.decoderOfStruct


On Mon, Jul 1, 2019 at 4:32 PM Robert Engels  wrote:

>
> A leak of the TCP connections (maybe not properly closed)? Each TCP
> connection will use kernel memory and process memory (local buffers), that
> won't be on the heap (the reference to the TCP connection will be in the Go
> heap, but is probably much smaller than the buffer allocation).
>
> That would be my guess - but just a guess.
>
> -Original Message-
> From: 'Yunchi Luo' via golang-nuts
> Sent: Jul 1, 2019 2:14 PM
> To: golang-nuts@googlegroups.com
> Cc: Alec Thomas
> Subject: [go-nuts] OOM occurring with a small heap
>
> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>
> I'm trying to debug OOM on a service we are running in k8s. The service is
> just a CRUD server hitting a database (DynamoDB). Each replica serves about
> 300 qps of traffic. There are no memory leaks. On occasion (seemingly
> correlated to small latency spikes on the backend), the service would OOM.
> This is surprising because it has a circuit breaker that drops requests
> after 200 concurrent connections and has never tripped, and goroutine
> profiles confirm that there are nowhere near 200 active goroutines.
>
> GC logs are pasted below. It's interlaced with dumps of runtime.MemStats
> (the RSS number is coming from /proc//stats). Go version is 1.12.5,
> running an Alpine 3.10 container in an Amazon kernel
> 4.14.123-111.109.amzn2.x86_64.
>
> The service happily serves requests using ~50MB of RSS for hours, until
> the last 2 seconds, where GC mark time starts to 2-4X per cycle 
> (43+489/158/0.60+0.021
> ms cpu => 43+48

Re: [go-nuts] OOM occurring with a small heap

2019-07-01 Thread Robert Engels
A leak of the TCP connections (maybe not properly closed)? Each TCP connection will use kernel memory and process memory (local buffers), that won't be on the heap (the reference to the TCP connection will be in the Go heap, but is probably much smaller than the buffer allocation).

That would be my guess - but just a guess.

-Original Message-
From: 'Yunchi Luo' via golang-nuts 
Sent: Jul 1, 2019 2:14 PM
To: golang-nuts@googlegroups.com
Cc: Alec Thomas 
Subject: [go-nuts] OOM occurring with a small heap

Hello, I'd like to solicit some help with a weird GC issue we are seeing.

I'm trying to debug OOM on a service we are running in k8s. The service is just a CRUD server hitting a database (DynamoDB). Each replica serves about 300 qps of traffic. There are no memory leaks. On occasion (seemingly correlated to small latency spikes on the backend), the service would OOM. This is surprising because it has a circuit breaker that drops requests after 200 concurrent connections and has never tripped, and goroutine profiles confirm that there are nowhere near 200 active goroutines.

GC logs are pasted below. It's interlaced with dumps of runtime.MemStats (the RSS number is coming from /proc//stats). Go version is 1.12.5, running an Alpine 3.10 container in an Amazon kernel 4.14.123-111.109.amzn2.x86_64.

The service happily serves requests using ~50MB of RSS for hours, until the last 2 seconds, when GC mark time starts to grow 2-4X per cycle (43+489/158/0.60+0.021 ms cpu => 43+489/158/0.60+0.021 ms cpu), and RSS and Sys blow up. It’s also interesting that in the last log line: `Sys=995MB RSS=861MB HeapSys=199MB`. If I’m reading this correctly, there’s at least `662MB` of memory in RSS that is not assigned to the heap. Though this might be due to the change in 1.12 to use MADV_FREE, so the pages are freeable but not yet reclaimed by the kernel.

I don’t understand how heap can be so small across gc cycles (28->42->30MB on the last line means heap doesn't grow past 42MB?), yet RSS keeps growing. I'm assuming the increased RSS is causing the kernel to OOM the service, but that should only happen if the RSS is not freeable as marked by MADV_FREE. There doesn't seem to be any indication of that from the GC logs. I guess this all comes down to me not having a good understanding of how the GC algorithm works and how to read these logs. I'd really appreciate it if anyone can explain what's happening and why.

gc 41833 @19135.227s 0%: 0.019+2.3+0.005 ms clock, 0.079+0.29/2.2/5.6+0.020 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:04.886 [Memory]: Alloc=7MB TotalAlloc=230172MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
gc 41834 @19135.869s 0%: 0.005+2.9+0.003 ms clock, 0.023+0.32/2.5/6.6+0.012 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:05.886 [Memory]: Alloc=9MB TotalAlloc=230179MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
gc 41835 @19136.704s 0%: 0.038+2.1+0.004 ms clock, 0.15+0.35/2.1/5.3+0.016 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:06.886 [Memory]: Alloc=9MB TotalAlloc=230184MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
gc 41836 @19137.611s 0%: 0.009+2.1+0.003 ms clock, 0.036+0.39/2.0/5.7+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:07.887 [Memory]: Alloc=10MB TotalAlloc=230190MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
gc 41837 @19138.444s 0%: 0.008+2.1+0.004 ms clock, 0.035+0.51/2.1/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:08.887 [Memory]: Alloc=10MB TotalAlloc=230195MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
gc 41838 @19139.474s 0%: 0.005+2.6+0.003 ms clock, 0.023+0.37/2.5/4.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
gc 41839 @19140.173s 0%: 0.011+2.4+0.003 ms clock, 0.046+0.20/2.3/5.8+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:09.887 [Memory]: Alloc=7MB TotalAlloc=230202MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
gc 41840 @19140.831s 0%: 0.082+2.1+0.003 ms clock, 0.32+0.64/2.1/5.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:10.887 [Memory]: Alloc=9MB TotalAlloc=230209MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
gc 41841 @19141.655s 0%: 0.014+2.1+0.003 ms clock, 0.056+0.28/2.0/5.7+0.013 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
gc 41842 @19142.316s 0%: 0.006+2.7+0.003 ms clock, 0.027+0.29/2.6/6.2+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:11.888 [Memory]: Alloc=6MB TotalAlloc=230216MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
gc 41843 @19142.942s 0%: 0.010+2.1+0.005 ms clock, 0.040+0.29/2.0/5.7+0.023 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:12.888 [Memory]: Alloc=9MB TotalAlloc=230223MB Sys=69MB RSS=51MB HeapSys=62MB 

[go-nuts] OOM occurring with a small heap

2019-07-01 Thread 'Yunchi Luo' via golang-nuts
Hello, I'd like to solicit some help with a weird GC issue we are seeing.

I'm trying to debug OOM on a service we are running in k8s. The service is
just a CRUD server hitting a database (DynamoDB). Each replica serves about
300 qps of traffic. There are no memory leaks. On occasion (seemingly
correlated to small latency spikes on the backend), the service would OOM.
This is surprising because it has a circuit breaker that drops requests
after 200 concurrent connections and has never tripped, and goroutine
profiles confirm that there are nowhere near 200 active goroutines.
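
For reference, such a limiter is typically just a semaphore around the handler; a sketch of the general shape (the limit and error response are placeholders, not the actual implementation):

package main

import "net/http"

// limitConcurrency rejects requests outright once `limit` handlers are
// already in flight, instead of letting them queue up and hold memory.
func limitConcurrency(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "too many concurrent requests", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", limitConcurrency(200, h))
}
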

GC logs are pasted below. It's interlaced with dumps of runtime.MemStats
(the RSS number is coming from /proc//stats). Go version is 1.12.5,
running an Alpine 3.10 container in an Amazon kernel
4.14.123-111.109.amzn2.x86_64.
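
The [Memory] lines come from periodically reading runtime.MemStats; roughly this shape (a sketch: the interval is arbitrary, and RSS, which is read separately from /proc, is omitted here):

package main

import (
	"log"
	"runtime"
	"time"
)

// logMemStats emits the [Memory] lines that are interleaved with the
// gctrace output below.
func logMemStats() {
	var m runtime.MemStats
	for range time.Tick(time.Second) {
		runtime.ReadMemStats(&m)
		log.Printf("[Memory]: Alloc=%dMB TotalAlloc=%dMB Sys=%dMB HeapSys=%dMB HeapIdle=%dMB HeapInUse=%dMB HeapReleased=%dMB",
			m.Alloc>>20, m.TotalAlloc>>20, m.Sys>>20,
			m.HeapSys>>20, m.HeapIdle>>20, m.HeapInuse>>20, m.HeapReleased>>20)
	}
}

func main() {
	go logMemStats()
	select {} // stand-in for the real server
}
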

The service happily serves requests using ~50MB of RSS for hours, until the
last 2 seconds, when GC mark time starts to grow 2-4X per cycle
(43+489/158/0.60+0.021
ms cpu => 43+489/158/0.60+0.021 ms cpu), and RSS and Sys blow up. It’s also
interesting that in the last log line: `Sys=995MB RSS=861MB HeapSys=199MB`.
If I’m reading this correctly, there’s at least `662MB` of memory in RSS
that is not assigned to the heap. Though this might be due to the change in
1.12 to use MADV_FREE, so the pages are freeable but not yet reclaimed by the
kernel.
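
One way to take MADV_FREE accounting out of the picture (assuming Go 1.12+ on Linux; the binary name is a placeholder) is to run a replica with the runtime switched back to MADV_DONTNEED, so freed pages leave RSS immediately:

GODEBUG=madvdontneed=1 ./server

If RSS still climbs toward the limit before the OOM kill, the growth is memory the runtime actually retains rather than lazily reclaimed pages.
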

I don’t understand how heap can be so small across gc cycles (28->42->30MB
on the last line means heap doesn't grow past 42MB?), yet RSS keeps
growing. I'm assuming the increased RSS is causing the kernel to OOM the
service, but that should only happen if the RSS is not freeable as marked
by MADV_FREE. There doesn't seem to be any indication of that from the GC
logs. I guess this all comes down to me not having a good understanding of
how the GC algorithm works and how to read these logs. I'd really
appreciate it if anyone can explain what's happening and why.

gc 41833 @19135.227s 0%: 0.019+2.3+0.005 ms clock, 0.079+0.29/2.2/5.6+0.020
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:04.886 [Memory]: Alloc=7MB TotalAlloc=230172MB
Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
gc 41834 @19135.869s 0%: 0.005+2.9+0.003 ms clock, 0.023+0.32/2.5/6.6+0.012
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:05.886 [Memory]: Alloc=9MB TotalAlloc=230179MB
Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
gc 41835 @19136.704s 0%: 0.038+2.1+0.004 ms clock, 0.15+0.35/2.1/5.3+0.016
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:06.886 [Memory]: Alloc=9MB TotalAlloc=230184MB
Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
gc 41836 @19137.611s 0%: 0.009+2.1+0.003 ms clock, 0.036+0.39/2.0/5.7+0.015
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:07.887 [Memory]: Alloc=10MB TotalAlloc=230190MB
Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
gc 41837 @19138.444s 0%: 0.008+2.1+0.004 ms clock, 0.035+0.51/2.1/5.7+0.017
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:08.887 [Memory]: Alloc=10MB TotalAlloc=230195MB
Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
gc 41838 @19139.474s 0%: 0.005+2.6+0.003 ms clock, 0.023+0.37/2.5/4.3+0.014
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
gc 41839 @19140.173s 0%: 0.011+2.4+0.003 ms clock, 0.046+0.20/2.3/5.8+0.015
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:09.887 [Memory]: Alloc=7MB TotalAlloc=230202MB
Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
gc 41840 @19140.831s 0%: 0.082+2.1+0.003 ms clock, 0.32+0.64/2.1/5.3+0.014
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:10.887 [Memory]: Alloc=9MB TotalAlloc=230209MB
Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
gc 41841 @19141.655s 0%: 0.014+2.1+0.003 ms clock, 0.056+0.28/2.0/5.7+0.013
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
gc 41842 @19142.316s 0%: 0.006+2.7+0.003 ms clock, 0.027+0.29/2.6/6.2+0.014
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:11.888 [Memory]: Alloc=6MB TotalAlloc=230216MB
Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
gc 41843 @19142.942s 0%: 0.010+2.1+0.005 ms clock, 0.040+0.29/2.0/5.7+0.023
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:12.888 [Memory]: Alloc=9MB TotalAlloc=230223MB
Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
gc 41844 @19143.724s 0%: 0.008+2.4+0.004 ms clock, 0.035+0.38/2.0/5.7+0.017
ms cpu, 11->11->5 MB, 12 MB goal, 4 P
gc 41845 @19144.380s 0%: 10+9.3+0.044 ms clock, 43+6.1/9.2/4.4+0.17 ms cpu,
11->11->6 MB, 12 MB goal, 4 P
INFO 2019-06-30T08:46:13.901 [Memory]: Alloc=6MB TotalAlloc=230230MB
Sys=136MB RSS=98MB HeapSys=94MB HeapIdle=83MB HeapInUse=11MB
HeapReleased=35MB
gc 41846 @19144.447s 0%: 0.008+26+0.005 ms clock, 0.033+0.46/7.8/26+0.020
ms cpu, 11->12->9 MB, 12 MB goal, 4 P
gc 41847 @19144.672s 0%: