Switching Go versions seems like a stab in the dark. If the OOM symptom does 
show up, you have simply wasted time. If it doesn't show up, you still don't 
know whether the bug exists and is simply hiding. Even if you think the bug is 
in Go's code generation (or the GC) and not in your code, there is nothing the 
Go developers can do without a concrete test. So then you have to keep changing 
things until the symptom stays away for a long time and cross your fingers that 
the bug went away....

If the symptom disappears when you take out httptrace, you have narrowed the 
hypothesis to httptrace but you haven't proven it. You still don't know whether 
removing httptrace merely changed the timing so that the bug is now hiding 
somewhere else and will show up again in the future.

This is why one should change as little as possible outside of what is needed 
for testing a specific hypothesis and why you should have a test that tickles 
the bug. By "proving" I mean you must find a specific assertion that is 
violated.

I'd instrument the code in question and somehow capture the history of the 
last N seconds just before the OOM. Capturing a heap profile is just one way to 
look at what your code does; you can find other views and log them. For 
instance, the extra memory use is a symptom just like the OOM itself; what you 
may really want to check is whether your data structures are consistent. The 
actual inconsistency may have occurred long before the OOM crash.
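
As a minimal sketch of what such instrumentation could look like (the event 
ring and the memory threshold here are made up for illustration, not taken from 
the service in question):

package instrument

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// recentEvents keeps a fixed-size ring of the most recent application events
// so the history leading up to a blow-up can be dumped before the kernel
// kills the process.
type recentEvents struct {
	mu   sync.Mutex
	buf  []string
	next int
}

func newRecentEvents(n int) *recentEvents {
	return &recentEvents{buf: make([]string, n)}
}

// Add records one event with a timestamp, overwriting the oldest entry.
func (r *recentEvents) Add(format string, args ...interface{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.buf[r.next%len(r.buf)] = time.Now().Format(time.RFC3339Nano) +
		" " + fmt.Sprintf(format, args...)
	r.next++
}

// DumpIfBloated prints the ring, oldest first, once the heap memory the
// runtime is still holding (HeapSys - HeapReleased) crosses an arbitrary
// threshold. Call it from a ticker goroutine next to the usual MemStats log.
func (r *recentEvents) DumpIfBloated(limit uint64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if m.HeapSys-m.HeapReleased < limit {
		return
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	for i := 0; i < len(r.buf); i++ {
		if e := r.buf[(r.next+i)%len(r.buf)]; e != "" {
			fmt.Println(e)
		}
	}
}

The point is not this particular structure but that the dump captures what the 
service was doing just before the blow-up, not only how much memory it used.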

Note that the GC logs let you see some aspects of the GC's behavior better, but 
if you don't understand them well enough they may seem mysterious (compared to 
your mental model) and later, when you run out of other hypotheses, they may 
even start to look suspicious, so I'd be careful about staring at such things :-)

> On Jul 2, 2019, at 11:16 AM, 'Yunchi Luo' via golang-nuts 
> <golang-nuts@googlegroups.com> wrote:
> 
> I removed the httptrace call yesterday and there have been no OOMs yet. Going 
> to let it bake for another day. If OOMs show up again, I'll try reverting to 
> an older Go version tomorrow. Otherwise I'll point my finger at httptrace I 
> guess.
> 
> On Tue, Jul 2, 2019 at 2:15 PM Yunchi Luo <yunchi...@squareup.com> wrote:
> I did try to do that! I have 3 heap profiles captured from the ~3 seconds 
> before the crash. The only thing particularly suspicious is the httptrace call 
> I mentioned earlier in the thread.
> 
> Diffing 1 to 2
> (pprof) cum
> (pprof) top 50
> Showing nodes accounting for 4604.15kB, 81.69% of 5636.17kB total
>       flat  flat%   sum%        cum   cum%
>          0     0%     0%  5120.16kB 90.84%  net/http/httptrace.(*ClientTrace).compose.func1
>          0     0%     0%  5120.16kB 90.84%  reflect.Value.Call
>          0     0%     0%  5120.16kB 90.84%  reflect.Value.call
>          0     0%     0%  5120.16kB 90.84%  reflect.callReflect
>  5120.16kB 90.84% 90.84%  5120.16kB 90.84%  reflect.funcLayout.func1
>          0     0% 90.84%  5120.16kB 90.84%  reflect.makeFuncStub
>          0     0% 90.84%  4604.15kB 81.69%  sync.(*Pool).Get
>          0     0% 90.84%  -516.01kB  9.16%  io.Copy
>          0     0% 90.84%  -516.01kB  9.16%  io.copyBuffer
>          0     0% 90.84%  -516.01kB  9.16%  io/ioutil.devNull.ReadFrom
>  -516.01kB  9.16% 81.69%  -516.01kB  9.16%  io/ioutil.glob..func1
> 
> Diff 2 to 3
> (pprof) top 50
> Showing nodes accounting for 7680.44kB, 100% of 7680.44kB total
>       flat  flat%   sum%        cum   cum%
>          0     0%     0%  6144.18kB 80.00%  net/http/httptrace.(*ClientTrace).compose.func1
>          0     0%     0%  6144.18kB 80.00%  reflect.Value.Call
>          0     0%     0%  6144.18kB 80.00%  reflect.Value.call
>   512.01kB  6.67%  6.67%  6144.18kB 80.00%  reflect.callReflect
>          0     0%  6.67%  6144.18kB 80.00%  reflect.makeFuncStub
>  5632.17kB 73.33% 80.00%  5632.17kB 73.33%  reflect.funcLayout.func1
>          0     0% 80.00%  5632.17kB 73.33%  sync.(*Pool).Get
>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*HandlerList).Run
>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*Request).Send
>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/request.(*Request).Sign
>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.(*signingCtx).build
>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequest
>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.SignSDKRequestWithCurrentTime
>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/aws/signer/v4.Signer.signWithBody
>          0     0% 80.00%  1024.23kB 13.34%  github.com/aws/aws-sdk-go/service/dynamodb.(*DynamoDB).GetItemWithContext
> 
> On Tue, Jul 2, 2019 at 2:08 PM andrey mirtchovski <mirtchov...@gmail.com 
> <mailto:mirtchov...@gmail.com>> wrote:
> What I have found useful in the past is pprof's ability to diff profiles. 
> That means that if you capture heap profiles at regular intervals you can see 
> a much smaller subset of changes and compare allocation patterns. 
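> 
> A rough sketch of that approach (the file names, interval, and lack of error 
> handling are placeholders; the resulting snapshots can then be diffed with 
> `go tool pprof -base heap-000.pb.gz heap-001.pb.gz`):
> 
> package main
> 
> import (
> 	"fmt"
> 	"os"
> 	"runtime"
> 	"runtime/pprof"
> 	"time"
> )
> 
> // captureHeapProfiles writes a heap profile snapshot every interval so that
> // consecutive snapshots can be diffed after a crash.
> func captureHeapProfiles(interval time.Duration) {
> 	for i := 0; ; i++ {
> 		time.Sleep(interval)
> 		runtime.GC() // optional: makes the live-heap numbers current
> 		f, err := os.Create(fmt.Sprintf("heap-%03d.pb.gz", i))
> 		if err != nil {
> 			continue
> 		}
> 		_ = pprof.WriteHeapProfile(f)
> 		f.Close()
> 	}
> }
> 
> func main() {
> 	go captureHeapProfiles(time.Second)
> 	select {} // stand-in for the real server
> }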
> 
> On Tue, Jul 2, 2019, 10:53 AM 'Yunchi Luo' via golang-nuts 
> <golang-nuts@googlegroups.com> wrote:
> I'm not so much pointing my finger at GC as I am hoping the GC logs could help 
> tell the story, and that someone with a strong understanding of GC in Go could 
> weigh in here. In the last 4 seconds before the OOM, "TotalAlloc" increased by 
> only 80MB, yet "HeapIdle" increased from 50MB to 240MB and RSS increased by 
> 810MB. The numbers don't add up for me. A running sum of 80MB of heap objects 
> was allocated in the time RSS increased by 10X that. If GC were completely 
> off, I still wouldn't expect this to happen, which makes me want to rule out 
> GC being blocked as the problem. Maybe there was runaway heap reservation 
> because something in my code caused a ton of fragmentation? Is that sane? The 
> heap profile lacking clues is also strange.
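> 
> (For anyone following along, the fields in those INFO lines come straight from 
> runtime.MemStats, roughly as in the sketch below; the field names are the real 
> MemStats ones, the log format is just ours.)
> 
> package main
> 
> import (
> 	"log"
> 	"runtime"
> )
> 
> func logMemStats() {
> 	var m runtime.MemStats
> 	runtime.ReadMemStats(&m)
> 	// HeapIdle - HeapReleased is memory the runtime still holds but could
> 	// return to the OS; with MADV_FREE (Go 1.12+) the kernel may leave those
> 	// pages counted in RSS until it is under memory pressure.
> 	log.Printf("HeapInUse=%dMB HeapIdle=%dMB HeapReleased=%dMB Sys=%dMB",
> 		m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20, m.Sys>>20)
> }
> 
> func main() { logMemStats() }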
> 
> Regarding the possibility of a race, I forgot I do have goroutine profiles 
> captured along with the heap profiles at the time memory exploded. There are 
> only 10 goroutines running on the serving path, which rules out too many 
> concurrent requests being served (please correct me if I'm wrong). Those fan 
> out to 13 goroutines talking to the db, all of which are in 
> http.Transport.roundTrip (which is blocked on runtime.gopark so I assume they 
> are waiting on the TCP connection). All other goroutines that don't originate 
> in the stdlib are also blocked on `select` or sleeping. Our CI does run with 
> go test -race, but I'll try doing some load testing with a 
> race-detector-enabled binary.
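> 
> In case it's useful to anyone, the goroutine dumps can be grabbed on demand 
> with something like the following (the signal trigger is just one hypothetical 
> way to do it):
> 
> package main
> 
> import (
> 	"os"
> 	"os/signal"
> 	"runtime/pprof"
> 	"syscall"
> )
> 
> // dumpGoroutinesOnSignal writes a full goroutine stack dump to stderr whenever
> // the process receives SIGUSR1, which makes it easy to check how many
> // goroutines are really on the serving path when memory spikes.
> func dumpGoroutinesOnSignal() {
> 	ch := make(chan os.Signal, 1)
> 	signal.Notify(ch, syscall.SIGUSR1)
> 	go func() {
> 		for range ch {
> 			_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
> 		}
> 	}()
> }
> 
> func main() {
> 	dumpGoroutinesOnSignal()
> 	select {} // stand-in for the real server
> }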
> 
> Bakul, that is sound advice. I've actually been debugging this on and off for 
> a couple of months now, with the help of several people, a few of whom have 
> peer-reviewed the code. I agree it's most likely some runaway code in my own 
> logic, but we haven't been able to pinpoint the cause and I've run out of 
> hypotheses to test at the moment, which is why I've started asking on 
> go-nuts@. The circuit breaker code was one of the first things I checked; it 
> has been unit tested and verified working under load tests. Now that you 
> mention it, I actually did uncover a Go stdlib bug in http2 while doing the 
> load tests... but that's unrelated.
> 
> 
> On Tue, Jul 2, 2019 at 2:24 AM Bakul Shah <ba...@bitblocks.com> wrote:
> Before assuming it is the GC or something system-related, you may wish to 
> verify it is *not your own logic*. Larger RSS could also be due to your own 
> logic touching more and more memory through some runaway effect. The 
> probability that this has to do with the GC is very low given the very 
> widespread use of Go, and the probability of a bug in new code is much higher 
> given that it is exercised on a much, much smaller scale.
> 
> This has the "smell" of a concurrency bug. If I were you I'd test the code 
> for any races, I'd review the code thoroughly with someone who doesn't know 
> the code so that I'm forced to explain it, and I'd add plenty of assertions. 
> I'd probably first look at the circuit breaker code -- things like how does 
> it know how many concurrent connections exist?
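> 
> By an assertion I mean something concrete like this (a toy breaker made up to 
> illustrate the idea, not your actual code):
> 
> package breaker
> 
> import (
> 	"errors"
> 	"fmt"
> 	"sync/atomic"
> )
> 
> // Breaker is a toy circuit breaker used only to show the kind of invariant
> // worth asserting: the in-flight count must stay within [0, limit] at all
> // times. If either check ever fires, you have a concrete, reproducible
> // violation instead of a mysterious OOM hours later.
> type Breaker struct {
> 	limit    int64
> 	inflight int64
> }
> 
> var ErrTooBusy = errors.New("breaker: too many concurrent requests")
> 
> func NewBreaker(limit int64) *Breaker { return &Breaker{limit: limit} }
> 
> func (b *Breaker) Acquire() error {
> 	if n := atomic.AddInt64(&b.inflight, 1); n > b.limit {
> 		atomic.AddInt64(&b.inflight, -1)
> 		return ErrTooBusy
> 	}
> 	return nil
> }
> 
> func (b *Breaker) Release() {
> 	if n := atomic.AddInt64(&b.inflight, -1); n < 0 {
> 		panic(fmt.Sprintf("breaker invariant violated: inflight=%d", n))
> 	}
> }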
> 
> In general, for any hypothesis you come up with, you should have a test that 
> *catches* the bug given that hypothesis. Elusive bugs tend to become more 
> elusive as you hunt them, because along the way you may fix other problems you 
> discover.
> 
> I even suspect you're looking at GC logs a bit too early. Instrument your own 
> code and look at what patterns emerge. [Not to mention any time you spend on 
> understanding your code will help improve your service; but better 
> understanding of and debugging the GC won't necessarily help you!]
> 
>> On Jul 1, 2019, at 12:14 PM, 'Yunchi Luo' via golang-nuts 
>> <golang-nuts@googlegroups.com> wrote:
>> 
>> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>> 
>> I'm trying to debug an OOM on a service we are running in k8s. The service is 
>> just a CRUD server hitting a database (DynamoDB). Each replica serves about 
>> 300 qps of traffic. There are no memory leaks. On occasion (seemingly 
>> correlated with small latency spikes on the backend), the service OOMs. This 
>> is surprising because it has a circuit breaker that drops requests beyond 200 
>> concurrent connections and has never tripped, and goroutine profiles confirm 
>> that there are nowhere near 200 active goroutines.
>> 
>> GC logs are pasted below, interlaced with dumps of runtime.MemStats (the RSS 
>> number comes from /proc/<pid>/stat). Go version is 1.12.5, running in an 
>> Alpine 3.10 container on an Amazon kernel 4.14.123-111.109.amzn2.x86_64.
>> 
>> The service happily serves requests using ~50MB of RSS for hours, until the 
>> last 2 seconds, where GC mark & sweep time starts growing 2-4X per cycle 
>> (43+489/158/0.60+0.021 ms cpu => 86+990/401/0+0.14 ms cpu), and RSS and Sys 
>> blow up. It's also interesting that the last log line shows `Sys=995MB 
>> RSS=861MB HeapSys=199MB`. If I'm reading this correctly, there's at least 
>> 662MB of memory in RSS that is not assigned to the heap. Though this might be 
>> due to the change in Go 1.12 to use MADV_FREE, so those pages are freeable 
>> but not yet reclaimed by the kernel.
>> 
>> I don't understand how the heap can stay so small across GC cycles 
>> (28->42->30 MB on the last line means the heap doesn't grow past 42MB?), yet 
>> RSS keeps growing. I'm assuming the increased RSS is causing the kernel to 
>> OOM-kill the service, but that should only happen if the RSS is not freeable 
>> as marked by MADV_FREE, and there doesn't seem to be any indication of that 
>> in the GC logs. I guess this all comes down to me not having a good 
>> understanding of how the GC algorithm works and how to read these logs. I'd 
>> really appreciate it if anyone could explain what's happening and why.
>> 
>> gc 41833 @19135.227s 0%: 0.019+2.3+0.005 ms clock, 0.079+0.29/2.2/5.6+0.020 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:04.886 [Memory]: Alloc=7MB TotalAlloc=230172MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
>> gc 41834 @19135.869s 0%: 0.005+2.9+0.003 ms clock, 0.023+0.32/2.5/6.6+0.012 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:05.886 [Memory]: Alloc=9MB TotalAlloc=230179MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
>> gc 41835 @19136.704s 0%: 0.038+2.1+0.004 ms clock, 0.15+0.35/2.1/5.3+0.016 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:06.886 [Memory]: Alloc=9MB TotalAlloc=230184MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
>> gc 41836 @19137.611s 0%: 0.009+2.1+0.003 ms clock, 0.036+0.39/2.0/5.7+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:07.887 [Memory]: Alloc=10MB TotalAlloc=230190MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
>> gc 41837 @19138.444s 0%: 0.008+2.1+0.004 ms clock, 0.035+0.51/2.1/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:08.887 [Memory]: Alloc=10MB TotalAlloc=230195MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
>> gc 41838 @19139.474s 0%: 0.005+2.6+0.003 ms clock, 0.023+0.37/2.5/4.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> gc 41839 @19140.173s 0%: 0.011+2.4+0.003 ms clock, 0.046+0.20/2.3/5.8+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:09.887 [Memory]: Alloc=7MB TotalAlloc=230202MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
>> gc 41840 @19140.831s 0%: 0.082+2.1+0.003 ms clock, 0.32+0.64/2.1/5.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:10.887 [Memory]: Alloc=9MB TotalAlloc=230209MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
>> gc 41841 @19141.655s 0%: 0.014+2.1+0.003 ms clock, 0.056+0.28/2.0/5.7+0.013 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> gc 41842 @19142.316s 0%: 0.006+2.7+0.003 ms clock, 0.027+0.29/2.6/6.2+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:11.888 [Memory]: Alloc=6MB TotalAlloc=230216MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
>> gc 41843 @19142.942s 0%: 0.010+2.1+0.005 ms clock, 0.040+0.29/2.0/5.7+0.023 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:12.888 [Memory]: Alloc=9MB TotalAlloc=230223MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
>> gc 41844 @19143.724s 0%: 0.008+2.4+0.004 ms clock, 0.035+0.38/2.0/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
>> gc 41845 @19144.380s 0%: 10+9.3+0.044 ms clock, 43+6.1/9.2/4.4+0.17 ms cpu, 11->11->6 MB, 12 MB goal, 4 P
>> INFO 2019-06-30T08:46:13.901 [Memory]: Alloc=6MB TotalAlloc=230230MB Sys=136MB RSS=98MB HeapSys=94MB HeapIdle=83MB HeapInUse=11MB HeapReleased=35MB
>> gc 41846 @19144.447s 0%: 0.008+26+0.005 ms clock, 0.033+0.46/7.8/26+0.020 ms cpu, 11->12->9 MB, 12 MB goal, 4 P
>> gc 41847 @19144.672s 0%: 0.013+76+0.006 ms clock, 0.053+0.20/6.4/80+0.024 ms cpu, 17->18->8 MB, 18 MB goal, 4 P
>> gc 41848 @19145.014s 0%: 0.008+172+0.005 ms clock, 0.035+0.13/8.5/177+0.022 ms cpu, 15->17->10 MB, 16 MB goal, 4 P
>> gc 41849 @19145.298s 0%: 0.007+285+0.006 ms clock, 0.030+10/285/7.6+0.024 ms cpu, 19->23->15 MB, 20 MB goal, 4 P
>> INFO 2019-06-30T08:46:15.052 [Memory]: Alloc=22MB TotalAlloc=230264MB Sys=598MB RSS=531MB HeapSys=265MB HeapIdle=240MB HeapInUse=25MB HeapReleased=164MB
>> gc 41850 @19145.665s 0%: 10+419+0.005 ms clock, 43+489/158/0.60+0.021 ms cpu, 26->30->17 MB, 30 MB goal, 4 P
>> gc 41851 @19146.325s 0%: 21+798+0.036 ms clock, 86+990/401/0+0.14 ms cpu, 28->42->30 MB, 34 MB goal, 4 P
>> INFO 2019-06-30T08:46:16.613 [Memory]: Alloc=41MB TotalAlloc=230303MB Sys=995MB RSS=861MB HeapSys=199MB HeapIdle=155MB HeapInUse=44MB HeapReleased=54MB
>> 
>> I also captured the OOM log from dmesg here: 
>> https://gist.github.com/mightyguava/7ecc6fc55f5cd925062d6beede3783b3
>> 
>> -- 
>> Yunchi
>> 
> 
> 
> 
> -- 
> Yunchi
> 
> 
