Before assuming it is the GC or something system-related, you may wish to verify it is *not your own logic*. A growing RSS could just as easily come from your own code touching more and more memory through some runaway effect. A priori, the probability that this is a GC bug is very low, given how widely the Go runtime is exercised, while the probability of a bug in your new code is much higher, given that it runs on a much, much smaller scale.
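One way to see whose memory it is: log the runtime's view of the heap next to your own request metrics, much like the INFO dumps below, and watch which counters track the RSS growth. A minimal sketch (the one-second interval and log format are illustrative, not taken from the original service):

package main

import (
    "log"
    "runtime"
    "time"
)

// logMemStats periodically samples runtime memory statistics so that
// growth driven by your own allocations (Alloc, HeapInuse) can be
// compared against what the OS sees (Sys) and the goroutine count.
func logMemStats(interval time.Duration) {
    var m runtime.MemStats
    for range time.Tick(interval) {
        runtime.ReadMemStats(&m)
        log.Printf("[Memory] Alloc=%dMB Sys=%dMB HeapInuse=%dMB HeapReleased=%dMB goroutines=%d",
            m.Alloc>>20, m.Sys>>20, m.HeapInuse>>20, m.HeapReleased>>20,
            runtime.NumGoroutine())
    }
}

func main() {
    go logMemStats(time.Second)
    select {} // stand-in for the real service
}

If the goroutine count or HeapInuse climbs together with RSS, the runaway is in your logic; a flat heap under a growing RSS points elsewhere.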
This has the "smell" of a concurrency bug. If I were you, I'd test the code for races, review it thoroughly with someone who doesn't know it (so that I'm forced to explain it), and add plenty of assertions. I'd probably look first at the circuit-breaker code -- for instance, how does it know how many concurrent connections exist? (A race-free sketch of such a counter follows below.) In general, for any hypothesis you come up with, you should have a test that *catches* the bug given that hypothesis. Elusive bugs tend to become more elusive while you are on the hunt, as you fix other problems you discover along the way. I even suspect you're looking at GC logs a bit too early. Instrument your own code and look at what patterns emerge. [Not to mention that any time you spend understanding your own code will help improve your service, while a better understanding of and ability to debug the GC won't necessarily help you!]
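For the circuit-breaker counter specifically, the failure mode to look for is an unsynchronized in-flight count: a plain int incremented from many goroutines races and can undercount, so the breaker never trips even when it should. A race-free sketch using an atomic counter; the names here are hypothetical, since the original breaker code isn't shown:

package breaker

import (
    "errors"
    "sync/atomic"
)

var ErrTooBusy = errors.New("breaker: too many concurrent requests")

// Breaker rejects work once more than limit requests are in flight.
type Breaker struct {
    limit    int64
    inFlight int64 // updated only via sync/atomic
}

func New(limit int64) *Breaker { return &Breaker{limit: limit} }

// Do runs fn unless the concurrency limit would be exceeded.
func (b *Breaker) Do(fn func() error) error {
    if n := atomic.AddInt64(&b.inFlight, 1); n > b.limit {
        atomic.AddInt64(&b.inFlight, -1)
        return ErrTooBusy
    }
    defer atomic.AddInt64(&b.inFlight, -1)
    return fn()
}

Running the service's tests with `go test -race` would flag the unsynchronized variant immediately.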
> On Jul 1, 2019, at 12:14 PM, 'Yunchi Luo' via golang-nuts <golang-nuts@googlegroups.com> wrote:
>
> Hello, I'd like to solicit some help with a weird GC issue we are seeing.
>
> I'm trying to debug an OOM on a service we are running in k8s. The service is just a CRUD server hitting a database (DynamoDB). Each replica serves about 300 qps of traffic. There are no memory leaks. On occasion (seemingly correlated with small latency spikes on the backend), the service would OOM. This is surprising because it has a circuit breaker that drops requests beyond 200 concurrent connections, which never trips, and goroutine profiles confirm that there are nowhere near 200 active goroutines.
>
> GC logs are pasted below, interlaced with dumps of runtime.MemStats (the RSS number comes from /proc/<pid>/stats). Go version is 1.12.5, running in an Alpine 3.10 container on an Amazon kernel 4.14.123-111.109.amzn2.x86_64.
>
> The service happily serves requests using ~50MB of RSS for hours, until the last 2 seconds, when GC mark & sweep time starts growing 2-4X per cycle (43+489/158/0.60+0.021 ms cpu => 86+990/401/0+0.14 ms cpu), and RSS and Sys blow up. The last log line is also interesting: `Sys=995MB RSS=861MB HeapSys=199MB`. If I'm reading this correctly, there's at least `662MB` of memory in RSS that is not assigned to the heap. This might be due to the change in Go 1.12 to use MADV_FREE, so the pages are freeable but not yet reclaimed by the kernel.
>
> I don't understand how the heap can stay so small across GC cycles (28->42->30 MB on the last line means the heap doesn't grow past 42MB?), yet RSS keeps growing. I'm assuming the increased RSS is causing the kernel to OOM the service, but that should only happen if the RSS is not freeable as marked by MADV_FREE, and there doesn't seem to be any indication of that in the GC logs. I guess this all comes down to me not having a good understanding of how the GC algorithm works and how to read these logs. I'd really appreciate it if anyone can explain what's happening and why.
>
> gc 41833 @19135.227s 0%: 0.019+2.3+0.005 ms clock, 0.079+0.29/2.2/5.6+0.020 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:04.886 [Memory]: Alloc=7MB TotalAlloc=230172MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
> gc 41834 @19135.869s 0%: 0.005+2.9+0.003 ms clock, 0.023+0.32/2.5/6.6+0.012 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:05.886 [Memory]: Alloc=9MB TotalAlloc=230179MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> gc 41835 @19136.704s 0%: 0.038+2.1+0.004 ms clock, 0.15+0.35/2.1/5.3+0.016 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:06.886 [Memory]: Alloc=9MB TotalAlloc=230184MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> gc 41836 @19137.611s 0%: 0.009+2.1+0.003 ms clock, 0.036+0.39/2.0/5.7+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:07.887 [Memory]: Alloc=10MB TotalAlloc=230190MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
> gc 41837 @19138.444s 0%: 0.008+2.1+0.004 ms clock, 0.035+0.51/2.1/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:08.887 [Memory]: Alloc=10MB TotalAlloc=230195MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=49MB HeapInUse=12MB HeapReleased=5MB
> gc 41838 @19139.474s 0%: 0.005+2.6+0.003 ms clock, 0.023+0.37/2.5/4.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> gc 41839 @19140.173s 0%: 0.011+2.4+0.003 ms clock, 0.046+0.20/2.3/5.8+0.015 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:09.887 [Memory]: Alloc=7MB TotalAlloc=230202MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
> gc 41840 @19140.831s 0%: 0.082+2.1+0.003 ms clock, 0.32+0.64/2.1/5.3+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:10.887 [Memory]: Alloc=9MB TotalAlloc=230209MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=12MB HeapReleased=5MB
> gc 41841 @19141.655s 0%: 0.014+2.1+0.003 ms clock, 0.056+0.28/2.0/5.7+0.013 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> gc 41842 @19142.316s 0%: 0.006+2.7+0.003 ms clock, 0.027+0.29/2.6/6.2+0.014 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:11.888 [Memory]: Alloc=6MB TotalAlloc=230216MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=51MB HeapInUse=11MB HeapReleased=5MB
> gc 41843 @19142.942s 0%: 0.010+2.1+0.005 ms clock, 0.040+0.29/2.0/5.7+0.023 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:12.888 [Memory]: Alloc=9MB TotalAlloc=230223MB Sys=69MB RSS=51MB HeapSys=62MB HeapIdle=50MB HeapInUse=11MB HeapReleased=5MB
> gc 41844 @19143.724s 0%: 0.008+2.4+0.004 ms clock, 0.035+0.38/2.0/5.7+0.017 ms cpu, 11->11->5 MB, 12 MB goal, 4 P
> gc 41845 @19144.380s 0%: 10+9.3+0.044 ms clock, 43+6.1/9.2/4.4+0.17 ms cpu, 11->11->6 MB, 12 MB goal, 4 P
> INFO 2019-06-30T08:46:13.901 [Memory]: Alloc=6MB TotalAlloc=230230MB Sys=136MB RSS=98MB HeapSys=94MB HeapIdle=83MB HeapInUse=11MB HeapReleased=35MB
> gc 41846 @19144.447s 0%: 0.008+26+0.005 ms clock, 0.033+0.46/7.8/26+0.020 ms cpu, 11->12->9 MB, 12 MB goal, 4 P
> gc 41847 @19144.672s 0%: 0.013+76+0.006 ms clock, 0.053+0.20/6.4/80+0.024 ms cpu, 17->18->8 MB, 18 MB goal, 4 P
> gc 41848 @19145.014s 0%: 0.008+172+0.005 ms clock, 0.035+0.13/8.5/177+0.022 ms cpu, 15->17->10 MB, 16 MB goal, 4 P
> gc 41849 @19145.298s 0%: 0.007+285+0.006 ms clock, 0.030+10/285/7.6+0.024 ms cpu, 19->23->15 MB, 20 MB goal, 4 P
> INFO 2019-06-30T08:46:15.052 [Memory]: Alloc=22MB TotalAlloc=230264MB Sys=598MB RSS=531MB HeapSys=265MB HeapIdle=240MB HeapInUse=25MB HeapReleased=164MB
> gc 41850 @19145.665s 0%: 10+419+0.005 ms clock, 43+489/158/0.60+0.021 ms cpu, 26->30->17 MB, 30 MB goal, 4 P
> gc 41851 @19146.325s 0%: 21+798+0.036 ms clock, 86+990/401/0+0.14 ms cpu, 28->42->30 MB, 34 MB goal, 4 P
> INFO 2019-06-30T08:46:16.613 [Memory]: Alloc=41MB TotalAlloc=230303MB Sys=995MB RSS=861MB HeapSys=199MB HeapIdle=155MB HeapInUse=44MB HeapReleased=54MB
>
> I also captured the OOM log from dmesg here: https://gist.github.com/mightyguava/7ecc6fc55f5cd925062d6beede3783b3
>
> --
> Yunchi
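On the MADV_FREE point in the message above: one way to falsify that theory is to take the laziness out of the picture. Since Go 1.12 the runtime returns pages to the kernel with MADV_FREE, so the kernel reclaims them only under memory pressure and RSS stays high in the meantime; running the binary with GODEBUG=madvdontneed=1 switches back to MADV_DONTNEED, so RSS drops as soon as memory is released. Alternatively, debug.FreeOSMemory forces a GC and returns freed memory to the OS immediately. A minimal sketch of the latter as an on-demand trigger; the endpoint and port are made up for illustration:

package main

import (
    "log"
    "net/http"
    "runtime/debug"
)

func main() {
    // Hitting /freemem forces a GC and returns as much memory to the
    // OS as possible, bypassing MADV_FREE's lazy reclamation.
    http.HandleFunc("/freemem", func(w http.ResponseWriter, r *http.Request) {
        debug.FreeOSMemory()
        w.Write([]byte("freed\n"))
    })
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}

If RSS keeps climbing toward the OOM even after forcing memory back to the OS, the growth is real and not an artifact of MADV_FREE accounting.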