AlexanderKaraberov opened a new issue #2232: Elusive refc binary memory leaks in index_updater and replicator_worker processes
URL: https://github.com/apache/couchdb/issues/2232

## Description

After we switched to sharding all our databases by default with a sharding factor of 8 (`q=8`), I started to observe increased RAM consumption on a CouchDB node. Some increase is entirely explainable, as there are now many more `couch_db_updater`, `index_updater` and `couch_file` processes, each consuming some non-zero amount of memory. But this increase doesn't seem proportional: the Linux kernel reported RSS for `beam.smp` jumped from our standard 6GB to 22GB in some cases.

I suspected refc binaries were to blame, so I ran a function which takes a snapshot of the number of binary refs in all VM processes, GCs the node entirely, then takes another snapshot and makes a diff (Fred Hebert's `recon` library, which we use with our CouchDB, nicely abstracts this). I managed to free almost 10GB on the node. After this I waited ~15-20 mins (this was happening during a busy day on one of our servers), and `msacc` reported that the node was under heavy utilisation: RSS for `beam.smp` went up again by 3.2GB and VSZ by 3.56GB. Suffice it to say that after running the same routine in the `remsh`, memory reports went back to the previous values.

Looking at the `bin_leak` report of how many individual binaries were held and then freed by each process (as a delta), I noticed the top offenders were `couch_replicator_worker` and `couch_index_updater`. I think this partially explains the current situation, because those are definitely not short-lived processes and therefore hold ProcBins referencing big binary content of documents, etc. To conclude, I don't see a quick fix here short of some major refactoring that introduces short-lived or temporary one-off processes, which eventually might not be compatible with the current indexer and replicator architecture.
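For reference, the snapshot/GC/diff routine described above is what `recon:bin_leak/1` does; a sketch of the remsh session (calls are real recon APIs, but this assumes a remsh already attached to the affected node):

```erlang
%% recon:bin_leak/1 snapshots the binary-reference count of every
%% process, garbage-collects all of them, snapshots again, and returns
%% the N processes that released the most binary references.
recon:bin_leak(10).

%% recon can also report the current top holders of refc binaries
%% without forcing a GC, which is useful for ongoing monitoring:
recon:proc_count(binary_memory, 10).
```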
But perhaps there are some quick wins we could incorporate in terms of how CouchDB handles binaries?

## Expected Behaviour

In an ideal world I would like not to see these elusive microscopic leaks, which in the grand scheme of things (sharding with q=8/16 + 800 heavily utilised DBs + replications) result in 10-12GB of RAM taken for nothing. Setting `ERL_FULLSWEEP_AFTER` to 0 or some small value is unfortunately not an option, because it slows down node performance; looking at the microstate accounting, I was not happy to see some schedulers at 100% GC. For now we have decided to run full-sweep garbage collection manually at one-hour intervals from one of our auxiliary Erlang VM monitoring processes connected to CouchDB as `hidden`. This seems to work well without the downsides of doing full sweeps after 0 minor ones. But this is still a kludge and doesn't sound like a good approach.

## Your Environment

* CouchDB Version used: 2.3.0 + custom patch set
* Operating System and version: Debian 9 (stretch), Linux kernel 4.9.189.
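For completeness, the hourly kludge looks roughly like this (module, function, and node names are illustrative, not CouchDB APIs; the `rpc` and `timer` calls are standard OTP):

```erlang
-module(couch_gc_kludge).
-export([start/1, full_sweep/1]).

%% Runs on the auxiliary monitoring node connected as hidden.
%% Schedules a full sweep of the CouchDB node once per hour.
start(CouchNode) ->
    timer:apply_interval(timer:hours(1), ?MODULE, full_sweep, [CouchNode]).

%% erlang:garbage_collect/1 forces a full-sweep GC on the target
%% process, releasing ProcBin references to refc binaries that
%% long-lived processes no longer use.
full_sweep(CouchNode) ->
    Pids = rpc:call(CouchNode, erlang, processes, []),
    lists:foreach(
      fun(P) -> rpc:call(CouchNode, erlang, garbage_collect, [P]) end,
      Pids).
```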
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
