AlexanderKaraberov opened a new issue #2232: Elusive refc binary memory leaks 
in index_updater and replicator_worker processes
URL: https://github.com/apache/couchdb/issues/2232
 
 
   ## Description
   
   After we switched to sharding all our databases by default with a 
sharding factor of 8 (`q=8`), I started to observe increased RAM 
consumption on a CouchDB node. Some increase is expected, since there are now 
many more `couch_db_updater`, `index_updater` and `couch_file` processes, each 
consuming a non-zero amount of memory. But the increase doesn't seem 
proportional: the Linux kernel reported RSS for `beam.smp` jumped from our 
usual 6GB to 22GB in some cases. I suspected `refc` binaries were to blame, so 
I ran a function which takes a snapshot of the number of binary refs in all VM 
processes, garbage-collects the entire node, then takes another snapshot and 
produces a diff (Fred Hébert's `recon` library, which we ship with our CouchDB, 
nicely abstracts this). I managed to free almost 10GB on the node. After this 
I waited ~15-20 minutes (this was happening during a busy day on one of our 
servers) while `msacc` reported the node was under heavy utilisation: RSS for 
`beam.smp` went up again by 3.2GB and VSZ by 3.56GB. Suffice it to say that 
after running the same routine in `remsh`, the memory figures went back to the 
previous values. Looking at the `bin_leak` report of how many individual 
binaries were held and then freed by each process as a delta, I noticed that 
the top offenders were `couch_replicator_worker` and `couch_index_updater`. 
I think this partially explains the current situation, because those are 
definitely not short-lived processes and therefore keep holding ProcBins that 
reference large binary document contents, etc. To conclude, I don't see a 
quick fix here short of some major refactoring that introduces short-lived or 
temporary one-off processes, which might not be compatible with the current 
indexer and replicator architecture. But perhaps there are some quick wins 
that could be incorporated in terms of CouchDB's memory utilisation?
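
   For reference, the diagnostic routine I ran in `remsh` is roughly the 
following sketch against `recon`'s public API (the report size of `10` is an 
arbitrary choice, not significant):

   ```erlang
   %% recon:bin_leak/1 does the snapshot -> full GC -> snapshot -> diff
   %% described above: it returns the N processes that released the most
   %% refc binaries after being garbage-collected.
   Top = recon:bin_leak(10).

   %% The equivalent manual full-node GC, without the diff report:
   [erlang:garbage_collect(Pid) || Pid <- erlang:processes()].
   ```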
   
   ## Expected Behaviour
   
   In an ideal world I would like not to see these elusive microscopic leaks, 
which in the grand scheme of things (sharding with 8/16 + 800 heavily utilised 
DBs + replications) result in 10-12GB of RAM held for nothing. 
   `ERL_FULLSWEEP_AFTER=0` (or some small value) is unfortunately not an 
option, because it slows down node performance, and looking at the microstate 
accounting I was not happy to see some schedulers spending 100% of their time 
in GC. For now we have decided to run a full-sweep garbage collection manually 
at one-hour intervals from one of our auxiliary Erlang VM monitoring processes 
connected to CouchDB as `hidden`. This seems to work well without the 
downsides of doing full sweeps after 0 minor collections. But this is still a 
kludge and doesn't sound like a good approach. 
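   
   The workaround kludge looks roughly like the following sketch (module name 
and the fixed one-hour interval are illustrative placeholders, not our actual 
code; the auxiliary VM is started with `-hidden` and connected to the CouchDB 
node):

   ```erlang
   -module(gc_kludge).
   -export([start/1]).

   %% Periodically force a full-sweep GC of every process on the remote
   %% CouchDB node, driven from a hidden auxiliary node.
   start(CouchNode) ->
       spawn(fun() -> loop(CouchNode) end).

   loop(CouchNode) ->
       timer:sleep(timer:hours(1)),
       %% Fetch the remote pid list, then garbage-collect each process;
       %% erlang:garbage_collect/1 performs a full-sweep collection.
       Pids = rpc:call(CouchNode, erlang, processes, []),
       [rpc:call(CouchNode, erlang, garbage_collect, [Pid]) || Pid <- Pids],
       loop(CouchNode).
   ```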
   
   ## Your Environment
   
   * CouchDB Version used: 2.3.0 + custom patch set
   * Operating System and version: Debian 9 (stretch), Linux kernel 4.9.189.
