oisheeaa commented on issue #5879:
URL: https://github.com/apache/couchdb/issues/5879#issuecomment-3862974009

   > Thanks for responding [@oisheeaa](https://github.com/oisheeaa).
   > 
   > From the log evidence, it doesn't seem like there is any specific error 
in the logs, for example from a process repeatedly crashing? The OOM logs and 
`mem3_distribution : node couchdb@primary down, reason: net_tick_timeout` come 
from nodes disconnecting or being killed by the OOM killer.
   > 
   > > During the 3.5.1 rolling upgrade attempt, the cluster fully dropped 
(both primary + secondary target groups went unhealthy).
   > 
   > During the upgrade, before taking the node down, do you put the node in 
maintenance mode? That's what we usually do: enable maintenance mode on the node, 
then wait some time (a few minutes) for all the connections to drain before 
upgrading (the load balancer is automatically set up to exclude the node from 
routing based on the output of the `/_up` endpoint).
   > 
   > > /_node/*/_system + queues / process counts
   > 
   > When it gets close to OOM-ing, what does the memory breakdown look like 
(how many bytes are used by processes, binaries, etc.)?
   > 
   > Wonder if you could share some of your vm.args or config settings? 
Especially if you have any custom settings (scheduler changes, busy waiting, or 
allocator settings). On the config side, do you have any custom limits, and 
what is your max_dbs_open set to?
   > 
   > In general, it's hard to tell what may be using the memory. One way to 
try to control it might be to lower max_dbs_open a bit (but if you lower it 
too much you might see all_dbs_active errors in the logs).
   > 
   > I have never tried these settings, but if nothing else works it may be 
something to experiment with (especially if you have a test/staging environment):
   > 
   > 
https://elixirforum.com/t/elixir-erlang-docker-containers-ram-usage-on-different-oss-kernels/57251/12
   
   Thanks for your response. To add more context from our side:
   - There's no single process repeatedly crashing in the logs. What we 
consistently see across staging2, preprod, and a separate DevOps test cluster 
is:
       - steady memory growth
       - an eventual OOM kill by the OS
       - followed by node disconnects
   - During the 3.5.1 rolling upgrade, nodes were drained from the load 
balancer before replacement (AWS target group health checks + deregistration). 
We did not explicitly put the nodes into CouchDB maintenance mode first (a 
sketch of that drain step is included below)
   - Shortly before the OOM:
       - ~3 GB in use on the 8 GB instances
       - overall memory usage at ~70%, climbing under load
       - `/_node/_local/_system` showed an elevated process_count (100k+), with 
process_limit at 262144 (sampling sketch below)
   - We are running a largely stock vm.args with a small number of common 
production tweaks: no custom allocator settings and no scheduler busy-wait 
changes
   ```
   # node name and shared Erlang distribution cookie
   -name couchdb@<node-fqdn>
   -setcookie <shared-cookie>
   # tell kernel and SASL not to log anything (CouchDB handles logging)
   -kernel error_logger silent
   -sasl sasl_error_logger false
   # keep the pre-OTP 25 global behaviour for network partitions
   -kernel prevent_overlapping_partitions false
   # 16 dirty I/O scheduler threads
   +SDio 16
   # distribution buffer busy limit (KB)
   +zdbbl 32768
   # disable the break handler, run without reading stdin
   +Bd -noinput
   # TLS session lifetime (seconds)
   -ssl session_lifetime 300
   ```
   - Additional reproduction under controlled load
   In addition to staging2 and preprod, we were able to reproduce the same 
behaviour in a separate 3-node DevOps test cluster running CouchDB 3.5.1 on 
smaller instances (t3.small).
   Under a sustained concurrent workload involving:
       - large numbers of database creates / deletes
       - concurrent document writes
       - high worker counts
   both the primary and secondary nodes were OOM-killed once memory usage 
crossed ~80%, while the third node remained healthy. This failure mode was 
consistent with what we observed in staging2 and preprod (a stripped-down 
approximation of this workload is sketched below).
   For comparison, our production cluster (still on 3.4.2) runs with:
       - significantly fewer open databases during non-peak hours (<1k vs 8k+ 
on the 3.5.1 test clusters)
       - the same max_dbs_open = 60000 (a config sketch for the lowering 
experiment suggested above is included below)
       - similar ulimit settings
   and has not exhibited this behaviour.
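
   For the next upgrade attempt we can add the maintenance-mode drain step 
suggested above before pulling each node. A minimal sketch of what that would 
look like, assuming the documented `/_node/_local/_config/couchdb/maintenance_mode` 
and `/_up` endpoints; the host, credentials, and timings are placeholders, not 
our real values:
   ```python
   # Sketch of a maintenance-mode drain before upgrading a node.
   # Host, credentials, and timings are placeholders.
   import base64
   import json
   import time
   import urllib.error
   import urllib.request

   COUCH = "http://127.0.0.1:5984"  # the node about to be upgraded (placeholder)
   AUTH = base64.b64encode(b"admin:password").decode()  # placeholder credentials


   def couch(method, path, body=None):
       req = urllib.request.Request(
           COUCH + path,
           data=None if body is None else json.dumps(body).encode(),
           method=method,
           headers={"Authorization": "Basic " + AUTH,
                    "Content-Type": "application/json"},
       )
       return urllib.request.urlopen(req)


   # 1. Put the node into maintenance mode (the config value is a JSON string).
   couch("PUT", "/_node/_local/_config/couchdb/maintenance_mode", "true")

   # 2. /_up returns 404 once maintenance_mode is set; wait for that, then give
   #    the load balancer a drain window before stopping CouchDB for the upgrade.
   while True:
       try:
           couch("GET", "/_up")
       except urllib.error.HTTPError as err:
           if err.code == 404:
               break
           raise
       time.sleep(5)

   time.sleep(300)  # connection drain window (placeholder)
   ```
   After the upgraded node is back, setting the same config key back to 
`"false"` should return it to rotation once `/_up` responds with 200 again.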
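
   On the memory-breakdown question, we will sample the `memory` section of 
`/_node/_local/_system` (processes, binary, ets, ...) together with 
process_count as the nodes approach the OOM and follow up with the numbers. A 
minimal sketch of the sampling loop (host and credentials are placeholders):
   ```python
   # Sample /_node/_local/_system periodically and print the memory breakdown.
   # Host and credentials are placeholders.
   import base64
   import json
   import time
   import urllib.request

   COUCH = "http://127.0.0.1:5984"
   AUTH = base64.b64encode(b"admin:password").decode()

   while True:
       req = urllib.request.Request(
           COUCH + "/_node/_local/_system",
           headers={"Authorization": "Basic " + AUTH},
       )
       with urllib.request.urlopen(req) as resp:
           stats = json.load(resp)

       mem = stats["memory"]
       print(
           time.strftime("%H:%M:%S"),
           "processes=%dMB" % (mem["processes"] // 2**20),
           "binary=%dMB" % (mem["binary"] // 2**20),
           "ets=%dMB" % (mem["ets"] // 2**20),
           "other=%dMB" % (mem["other"] // 2**20),
           "process_count=%d" % stats["process_count"],
       )
       time.sleep(30)
   ```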
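
   To make the DevOps-cluster reproduction easier to discuss, here is a 
stripped-down approximation of that workload (not our actual test harness): a 
handful of workers churning database creates/deletes while a larger pool does 
concurrent document writes. Host, credentials, worker counts, and database 
names are placeholders:
   ```python
   # Approximation of the test workload: concurrent database create/delete
   # churn plus concurrent document writes. Placeholders throughout.
   import base64
   import json
   import threading
   import urllib.error
   import urllib.request
   import uuid

   COUCH = "http://127.0.0.1:5984"
   AUTH = base64.b64encode(b"admin:password").decode()
   HEADERS = {"Authorization": "Basic " + AUTH, "Content-Type": "application/json"}


   def request(method, path, body=None):
       req = urllib.request.Request(
           COUCH + path,
           data=None if body is None else json.dumps(body).encode(),
           method=method,
           headers=HEADERS,
       )
       try:
           urllib.request.urlopen(req).read()
       except urllib.error.HTTPError:
           pass  # ignore conflicts/races, we only care about the load pattern


   def db_churn(worker_id, iterations=1000):
       # Repeatedly create and delete short-lived databases.
       for i in range(iterations):
           name = "churn_%d_%d" % (worker_id, i)
           request("PUT", "/" + name)
           request("DELETE", "/" + name)


   def doc_writes(worker_id, iterations=10000):
       # Concurrent document writes into a shared database.
       request("PUT", "/write_target")
       for _ in range(iterations):
           doc = {"_id": uuid.uuid4().hex, "worker": worker_id, "payload": "x" * 1024}
           request("POST", "/write_target", doc)


   threads = (
       [threading.Thread(target=db_churn, args=(i,)) for i in range(10)]
       + [threading.Thread(target=doc_writes, args=(i,)) for i in range(40)]
   )
   for t in threads:
       t.start()
   for t in threads:
       t.join()
   ```
   The exact iteration counts are not important; the point is sustained 
database churn combined with concurrent writes, which is the pattern that 
preceded the OOMs in our tests.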
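
   On the max_dbs_open suggestion: we can try lowering it on the 3.5.1 test 
cluster and watch the logs for all_dbs_active errors. A sketch of the config 
calls, assuming the standard per-node `_config` API (host, credentials, and the 
trial value are placeholders):
   ```python
   # Check the current max_dbs_open and lower it on one node to experiment.
   # Host, credentials, and the new value are placeholders.
   import base64
   import json
   import urllib.request

   COUCH = "http://127.0.0.1:5984"
   AUTH = base64.b64encode(b"admin:password").decode()
   HEADERS = {"Authorization": "Basic " + AUTH, "Content-Type": "application/json"}


   def config(method, key, value=None):
       req = urllib.request.Request(
           COUCH + "/_node/_local/_config/couchdb/" + key,
           data=None if value is None else json.dumps(value).encode(),
           method=method,
           headers=HEADERS,
       )
       with urllib.request.urlopen(req) as resp:
           return json.load(resp)


   print("current max_dbs_open:", config("GET", "max_dbs_open"))

   # Lower it a bit, then watch the logs for all_dbs_active errors,
   # which would mean the limit went too low.
   print("previous value:", config("PUT", "max_dbs_open", "5000"))
   ```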
   
   

