oisheeaa commented on issue #5879:
URL: https://github.com/apache/couchdb/issues/5879#issuecomment-3845837096
> Do you see any crashes in the logs during that time?
>
> Check the output of `/_node/_local/_system` for each node (also accessible as `_node/$nodename/_system`). In that response, what does `"memory"` look like? Any elevated counts in `"message_queues"`?
>
> It could be related to OTP 27. We have been using it in production for quite a while and haven't observed anything unusual.
>
> Are there any particular requests that are slow during that time? Say, document updates, view reads, changes?
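
For reference, this is the kind of per-node `_system` check we can run to answer the memory / message_queues question (a minimal sketch; the port, node names, credentials, and the "elevated" cutoff are placeholders for our environment):

```python
# Minimal sketch: print "memory" and any elevated "message_queues" entries
# from /_node/<name>/_system. Port, node names, and credentials are
# placeholders for our environment.
import base64
import json
import urllib.request

BASE = "http://127.0.0.1:5984"                        # assumed local CouchDB port
AUTH = base64.b64encode(b"admin:password").decode()   # placeholder credentials
NODES = ["_local"]                                    # or explicit node names per host

def get_json(path):
    req = urllib.request.Request(BASE + path,
                                 headers={"Authorization": "Basic " + AUTH})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

for node in NODES:
    system = get_json(f"/_node/{node}/_system")
    print(f"== {node} ==")
    print("memory:", json.dumps(system["memory"], indent=2))
    for name, entry in system.get("message_queues", {}).items():
        # Entries are either a plain integer or a dict with a "count" key.
        depth = entry if isinstance(entry, int) else entry.get("count", 0)
        if depth > 1000:  # arbitrary "elevated" cutoff
            print(f"  elevated queue: {name} = {depth}")
```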
We do have crash evidence, and it started the week we upgraded to 3.5.1.
1) Crashes during the incident windows
Staging2 (single node):
CouchDB was down for ~10 minutes, and systemd shows it was OOM-killed:
journalctl:
`couchdb.service: A process of this unit has been killed by the OOM killer`
`Failed with result 'oom-kill'`
This did not happen on the previous CouchDB version; it started the same week we moved staging2 to 3.5.1.
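
Since staging2 only left us the OOM-kill line in journalctl, something like the following would catch the beam.smp RSS buildup next time (a rough sketch; assumes psutil is installed, and the 4 GB threshold / 30 s interval are arbitrary for the 8 GB box):

```python
# Sketch: log beam.smp resident memory every 30 s so the growth leading up to
# an OOM kill is visible. Assumes psutil is installed; the 4 GB warning
# threshold and the polling interval are arbitrary for an 8 GB instance.
import time
import psutil

INTERVAL_S = 30
WARN_BYTES = 4 * 1024**3

while True:
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "beam.smp" and proc.info["memory_info"]:
            rss = proc.info["memory_info"].rss
            flag = "  <-- HIGH" if rss > WARN_BYTES else ""
            print(f"{time.strftime('%F %T')} beam.smp rss={rss / 1024**2:.0f} MiB{flag}")
    time.sleep(INTERVAL_S)
```

Plain polling is enough here; the OOM-kill line alone doesn't show how fast the growth was.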
Preprod-east & west (3-node cluster)
During the 3.5.1 rolling upgrade attempt, the cluster fully dropped (both
primary + secondary target groups went unhealthy). We hit the same `os_mon`
failure pattern we’ve seen before when memory pressure spikes:
- When primary went down / cluster instability on secondary:
  - `mem3_distribution : node couchdb@primary down, reason: net_tick_timeout`
  - `rexi_server_mon : cluster unstable` -> later cluster stable
  - `gen_server memsup terminated with reason: {port_died,normal}`
  - `gen_server disksup terminated with reason: {port_died,normal}`
Then, after the primary refresh, we saw on trinary:
- `gen_server disksup terminated with reason: {port_died,normal}`
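
For the cluster side, a membership/liveness poll like the one below would show these net_tick_timeout windows as they happen (a sketch; hostnames and credentials are placeholders, and it only diffs `all_nodes` against `cluster_nodes`):

```python
# Sketch: poll /_up and /_membership on each node so a net_tick_timeout style
# partition shows up as an unreachable node or as a mismatch between
# all_nodes and cluster_nodes. Hostnames and credentials are placeholders.
import base64
import json
import urllib.error
import urllib.request

NODES = {
    "couchdb@primary": "http://primary:5984",
    "couchdb@secondary": "http://secondary:5984",
    "couchdb@trinary": "http://trinary:5984",
}
AUTH = base64.b64encode(b"admin:password").decode()  # placeholder credentials

def fetch(url):
    req = urllib.request.Request(url, headers={"Authorization": "Basic " + AUTH})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

for name, base in NODES.items():
    try:
        fetch(base + "/_up")  # non-2xx raises, i.e. node not serving requests
        membership = fetch(base + "/_membership")
        mismatch = set(membership["all_nodes"]) ^ set(membership["cluster_nodes"])
        status = "ok" if not mismatch else f"membership mismatch: {sorted(mismatch)}"
    except (urllib.error.URLError, OSError) as err:
        status = f"unreachable ({err})"
    print(f"{name}: {status}")
```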
Operationally, the pattern was:
- scale down primary first (stable)
- take down secondary -> primary memory spikes -> target groups unhealthy / lost connectivity
- scale secondary back up -> we could reach it via SSM, but the ALB still reported it unhealthy
- scale down trinary -> secondary went unhealthy briefly, then recovered
- primary got refreshed again -> secondary dropped again with a memory spike / unreachable
So we're seeing memory/process pressure symptoms across both single-node and cluster setups on 3.5.1.
2) `/_node/*/_system` + queues / process counts (what we have already)
From preprod today (same 3.5.1 stack), on one node:
- process_count: 121,753
- process_limit: 262,144
- beam.smp RSS: ~3.1 GB (~3,106,652 KB) on an 8 GB instance
3) We haven't captured a single endpoint that is consistently slow; what we do have is compaction pressure showing up during the same periods:
- `/_active_tasks` showed `database_compaction` on `_global_changes` (`document_copy` phase, large `total_changes`)
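
For completeness, the `/_active_tasks` check can be scripted roughly like this to watch compaction progress during the slow windows (a sketch; base URL and credentials are placeholders):

```python
# Sketch: list database_compaction tasks from /_active_tasks with their
# progress, so a long-running compaction (e.g. on _global_changes) is easy to
# spot during a slow window. Base URL and credentials are placeholders.
import base64
import json
import urllib.request

BASE = "http://127.0.0.1:5984"
AUTH = base64.b64encode(b"admin:password").decode()  # placeholder credentials

req = urllib.request.Request(BASE + "/_active_tasks",
                             headers={"Authorization": "Basic " + AUTH})
with urllib.request.urlopen(req, timeout=10) as resp:
    tasks = json.load(resp)

for task in tasks:
    if task.get("type") == "database_compaction":
        print(f'{task.get("database")}: {task.get("progress", 0)}% '
              f'({task.get("changes_done", 0)}/{task.get("total_changes", 0)} changes, '
              f'phase={task.get("phase", "?")})')
```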