oisheeaa commented on issue #5879:
URL: https://github.com/apache/couchdb/issues/5879#issuecomment-3845837096
> Do you see any crashes in the logs during that time?
>
> Check the output of `/_node/_local/_system` for each node (also accessible as `_node/$nodename/_system`). In that response, what does `"memory"` look like? Any elevated counts in `"message_queues"`?
>
> It could be related to OTP 27. We have been using it in production for quite a while and haven't observed anything unusual.
>
> Are there any particular requests that are slow during that time? Say, document updates, view reads, changes?
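
For reference, this is the kind of per-node `_system` check we can run to answer the memory / message_queues question (a minimal sketch; the port, node names, credentials, and the "elevated" cutoff are placeholders for our environment):

```python
# Minimal sketch: print "memory" and any elevated "message_queues" entries
# from /_node/<name>/_system. Port, node names, and credentials are
# placeholders for our environment.
import base64
import json
import urllib.request

BASE = "http://127.0.0.1:5984"                        # assumed local CouchDB port
AUTH = base64.b64encode(b"admin:password").decode()   # placeholder credentials
NODES = ["_local"]                                    # or explicit node names per host

def get_json(path):
    req = urllib.request.Request(BASE + path,
                                 headers={"Authorization": "Basic " + AUTH})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

for node in NODES:
    system = get_json(f"/_node/{node}/_system")
    print(f"== {node} ==")
    print("memory:", json.dumps(system["memory"], indent=2))
    for name, entry in system.get("message_queues", {}).items():
        # Entries are either a plain integer or a dict with a "count" key.
        depth = entry if isinstance(entry, int) else entry.get("count", 0)
        if depth > 1000:  # arbitrary "elevated" cutoff
            print(f"  elevated queue: {name} = {depth}")
```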
We do have crash evidence, and it started the week we upgraded to 3.5.1.
1) Crashes during the incident windows
Staging2 (single node):
CouchDB was down for ~10 minutes, and systemd shows it was OOM-killed:
journalctl:
`couchdb.service: A process of this unit has been killed by the OOM killer`
`Failed with result 'oom-kill'`
This did not happen on the previous CouchDB version; it started the same week we moved staging2 to 3.5.1.
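
Since staging2 only left us the OOM-kill line in journalctl, something like the following would catch the beam.smp RSS buildup next time (a rough sketch; assumes psutil is installed, and the 4 GB threshold / 30 s interval are arbitrary for the 8 GB box):

```python
# Sketch: log beam.smp resident memory every 30 s so the growth leading up to
# an OOM kill is visible. Assumes psutil is installed; the 4 GB warning
# threshold and the polling interval are arbitrary for an 8 GB instance.
import time
import psutil

INTERVAL_S = 30
WARN_BYTES = 4 * 1024**3

while True:
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "beam.smp" and proc.info["memory_info"]:
            rss = proc.info["memory_info"].rss
            flag = "  <-- HIGH" if rss > WARN_BYTES else ""
            print(f"{time.strftime('%F %T')} beam.smp rss={rss / 1024**2:.0f} MiB{flag}")
    time.sleep(INTERVAL_S)
```

Plain polling is enough here; the OOM-kill line alone doesn't show how fast the growth was.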
Preprod-east & west (3-node cluster)
During the 3.5.1 rolling upgrade attempt, the cluster fully dropped (both
primary + secondary target groups went unhealthy). We hit the same `os_mon`
failure pattern we’ve seen before when memory pressure spikes:
- When primary went down / cluster instability on secondary:
  - `mem3_distribution : node couchdb@primary down, reason: net_tick_timeout`
  - `rexi_server_mon : cluster unstable` -> later cluster stable
  - `gen_server memsup terminated with reason: {port_died,normal}`
  - `gen_server disksup terminated with reason: {port_died,normal}`
Then, after the primary refresh, we saw on trinary:
- `gen_server disksup terminated with reason: {port_died,normal}`
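
For the cluster side, a membership/liveness poll like the one below would show these net_tick_timeout windows as they happen (a sketch; hostnames and credentials are placeholders, and it only diffs `all_nodes` against `cluster_nodes`):

```python
# Sketch: poll /_up and /_membership on each node so a net_tick_timeout style
# partition shows up as an unreachable node or as a mismatch between
# all_nodes and cluster_nodes. Hostnames and credentials are placeholders.
import base64
import json
import urllib.error
import urllib.request

NODES = {
    "couchdb@primary": "http://primary:5984",
    "couchdb@secondary": "http://secondary:5984",
    "couchdb@trinary": "http://trinary:5984",
}
AUTH = base64.b64encode(b"admin:password").decode()  # placeholder credentials

def fetch(url):
    req = urllib.request.Request(url, headers={"Authorization": "Basic " + AUTH})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

for name, base in NODES.items():
    try:
        fetch(base + "/_up")  # non-2xx raises, i.e. node not serving requests
        membership = fetch(base + "/_membership")
        mismatch = set(membership["all_nodes"]) ^ set(membership["cluster_nodes"])
        status = "ok" if not mismatch else f"membership mismatch: {sorted(mismatch)}"
    except (urllib.error.URLError, OSError) as err:
        status = f"unreachable ({err})"
    print(f"{name}: {status}")
```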
Operationally, the pattern was:
- scale down primary first (stable)
- take down secondary -> primary memory spikes -> target groups unhealthy / lost connectivity
- scale secondary back up -> we could reach it via SSM, but the ALB still reported it unhealthy
- scale down trinary -> secondary went unhealthy briefly, then recovered
- primary got refreshed again -> secondary dropped again with a memory spike / unreachable
So we're seeing memory/process pressure symptoms across both single-node and cluster setups on 3.5.1.
2) `/_node/*/_system` + queues / process counts (what we have already)
From preprod today (same 3.5.1 stack), on one node:
- process_count: 121,753
- process_limit: 262,144
- beam.smp RSS: ~3.1 GB (~3,106,652 KB) on an 8 GB instance
3) We haven't captured a single endpoint that is consistently slow; what we do have is compaction pressure showing up during the same periods:
- `/_active_tasks` showed `database_compaction` on `_global_changes` (`document_copy` phase, large `total_changes`)
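
For completeness, the `/_active_tasks` check can be scripted roughly like this to watch compaction progress during the slow windows (a sketch; base URL and credentials are placeholders):

```python
# Sketch: list database_compaction tasks from /_active_tasks with their
# progress, so a long-running compaction (e.g. on _global_changes) is easy to
# spot during a slow window. Base URL and credentials are placeholders.
import base64
import json
import urllib.request

BASE = "http://127.0.0.1:5984"
AUTH = base64.b64encode(b"admin:password").decode()  # placeholder credentials

req = urllib.request.Request(BASE + "/_active_tasks",
                             headers={"Authorization": "Basic " + AUTH})
with urllib.request.urlopen(req, timeout=10) as resp:
    tasks = json.load(resp)

for task in tasks:
    if task.get("type") == "database_compaction":
        print(f'{task.get("database")}: {task.get("progress", 0)}% '
              f'({task.get("changes_done", 0)}/{task.get("total_changes", 0)} changes, '
              f'phase={task.get("phase", "?")})')
```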