natarajaya opened a new issue #2120: Missing metrics to monitor internal replication status
URL: https://github.com/apache/couchdb/issues/2120

This is more a general design question than a bug report.

## Description

We are running our CouchDB clusters on GKE. The setup is very simple: each cluster has 3 nodes with default settings (`q=8,n=3`). To make sure that our clusters are healthy, we monitor:

* `/_membership` data on every node, to verify that every node has connectivity to the other nodes.
* The `couchdb_httpd_request_time` [metric](https://github.com/gesellix/couchdb-prometheus-exporter/blob/master/README_metrics.md), which contains "length of a request inside CouchDB without MochiWeb", to verify that every node responds in a reasonable time.

We are looking to improve our monitoring solution to cover more failure modes, and have a couple of questions:

* Is there a way to determine that a node is lagging in processing writes?
* After split-brain situations (a node lost connectivity to the other nodes, but connectivity is now restored and the node is syncing data), is there a way to determine the "replication lag" or, in other words, the number of documents that are left to update?

## Your Environment

GKE. CouchDB is installed with Helm using a semi-official chart: https://github.com/gpii-ops/gpii-infra/tree/master/shared/charts/couchdb

* CouchDB Version used: 2.3.1
* Browser name and version: None
* Operating System and version: None, GKE
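
For reference, the `/_membership` check mentioned in the description can be sketched roughly as follows. This is a minimal illustration, not our actual monitoring code: the function names `unreachable_nodes` and `fetch_membership` are hypothetical, authentication is omitted, and it assumes the endpoint returns the usual `all_nodes` and `cluster_nodes` arrays.

```python
import json
import urllib.request


def unreachable_nodes(membership):
    """Return cluster members the reporting node is not connected to.

    `membership` is the decoded JSON body of GET /_membership.
    `cluster_nodes` lists the configured cluster members, while
    `all_nodes` lists the nodes this node currently knows about.
    """
    connected = set(membership.get("all_nodes", []))
    return sorted(
        node for node in membership.get("cluster_nodes", [])
        if node not in connected
    )


def fetch_membership(base_url, timeout=5.0):
    """GET <base_url>/_membership and decode the JSON body.

    Assumption: no credentials are required; add an Authorization
    header if the node is not running in admin-party mode.
    """
    url = f"{base_url}/_membership"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A node would be flagged when `unreachable_nodes(fetch_membership(base_url))` returns a non-empty list. Note that this only detects lost connectivity, not how far a reconnected node is behind, which is the gap the questions above are about.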
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services