Thanks, Alexander, for starting this thread. Your document seems quite illustrative and a great starting point :)
Here is the picture of our system from a monitoring point of view. We have two couch servers in multi-master mode.

Replication gave us trouble: the replication task on node#1 would crash if the remote couch service on node#2 was stopped (say, for maintenance). This may seem like "of course, what else do you expect?" behaviour, but I really wish couchdb would "keep retrying until success" and resume replication, come what may. Since couchdb does not do this, we wrote a simple replication watcher (Node.js) whose primary function is to create a replication task whenever a task that _should_ be running is not running. To be fair, "monitoring" of replication tasks is available through _active_tasks, but what I'm really looking for is the ability to create 'persisted continuous replication tasks' without then having to worry about monitoring the monitor (if it's under an OTP "supervisor" that restarts it, that's an entirely different matter).

The other issue we kept running into is that some GETs on views would hang. Observed side effects: lots of couchjs processes, _active_tasks shows no compaction or indexer task, request_time in _stats goes up, and we don't know what is going on. The surprising fact was that a view's performance got worse as the number of views in its design document grew (i.e., a view coming from a design doc with lots of other view functions would perform poorly, with frequent freezes). We 'worked around' this by moving critical views into their own design docs. (We use couchdb 1.5.0 with some custom, non-interfering, sequential hacks, e.g. to persist UserCtx and time-of-modification into the JSON document and to always snapshot every change, i.e. 2 sequential writes per intended doc write.)

Anyway, here is what is on our monitoring bucket list that isn't readily available in _stats, so we measure these in indirect ways:

1. number of open tcp sockets to couch.
2. cpu and memory usage of (each) couchjs process.
3. cpu and memory usage of the beam.smp process.
4. grep couch.log for crash-like patterns (essentially, internal errors of the couch engine, such as replication task crashes).

Here are some specific areas where _only_ couch can provide insight (i.e., completely internal stuff):

1. fragmentation levels of each DB and each View (TIL from OP: disk_size and data_size can be used to figure this out, but what about each View?).
2. average/min/max doc size.
3. hits per (each) view.
4. emits per (each) view (because 'emit' causes a disk write).
5. size of each index/view (number of nodes in the tree, used size (individual index file size?), etc.).
6. count of 'queued' (inbox) messages, categorized by each 'functional kind' of erlang process (i.e., kind == index-update, doc-update, doc-read, mochiweb, etc.).

Perhaps some of these metrics cannot easily be tracked as in-memory counters in some situations? I wonder: what if there was a way to listen for significant events happening inside couch? Couch would not even have to keep track of everything as in-memory counters. Instead, some of these could be exposed as events that can be plumbed into a dedicated event stream processor (like Riemann.io) to do whatever monitoring/alerting we may want to do.

Regards,
-Suraj

On Thu, Apr 24, 2014 at 2:46 PM, Alexander Shorin wrote:
> Hi everyone again,
>
> Actually, I have another one thing to share with you for today: it's
> the post-guidance I wrote about monitoring CouchDB:
>
> http://gws.github.io/munin-plugin-couchdb/guide-to-couchdb-monitoring.html
>
> While it located in the same repository as plugin for specific
> monitoring system, it's completely project neutral (with a bit
> exception at the end) and aims to cover all the possibilities for
> monitoring CouchDB server state. Just using only /_stats isn't enough.
>
> If you prefer Zabiix or Nagios or something else you'll also may found
> some interesting bits there.
>
> As for discussion topic I'd like to ask everyone about how you monitor
> your CouchDB in production? Which metrics are important for your and
> which ones you feel missed? Experience sharing and cool stories about
> how monitoring helps you to keep CouchDB good and well is welcome!
>
> P.S. English isn't my native language, so please if you'd noticed any
> misspelling or sentences with incorrect syntax, please don't shy to
> send me private email with corrections. Thanks!
>
> --
> ,,,^..^,,,

--
An Onion is the Onion skin and the Onion under the skin until the Onion Skin without any Onion underneath.
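
P.S. For anyone who wants to try the same approach, here is a rough sketch of the two calculations mentioned in my message: deciding which replications need to be re-created from the output of _active_tasks, and estimating fragmentation from disk_size and data_size. It is in Python rather than the Node.js of our actual watcher, and it is not our production code. The function names, the shape of the `expected` list, and the idea of matching tasks by their source/target strings (ignoring a trailing slash) are my own assumptions for illustration; couch may report the source and target in a slightly different form than the one you submitted (for example with credentials masked), so the matching would need adapting.

```python
def _norm(url):
    # Assumption: treat "http://host/db" and "http://host/db/" as the same endpoint.
    return str(url).rstrip("/")


def missing_replications(expected, active_tasks):
    """Return the expected replications that have no matching running task.

    expected      -- list of dicts with "source" and "target" keys (hypothetical shape)
    active_tasks  -- the parsed JSON list returned by GET /_active_tasks
    """
    running = {
        (_norm(task.get("source", "")), _norm(task.get("target", "")))
        for task in active_tasks
        if task.get("type") == "replication"
    }
    return [
        rep for rep in expected
        if (_norm(rep["source"]), _norm(rep["target"])) not in running
    ]


def fragmentation_percent(disk_size, data_size):
    """Share of the file not occupied by live data, as a percentage."""
    if disk_size <= 0:
        return 0.0
    return 100.0 * (disk_size - data_size) / disk_size
```

A watcher loop would fetch _active_tasks periodically, call missing_replications(), and POST each returned entry to /_replicate with "continuous": true. That still leaves the "who monitors the monitor" problem, of course.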
