In the interest of making CouchDB 3.0 "the best CouchDB Classic possible",
I'd like to discuss whether to accept a donation from Cloudant of the
"Weather Report" diagnostic tool. This tool and its dependencies are OTP
applications. It is typically run from an escript that connects to a
running cluster, gathers numerous diagnostics, and emits warnings and
errors when it finds something to complain about. It was originally
ported from a fork of Riaknostic (the automated diagnostic tools for Riak)
[1] by Mike Wallace.

The checks it makes are represented by the following modules:

weatherreport_check_custodian.erl
weatherreport_check_disk.erl
weatherreport_check_internal_replication.erl
weatherreport_check_ioq.erl
weatherreport_check_mem3_sync.erl
weatherreport_check_membership.erl
weatherreport_check_memory_use.erl
weatherreport_check_message_queues.erl
weatherreport_check_node_stats.erl
weatherreport_check_nodes_connected.erl
weatherreport_check_process_calls.erl
weatherreport_check_process_memory.erl
weatherreport_check_safe_to_rebuild.erl
weatherreport_check_search.erl
weatherreport_check_tcp_queues.erl

While some of these checks are self-contained, check_node_stats,
check_process_calls, check_process_memory, and check_message_queues all use
recon [2] under the hood. Similarly, check_custodian
and check_safe_to_rebuild use another Cloudant OTP application called
Custodian, which periodically scans the "dbs" database to track the
location of every shard of every database and can integrate with sensu [3]
to ensure that operators are aware of any shard that is under-replicated.
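
To make the under-replication idea concrete, here is a minimal sketch in
Python. It is not Custodian's actual implementation (Custodian is an Erlang
application); the shard map layout, the function name under_replicated, and
the default of n = 3 copies are assumptions made purely for illustration.

```python
from typing import Dict, List, Set


def under_replicated(
    shard_map: Dict[str, List[str]],
    live_nodes: Set[str],
    n: int = 3,
) -> Dict[str, int]:
    """Return {shard_range: live_copy_count} for shards with fewer than n live copies.

    shard_map is a hypothetical stand-in for what a scan of the "dbs"
    database might produce: shard range -> nodes holding a copy.
    """
    report = {}
    for shard_range, nodes in shard_map.items():
        # Count each node at most once, and only if it is currently reachable.
        live = sum(1 for node in set(nodes) if node in live_nodes)
        if live < n:
            report[shard_range] = live
    return report


# Hypothetical two-shard database on a three-node dev cluster.
shard_map = {
    "00000000-7fffffff": ["node1@127.0.0.1", "node2@127.0.0.1", "node3@127.0.0.1"],
    "80000000-ffffffff": ["node1@127.0.0.1", "node2@127.0.0.1"],
}
live_nodes = {"node1@127.0.0.1", "node2@127.0.0.1", "node3@127.0.0.1"}

print(under_replicated(shard_map, live_nodes))
```

With this sample data only the second shard range is reported, since it has
two live copies instead of three.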

I have created a POC branch [4] that adds Weather Report, Custodian, and
Recon to CouchDB. When I ran it in my dev environment (without search
running), I got the following diagnostic output:

$ ./weatherreport --etc ~/proj/couchdb/dev/lib/node1/etc/ -a
['node1@127.0.0.1'] [error] Local search node at 'clouseau@127.0.0.1' not
responding: pang
['node2@127.0.0.1'] [error] Local search node at 'clouseau@127.0.0.1' not
responding: pang
['node3@127.0.0.1'] [error] Local search node at 'clouseau@127.0.0.1' not
responding: pang
['node1@127.0.0.1'] [notice] Data directory
/Users/jay/proj/couchdb/dev/lib/node1/data is not mounted with 'noatime'.
Please remount its disk with the 'noatime' flag to improve performance.
['node2@127.0.0.1'] [notice] Data directory
/Users/jay/proj/couchdb/dev/lib/node2/data is not mounted with 'noatime'.
Please remount its disk with the 'noatime' flag to improve performance.
['node3@127.0.0.1'] [notice] Data directory
/Users/jay/proj/couchdb/dev/lib/node3/data is not mounted with 'noatime'.
Please remount its disk with the 'noatime' flag to improve performance.
returned 1
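
For anyone curious what a 'noatime' check could look like, here is a rough
Python sketch. It assumes a Linux-style mount table (the /proc/mounts
format: device, mount point, filesystem type, options) passed in as text,
and it is not necessarily how weatherreport_check_disk does it. The function
names and the sample mount table are hypothetical.

```python
from typing import List, Optional


def mount_options_for(path: str, mounts_text: str) -> Optional[List[str]]:
    """Return the options of the most specific mount point containing path."""
    best_point, best_opts = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, opts = fields[1], fields[3].split(",")
        # Match on whole path components so "/data" does not match "/database".
        prefix = mount_point.rstrip("/") + "/"
        contains = path == mount_point or path.startswith(prefix)
        if contains and len(mount_point) > len(best_point):
            best_point, best_opts = mount_point, opts
    return best_opts


def has_noatime(path: str, mounts_text: str) -> bool:
    """True if the filesystem holding path is mounted with 'noatime'."""
    opts = mount_options_for(path, mounts_text)
    return opts is not None and "noatime" in opts


# Hypothetical mount table in /proc/mounts format.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /srv/couchdb ext4 rw,noatime 0 0
"""

for data_dir in ("/srv/couchdb/data", "/var/lib/couchdb"):
    if not has_noatime(data_dir, SAMPLE_MOUNTS):
        print(f"[notice] Data directory {data_dir} is not mounted with 'noatime'.")
```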

There is still a little cleanup to be done before these tools would be
ready to donate, but it seems that overall they already integrate tolerably
well with CouchDB.

As far as licenses go, Riaknostic is Apache 2.0. Recon is not [5], but it
seems like it should be ok to include in CouchDB based on my possibly naive
reading. Currently Custodian has no license (just Copyright 2013 Cloudant),
but I assume it would get an Apache license, just like all other donated
code.

Would this be a welcome addition to CouchDB? Please let me know what you
think.

Thanks,
Jay

[1] https://github.com/basho/riaknostic
[2] http://ferd.github.io/recon/
[3] https://sensu.io
[4] https://github.com/apache/couchdb/compare/master...cloudant:weatherreport?expand=1
[5] https://github.com/ferd/recon/blob/master/LICENSE
