Hello developers,

There are a few types of GitHub issues that I'd like to call out for more attention than they are currently getting. These issues are causing serious problems in production today for a growing number of people whose businesses depend on CouchDB. Please help.

===================================================================

1. Failure to replicate databases with attachments

The biggest of these issues is #745 [1]. This is a situation where CouchDB is failing to replicate databases with attachments. Here are some facts that may help isolate the problem:

* One user I'm in touch with claims that these problems were not occurring in CouchDB 2.0.0-rc3; they only started seeing them after upgrading to 2.1.1. Looking at their cluster logs, I see odd sporadic bursts of fabric_worker_timeouts, with a specific shard being unavailable at the same time on all 3 nodes in their 2.1.1 cluster. I can't explain this. (See section 3 below, though.)
* Another user is seeing this when replicating from Cloudant to a local CouchDB 2.1.1 installation.
* Initially, we thought this was limited to databases that had large (>=64MB) attachments, but recently we have been made aware of some databases with no attachments larger than 2MB that are still blocked by this issue.
* Increasing the value of [httpd] max_http_request_size beyond the 2.1.1 default of 67108864 (64MB) has not worked around the problem. (The default value of this used to be larger.)
* Technically, #745 is a recurrence of #574 [2], which was when we first started seeing failures of attachment replication in our automated test suite. In that issue, you can see where Nick Vatamane traced this problem to a race condition in how a 413 (request body is too large) error result is handled. Sadly, after some attempts to fix this, the replication test was disabled...
* While investigating this issue, we discovered #1117 [6], which looks like a memory leak but is actually a mochiweb pid leak when active_socket accounting fails. This bugfix needs to be backported into CouchDB's mochiweb fork; ideally, we should just update our mochiweb to the latest release, if possible. This is possibly responsible for the issue Spiegel author @redgeoff identified in #1063 [7].
* #1125 [8] is likely an occurrence of this when trying to replicate from bigcouch 0.4 to CouchDB 2.1.1.

===================================================================

2. Sporadic 500 errors on writing attachments / replicating

There are a few GitHub issues for this: #1096 [3], #1093 [4]. I believe these are related.

In #1096, a user is creating a document on a 6-node, q=16 cluster, then immediately attempting to add an attachment to it. Sometimes, they receive a 500 error on this attempt. After waiting ~10s and retrying the same request, it completes successfully. Paul Davis suggested that this user make all changes with ?w=3 and also increase the file handles available to the process; these provided temporary relief, but did not solve the problem.

In #1093, a user is performing multiple replications at once to a 1-node, q=4 CouchDB 2.1.1 server. They are seeing 500 errors returned during the replication process's PUTs. Under CouchDB 1.6.1, in the same scenario, they see timeouts (but not 500s) that eventually stall out the entire replication process. I worked with this user to try to reproduce the problem. I could not reproduce it on my machine, but I have seen the issue directly on their setup; it is definitely real.

===================================================================

3. Mysterious fabric/rexi timeouts / 500s on relatively unloaded systems

This one is a bit of a grab bag, but I think I'm seeing a trend here, if in nothing else than the fact that these timeouts are obscuring the actual problems that are taking place. It's also worrying in that we weren't getting any of these errors reported even just 6 months ago, and now we are seeing them in multiple places. I wonder if we somehow reduced various timeouts to lower thresholds and are only now seeing these problems in the wild?

In #1119 [5], a user is occasionally unable to GET /{db} without a 500 error. The logs show a `no_db_file` error for various shards.
I had thought it might be related to disk speed, or to docker, but I don't have the time to dig into this one further. (Incidentally, this user is trying to port kazoo off of bigcouch 0.4. kazoo and the 2600 people are long-time CouchDB users; we should do our best to support them in this effort.)

In #1008 [9] and #1142 [10], different scenarios are leading to a timeout in `fabric:query_view`. One is attempting to use native Erlang views; the other is an extension to filtering for the `/{db}/_changes` endpoint.

In #1163 [11], a user reports a `fabric_rpc_changes` timeout via rexi_server. They are also seeing a bad `binary_to_term` that I can't explain. This may be unrelated and just due to corrupt data.

In my own work, after standing up a new 3-node cluster on AWS (with some VERY large nodes and placed carefully in the same AZ using placement groups), I saw a rexi timeout on a completely idle cluster:

```
[warning] 2018-01-25T20:34:20.164277Z couc...@ip-xxx-xx-x-xxx.eu-west-1.compute.internal <0.341.0> -------- mem3_sync shards/00000000-1fffffff/_global_changes.1516889330 couc...@ip-xxx-xx-xx-xx.eu-west-1.compute.internal {timeout,[{mem3_rpc,rexi_call,2,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,271}]},...
```

I didn't log a GH issue for this one...yet...

===================================================================

OK, that's enough for now. Thoughts, comments, updates, ideas are very welcome.

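A side note on reading the mem3_sync trace in section 3: the `{file,[115,114,...]}` entry is an Erlang string, which the logger has printed as a plain list of character codes. Below is a minimal sketch, in Python, of turning that list back into a readable path. The helper name `decode_charlist` is just an illustration and not part of any CouchDB tooling.

```python
# Character codes copied verbatim from the {file, [...]} entry in the trace.
file_field = [115, 114, 99, 47, 109, 101, 109, 51,
              95, 114, 112, 99, 46, 101, 114, 108]


def decode_charlist(codes):
    """Convert an Erlang charlist (a list of code points) to a Python string."""
    return "".join(chr(code) for code in codes)


# Prints the source file named in the stack frame.
print(decode_charlist(file_field))
```

This decodes to `src/mem3_rpc.erl`, so the stack frame in the trace points at line 271 of that file, inside `mem3_rpc:rexi_call/2`.
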
-Joan

[1]: https://github.com/apache/couchdb/issues/745
[2]: https://github.com/apache/couchdb/issues/574
[3]: https://github.com/apache/couchdb/issues/1096
[4]: https://github.com/apache/couchdb/issues/1093
[5]: https://github.com/apache/couchdb/issues/1119
[6]: https://github.com/apache/couchdb/issues/1117
[7]: https://github.com/apache/couchdb/issues/1063
[8]: https://github.com/apache/couchdb/issues/1125
[9]: https://github.com/apache/couchdb/issues/1008
[10]: https://github.com/apache/couchdb/issues/1142
[11]: https://github.com/apache/couchdb/issues/1163