Hello developers,

There are a few types of GitHub issues that I'd like to call out for
more attention than they are currently getting. These issues are
causing serious problems in production today for a growing number
of people whose businesses depend on CouchDB. Please help.

===================================================================

1. Failure to replicate databases with attachments

The biggest of these issues is #745 [1].  This is a situation where
CouchDB is failing to replicate databases with attachments.

Here are some facts that may help isolate the problem:

* One user I'm in touch with claims that these problems were not
  occurring in CouchDB 2.0.0-rc3; they only started seeing these
  problems after upgrading to 2.1.1. Looking at their cluster logs,
  I see odd sporadic bursts of fabric_worker_timeout errors in which a
  specific shard is unavailable at the same time on all 3 nodes in
  their 2.1.1 cluster. I can't explain this. (See section 3 below,
  though.)

* Another user is seeing this in replicating from Cloudant to a local
  CouchDB 2.1.1 installation.

* Initially, we thought this was limited to databases that had large
  (>=64MB) attachments, but recently we have been made aware of some
  databases with no attachments larger than 2MB that are still blocked
  by this issue.

* Increasing the value of [httpd] max_http_request_size beyond the 2.1.1
  default of 67108864 (64MB) has not worked around the problem. (The
  default value of this used to be larger.)

* Technically, #745 is a recurrence of #574 [2], which was when we first
  started seeing failures of attachment replication in our automated
  test suite. In that issue, you can see where Nick Vatamaniuc traced this
  problem to a race condition in how a 413 (request body is too large)
  error result is handled. Sadly, after some attempts to fix this, the
  replication test was disabled...

* While investigating this issue, we discovered #1117 [6], which looks
  like a memory leak but is actually a mochiweb pid leak when
  active_socket accounting fails. This bugfix needs to be backported
  into CouchDB's mochiweb fork (ideally, we should just update our
  mochiweb to the latest release, if possible.) This is possibly
  responsible for the issue Spiegel author @redgeoff identified in
  #1063 [7].

* #1125 [8] is likely an occurrence of this in trying to replicate from
  bigcouch 0.4 to CouchDB 2.1.1.
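
For anyone who wants to repeat the max_http_request_size experiment
mentioned above on their own cluster, here is a minimal Python sketch. It
assumes the 2.x per-node config endpoint
(PUT /_node/{node}/_config/{section}/{key}); the host, node name, and the
4GB value are placeholders picked for illustration, not recommendations.
The helper name is hypothetical, and it only builds the request so it can
be inspected before anything is sent.

```python
import json
import urllib.request


def build_config_request(base_url, node, section, key, value):
    """Build (but do not send) a PUT that sets one config value on one node."""
    url = f"{base_url}/_node/{node}/_config/{section}/{key}"
    # The config API takes the new value as a JSON string in the body.
    body = json.dumps(str(value)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host, node name, and value; adjust for your cluster.
req = build_config_request(
    "http://localhost:5984",
    "couchdb@127.0.0.1",
    "httpd",
    "max_http_request_size",
    4 * 1024 ** 3,
)
# urllib.request.urlopen(req) would send it (admin credentials required).
```

As noted in the list above, raising this value has not worked around #745
so far, so treat this as a way to rule the setting out rather than a fix.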

===================================================================

2. Sporadic 500 errors on writing attachments / replicating

There are a few GitHub issues for this: #1096 [3] and #1093 [4]. I
believe these are related.

In #1096, a user is creating a document on a 6-node, q=16 cluster, then
immediately attempting to add an attachment to it. Sometimes, they
receive a 500 error on this attempt. After waiting ~10s and retrying the
same request, it completes successfully.

Paul Davis suggested that this user make all changes with ?w=3 and
increase the file handle limit for the process; these provided temporary
relief, but did not solve the problem.
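
Until the root cause is found, affected users are effectively retrying by
hand. A minimal Python sketch of that retry-on-500 pattern follows; the
helper name, attempt count, and delays are my own assumptions, and this
only papers over the failure rather than explaining it.

```python
import time
import urllib.error


def retry_on_500(do_request, attempts=5, first_delay=1.0, sleep=time.sleep):
    """Call do_request(); on HTTP 500, wait and retry with doubling delays."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return do_request()
        except urllib.error.HTTPError as err:
            # Only retry server-side 500s; re-raise anything else, and
            # give up once the last attempt has failed.
            if err.code != 500 or attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
```

Here do_request would be a small function that issues the attachment PUT
(for example via urllib.request.urlopen, with ?w=3 appended to the URL as
suggested above) and returns the response.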

In #1093, a user is performing multiple replications at once to a
1-node, q=4 CouchDB 2.1.1 server. They are seeing 500 errors returned
during the replication process's PUTs. With CouchDB 1.6.1 in the same
scenario, they see timeouts (but not 500s) that eventually stall out
the entire replication process.

I worked with this user to try to reproduce the problem. I could not
reproduce it on my machine, but I have seen the issue directly on their
setup; it is definitely real.

===================================================================

3. Mysterious fabric/rexi timeouts / 500s on relatively unloaded systems

This one is a bit of a grab bag, but I may be seeing a trend here, if
only in the fact that these timeouts are obscuring the actual problems
that are taking place. It's also worrying in that we
weren't getting any of these errors reported even just 6 months ago, and
now we are seeing them in multiple places. I wonder if we somehow
reduced various timeouts to lower thresholds and are only now seeing
these problems in the wild?

In #1119 [5], a user occasionally gets a 500 error on GET /{db}. The
logs show a `no_db_file` error for various shards. I
had thought it might be related to disk speed, or to docker, but I don't
have the time to dig into this one further. (Incidentally, this user is
trying to port kazoo off of bigcouch 0.4. kazoo and the 2600Hz people are
long-time CouchDB users; we should do our best to support them in this
effort.)

In #1008 [9] and #1142 [10], different scenarios are leading to a
timeout in `fabric:query_view`. One is attempting to use native Erlang
views; the other is an extension to filtering for the `/{db}/_changes`
endpoint.

In #1163 [11], a user reports a `fabric_rpc_changes` timeout via
rexi_server. They are also seeing a bad `binary_to_term` that I can't
explain. This may be unrelated and just due to corrupt data.

In my own work, after standing up a new 3-node cluster on AWS (with some
VERY large nodes and placed carefully in the same AZ using placement
groups), I saw a rexi timeout on a completely idle cluster:

```
[warning] 2018-01-25T20:34:20.164277Z
couc...@ip-xxx-xx-x-xxx.eu-west-1.compute.internal <0.341.0> --------
mem3_sync shards/00000000-1fffffff/_global_changes.1516889330
couc...@ip-xxx-xx-xx-xx.eu-west-1.compute.internal
{timeout,[{mem3_rpc,rexi_call,2,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,271}]},...
```
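
A side note on reading that trace: the {file,[115,114,...]} term is an
Erlang string printed as a list of character codes. A small Python sketch
to turn it back into text (the helper name is mine, not part of any
CouchDB tooling):

```python
def decode_charlist(codes):
    """Turn an Erlang charlist (a list of code points) into a Python str."""
    return "".join(chr(c) for c in codes)


# Character codes copied from the {file,[...]} term in the log above.
file_codes = [115, 114, 99, 47, 109, 101, 109, 51, 95,
              114, 112, 99, 46, 101, 114, 108]
print(decode_charlist(file_codes))  # src/mem3_rpc.erl
```

So the timeout in the log is reported from line 271 of src/mem3_rpc.erl.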

I didn't log a GH issue for this one...yet...

===================================================================


OK, that's enough for now. Thoughts, comments, updates, ideas are very
welcome.

-Joan

[1]: https://github.com/apache/couchdb/issues/745
[2]: https://github.com/apache/couchdb/issues/574
[3]: https://github.com/apache/couchdb/issues/1096
[4]: https://github.com/apache/couchdb/issues/1093
[5]: https://github.com/apache/couchdb/issues/1119
[6]: https://github.com/apache/couchdb/issues/1117
[7]: https://github.com/apache/couchdb/issues/1063
[8]: https://github.com/apache/couchdb/issues/1125
[9]: https://github.com/apache/couchdb/issues/1008
[10]: https://github.com/apache/couchdb/issues/1142
[11]: https://github.com/apache/couchdb/issues/1163
