Hi Phil,

Here’s the thing to keep in mind about those warning messages: in a healthy 
cluster, the internal replication traffic that generates them is really just a 
redundant cross-check. It exists to “heal” a cluster member that was down 
during some write operations. When you write data into a CouchDB cluster, the 
copies are written to all relevant shard replicas proactively.

If your cluster’s steady-state write load causes internal cluster replication 
to fall behind permanently, that’s a problem, and you should tune the cluster 
replication parameters to give it more throughput. If replication only falls 
behind during a batch data load and then catches up afterwards, it may be a 
different story, and you may want to keep things configured as-is.
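
One rough way to tell those two cases apart is to watch how the pending_changes 
value in the mem3_sync warnings moves over time. Below is a minimal Python 
sketch, assuming the warning format quoted further down in this thread. The 
function names and the drained_below threshold are hypothetical and only for 
illustration; they are not part of CouchDB.

```python
import re

# Matches the pending_changes tuple in a mem3_sync warning,
# e.g. "{pending_changes,474}".
PENDING_RE = re.compile(r"\{pending_changes,\s*(\d+)\}")


def pending_values(log_lines):
    """Return the pending_changes counts found in the log, in order of appearance."""
    values = []
    for line in log_lines:
        match = PENDING_RE.search(line)
        if match:
            values.append(int(match.group(1)))
    return values


def classify_backlog(values, drained_below=100):
    """Hypothetical heuristic: label the backlog trend over one observation window.

    drained_below is an assumed cutoff, not a CouchDB setting.
    """
    if not values:
        return "no warnings"
    if values[-1] <= drained_below:
        return "caught up"
    if values[-1] >= max(values):
        return "still growing"
    return "draining"
```

For example, feeding the result of pending_values() for a window of log lines 
into classify_backlog() gives "still growing" if the latest count is also the 
highest one seen. If that keeps happening under normal load, it points at the 
steady-state case rather than a temporary batch load.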

Does that make sense?

Cheers, Adam

> On Jun 4, 2017, at 11:06 PM, Phil May <[email protected]> wrote:
> 
> I'm writing to check whether modifying the replication batch_count and
> batch_size parameters for cluster replication is a good idea.
> 
> Some background – our data platform dev team noticed that under heavy write
> load, cluster replication was falling behind. The following warning
> messages started appearing in the logs, and the pending_changes value
> consistently increased while under load.
> 
> [warning] 2017-05-18T20:15:22.320498Z [email protected] <0.316.0>
> -------- mem3_sync shards/a0000000-bfffffff/test.1495137986
> [email protected]
> {pending_changes,474}
> 
> What we saw is described in COUCHDB-3421
> <https://issues.apache.org/jira/browse/COUCHDB-3421>. In addition, CouchDB
> appears to be CPU bound while this is occurring, not I/O bound as would
> seem reasonable to expect for replication.
> 
> When we looked into this, we discovered in the source two values affecting
> replication, batch_size and batch_count. For cluster replication, these
> values are fixed at 100 and 1 respectively, so we made them configurable.
> We tried various values, and it seems that increasing batch_size (and, to a
> lesser extent, batch_count) improves our write performance. As a point of
> reference, with batch_count=50 and batch_size=5000 we can handle about
> double the write throughput with no warnings. We are experimenting with
> other values.
> 
> We wanted to know if adjusting these parameters is a sound approach.
> 
> Thanks!
> 
> - Phil
