Hi Adam,

Thanks for the info!

When we run at high write rates, internal replication starts to fall behind, but when we reduce the rate it eventually catches up. I have a clarification question: can the warning messages we are seeing still occur in a healthy cluster simply because the "redundant cross-check" takes long enough that more changes accumulate and also need to be cross-checked, even when no actual writes were needed?

We have had some luck modifying sync_concurrency (which is exposed in the .ini file) and batch_size (which we exposed ourselves), and that does give us more throughput capacity.
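For reference, here is a rough sketch of the sort of tuning we are experimenting with. The batch_size and batch_count keys are ones we exposed locally, so those names (and the [mem3] section placement) are just our own choices, and the values are only illustrative:

    [mem3]
    ; number of internal replication jobs allowed to run in parallel
    sync_concurrency = 20
    ; the two keys below are ones we exposed ourselves; names are local choices
    batch_size = 5000   ; changes processed per internal replication batch
    batch_count = 50    ; batches processed per replication job run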
Thanks!

- Phil

On Mon, Jun 5, 2017 at 11:38 AM, Adam Kocoloski <[email protected]> wrote:

> Hi Phil,
>
> Here’s the thing to keep in mind about those warning messages: in a
> healthy cluster, the internal replication traffic that generates them is
> really just a redundant cross-check. It exists to “heal” a cluster member
> that was down during some write operations. When you write data into a
> CouchDB cluster, the copies are written to all relevant shard replicas
> proactively.
>
> If your cluster’s steady-state write load is causing internal cluster
> replication to fall behind permanently, that’s problematic, and you should
> tune the cluster replication parameters to give it more throughput. If
> replication only falls behind during a batch data load and then catches up
> later, it may be a different story and you may want to keep things
> configured as-is.
>
> Does that make sense?
>
> Cheers, Adam
>
>
> > On Jun 4, 2017, at 11:06 PM, Phil May <[email protected]> wrote:
> >
> > I'm writing to check whether modifying the batch_count and batch_size
> > parameters for cluster replication is a good idea.
> >
> > Some background: our data platform dev team noticed that under heavy
> > write load, cluster replication was falling behind. The following warning
> > messages started appearing in the logs, and the pending_changes value
> > consistently increased while under load.
> >
> > [warning] 2017-05-18T20:15:22.320498Z [email protected] <0.316.0>
> > -------- mem3_sync shards/a0000000-bfffffff/test.1495137986
> > [email protected]
> > {pending_changes,474}
> >
> > What we saw is described in COUCHDB-3421
> > <https://issues.apache.org/jira/browse/COUCHDB-3421>. In addition,
> > CouchDB appears to be CPU bound while this is occurring, not I/O bound
> > as one would expect for replication.
> >
> > When we looked into this, we found two values in the source that affect
> > replication: batch_size and batch_count. For cluster replication, these
> > values are fixed at 100 and 1 respectively, so we made them configurable.
> > We tried various values, and it seems that increasing batch_size (and,
> > to a lesser extent, batch_count) improves our write performance. As a
> > point of reference, with batch_count=50 and batch_size=5000 we can handle
> > about double the write throughput with no warnings. We are experimenting
> > with other values.
> >
> > We wanted to know if adjusting these parameters is a sound approach.
> >
> > Thanks!
> >
> > - Phil
> >
