Thanks! This will help our search for where the CPU cycles are going.

- Phil

On Mon, Jun 5, 2017 at 6:05 PM, Adam Kocoloski <[email protected]> wrote:

> The answer to your clarifying question is absolutely yes. The
> “pending_changes” metric refers to the number of committed changes on the
> shard replica emitting the log event that need to be cross-checked on
> another replica. It’s not a measure of writes that need to be executed.
>
> Cheers, Adam
>
> > On Jun 5, 2017, at 4:37 PM, Phil May <[email protected]> wrote:
> >
> > Hi Adam,
> >
> > Thanks for the info!
> >
> > When we run at high write rates, we will start to fall behind, but when
> > we reduce the rate, we eventually catch up.
> >
> > I have a clarification question – can the warning messages we are seeing
> > still occur in a healthy cluster due to the "redundant cross-check"
> > taking long enough that more changes have accumulated that now also need
> > to be cross-checked (even when no actual writes were needed)?
> >
> > We have had some luck modifying sync_concurrency (which is exposed in the
> > .ini file) and batch_size (which we exposed), and that does give us more
> > throughput capacity.
> >
> > Thanks!
> >
> > - Phil
> >
> > On Mon, Jun 5, 2017 at 11:38 AM, Adam Kocoloski <[email protected]> wrote:
> >
> >> Hi Phil,
> >>
> >> Here’s the thing to keep in mind about those warning messages: in a
> >> healthy cluster, the internal replication traffic that generates them is
> >> really just a redundant cross-check. It exists to “heal” a cluster
> >> member that was down during some write operations. When you write data
> >> into a CouchDB cluster the copies are written to all relevant shard
> >> replicas proactively.
> >>
> >> If your cluster’s steady-state write load is causing internal cluster
> >> replication to fall behind permanently, that’s problematic. You should
> >> tune the cluster replication parameters to give it more throughput. If
> >> the replication is only falling behind during some batch data load and
> >> then catches up later it may be a different story. You may want to keep
> >> things configured as-is.
> >>
> >> Does that make sense?
> >>
> >> Cheers, Adam
> >>
> >>> On Jun 4, 2017, at 11:06 PM, Phil May <[email protected]> wrote:
> >>>
> >>> I'm writing to check whether modifying replication batch_count and
> >>> batch_size parameters for cluster replication is a good idea.
> >>>
> >>> Some background – our data platform dev team noticed that under heavy
> >>> write load, cluster replication was falling behind. The following
> >>> warning messages started appearing in the logs, and the pending_changes
> >>> value consistently increased while under load.
> >>>
> >>> [warning] 2017-05-18T20:15:22.320498Z [email protected] <0.316.0>
> >>> -------- mem3_sync shards/a0000000-bfffffff/test.1495137986
> >>> [email protected]
> >>> {pending_changes,474}
> >>>
> >>> What we saw is described in COUCHDB-3421
> >>> <https://issues.apache.org/jira/browse/COUCHDB-3421>. In addition,
> >>> CouchDB appears to be CPU bound while this is occurring, not I/O bound
> >>> as would seem reasonable to expect for replication.
> >>>
> >>> When we looked into this, we discovered in the source two values
> >>> affecting replication, batch_size and batch_count. For cluster
> >>> replication, these values are fixed at 100 and 1 respectively, so we
> >>> made them configurable. We tried various values and it seems increasing
> >>> the batch_size (and, to a lesser extent, batch_count) improves our
> >>> write performance. As a point of reference, with batch_count=50 and
> >>> batch_size=5000 we can handle about double the write throughput with no
> >>> warnings. We are experimenting with other values.
> >>>
> >>> We wanted to know if adjusting these parameters is a sound approach.
> >>>
> >>> Thanks!
> >>>
> >>> - Phil
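
One way to tell "falling behind permanently" apart from "falls behind during a batch load, then catches up" is to track the pending_changes value from the mem3_sync warnings over time. The following is a minimal Python sketch, not part of CouchDB. It assumes each warning sits on a single log line laid out like the one quoted above, and the names parse_pending and backlog_trend are hypothetical helpers introduced here for illustration.

```python
import re
from datetime import datetime

# Assumed single-line layout of the mem3_sync warning quoted in this thread:
# [warning] <timestamp> <node> <pid> -------- mem3_sync <shard> <target> {pending_changes,N}
LOG_RE = re.compile(
    r"\[warning\]\s+(?P<ts>\S+)\s+(?P<node>\S+)\s+<[\d.]+>\s+-+\s+"
    r"mem3_sync\s+(?P<shard>\S+)\s+(?P<target>\S+)\s+"
    r"\{pending_changes,(?P<pending>\d+)\}"
)


def parse_pending(line):
    """Return (timestamp, shard, target_node, pending) or None if no match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
    return ts, m.group("shard"), m.group("target"), int(m.group("pending"))


def backlog_trend(lines):
    """Map (shard, target_node) to (first_pending, last_pending) in log order."""
    trend = {}
    for line in lines:
        parsed = parse_pending(line)
        if parsed is None:
            continue  # skip lines that are not mem3_sync pending_changes warnings
        _, shard, target, pending = parsed
        key = (shard, target)
        first = trend[key][0] if key in trend else pending
        trend[key] = (first, pending)
    return trend
```

A backlog whose last value keeps exceeding its first value across successive log windows under steady-state load suggests the permanent case Adam describes. A backlog that stops being reported once the write rate is reduced matches the catch-up case.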
