Thanks! This will help our search for where the CPU cycles are going.

- Phil

On Mon, Jun 5, 2017 at 6:05 PM, Adam Kocoloski <[email protected]> wrote:

> The answer to your clarifying question is absolutely yes. The
> “pending_changes” metric refers to the number of committed changes on the
> shard replica emitting the log event that need to be cross-checked on
> another replica. It’s not a measure of writes that need to be executed.
>
> Cheers, Adam
>
> > On Jun 5, 2017, at 4:37 PM, Phil May <[email protected]> wrote:
> >
> > Hi Adam,
> >
> > Thanks for the info!
> >
> > When we run at high write rates, we will start to fall behind, but when
> > we reduce the rate, we eventually catch up.
> > I have a clarifying question – can the warning messages we are seeing
> > still occur in a healthy cluster because the "redundant cross-check"
> > takes long enough that more changes accumulate that also need to be
> > cross-checked (even when no actual writes were needed)?
> >
> > We have had some luck modifying sync_concurrency (which is exposed in the
> > .ini file) and batch_size (which we exposed), and that does give us more
> > throughput capacity.
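> >
> > A minimal sketch of what such a configuration might look like. The
> > section and key names below are assumptions for illustration:
> > sync_concurrency is the knob exposed in the stock .ini, while the batch
> > keys only exist in a build patched as we described.
> >
> > ```ini
> > ; hypothetical internal-replication settings (names are assumptions)
> > [mem3]
> > ; number of concurrent shard sync jobs (exposed in stock CouchDB)
> > sync_concurrency = 10
> > ; batch tuning knobs – only present in our patched build
> > batch_size = 5000
> > batch_count = 50
> > ```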
> >
> > Thanks!
> >
> > - Phil
> >
> >
> > On Mon, Jun 5, 2017 at 11:38 AM, Adam Kocoloski <[email protected]>
> > wrote:
> >
> >> Hi Phil,
> >>
> >> Here’s the thing to keep in mind about those warning messages: in a
> >> healthy cluster, the internal replication traffic that generates them is
> >> really just a redundant cross-check. It exists to “heal” a cluster
> >> member that was down during some write operations. When you write data
> >> into a CouchDB cluster, the copies are written to all relevant shard
> >> replicas proactively.
> >>
> >> If your cluster’s steady-state write load is causing internal cluster
> >> replication to fall behind permanently, that’s problematic. You should
> >> tune the cluster replication parameters to give it more throughput. If
> >> the replication is only falling behind during some batch data load and
> >> then catches up later, it may be a different story. You may want to keep
> >> things configured as-is.
> >>
> >> Does that make sense?
> >>
> >> Cheers, Adam
> >>
> >>> On Jun 4, 2017, at 11:06 PM, Phil May <[email protected]> wrote:
> >>>
> >>> I'm writing to check whether modifying the replication batch_count and
> >>> batch_size parameters for cluster replication is a good idea.
> >>>
> >>> Some background – our data platform dev team noticed that under heavy
> >>> write load, cluster replication was falling behind. The following
> >>> warning messages started appearing in the logs, and the pending_changes
> >>> value consistently increased while under load.
> >>>
> >>> [warning] 2017-05-18T20:15:22.320498Z [email protected] <0.316.0>
> >>> -------- mem3_sync shards/a0000000-bfffffff/test.1495137986
> >>> [email protected]
> >>> {pending_changes,474}
> >>>
> >>> What we saw is described in COUCHDB-3421
> >>> <https://issues.apache.org/jira/browse/COUCHDB-3421>. In addition,
> >>> CouchDB appears to be CPU bound while this is occurring, not I/O bound
> >>> as one would expect for replication.
> >>>
> >>> When we looked into this, we discovered two values in the source that
> >>> affect replication: batch_size and batch_count. For cluster
> >>> replication, these values are fixed at 100 and 1 respectively, so we
> >>> made them configurable. We tried various values, and it seems that
> >>> increasing batch_size (and, to a lesser extent, batch_count) improves
> >>> our write performance. As a point of reference, with batch_count=50 and
> >>> batch_size=5000 we can handle about double the write throughput with no
> >>> warnings. We are experimenting with other values.
> >>>
> >>> We wanted to know if adjusting these parameters is a sound approach.
> >>>
> >>> Thanks!
> >>>
> >>> - Phil
> >>
> >>
>
>
