[ https://issues.apache.org/jira/browse/COUCHDB-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977602#comment-15977602 ]
ASF subversion and git services commented on COUCHDB-3324:
----------------------------------------------------------

Commit 87f4ca0454909466b068d9c10c11467a3e85d3cb in couchdb's branch refs/heads/63012-scheduler from [~vatamane]
[ https://gitbox.apache.org/repos/asf?p=couchdb.git;h=87f4ca0 ]

Implement replication document processor

The document processor listens for `_replicator` db document updates, parses those changes, then tries to add replication jobs to the scheduler.

Listening for changes happens in the `couch_multidb_changes` module. That module is generic and is set up by `couch_replicator_db_changes` to listen to shards with a `_replicator` suffix. Updates are then passed to the document processor's `process_change/2` function.

Calculating a document's replication ID (which can involve fetching filter code from the source DB) and adding the job to the scheduler are done in a separate worker process: `couch_replicator_doc_processor_worker`.

Previously, the couch replicator manager did most of this work. There are a few improvements over the previous implementation:

 * Invalid (malformed) replication documents are failed immediately and are not continuously retried.

 * Replication manager message queue backups are unfortunately a common issue in production, because processing document updates is a serial (blocking) operation. Most of that blocking code was moved to separate worker processes.

 * Failing filter fetches have an exponential backoff.

 * Replication documents no longer have to be deleted and then re-added in order to update the replication. On update, the document processor compares the new and previous replication-related document fields and updates the replication job if those fields changed. Users can freely update unrelated (custom) fields in their replication docs.

 * For filtered replications using custom functions, the document processor periodically checks whether the filter code on the source has changed. Filter code contents are factored into the replication ID calculation.
   If the filter code changes, the replication ID changes as well.

Jira: COUCHDB-3324


> Scheduling Replicator
> ---------------------
>
>                 Key: COUCHDB-3324
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-3324
>             Project: CouchDB
>          Issue Type: New Feature
>            Reporter: Nick Vatamaniuc
>
> Improve CouchDB replicator
> * Allow running a large number of replication jobs
> * Improve API with a focus on ease of use and performance. Avoid updating
> replication documents with transient state updates. Instead create a proper
> API for querying replication states. At the same time provide a compatibility
> mode to let users keep existing behavior (of getting updates in documents).
> * Improve network resource usage and performance. Multiple connections to the
> same cluster could share socket connections.
> * Handle rate limiting on target and source HTTP endpoints. Let replication
> requests auto-discover rate limit capacity based on a proven algorithm such as
> an Additive Increase / Multiplicative Decrease feedback control loop.
> * Improve performance by avoiding repeatedly retrying failing replication
> jobs. Instead use exponential backoff.
> * Improve recovery from long (but temporary) network failures. Currently, if
> replication jobs fail to start 10 times in a row they will not be retried
> anymore. This is not always desirable. In case of a long enough DNS (or other
> network) failure, replication jobs will effectively stop until they are
> manually restarted.
> * Better handling of filtered replications: failing to fetch filters could
> block the couch replicator manager and lead to message queue backups and memory
> exhaustion. Also, when replication filter code changes, update the replication
> accordingly (the replication job ID should change in that case).
> * Provide better metrics to introspect replicator behavior.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
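Both the commit ("Failing filter fetches have an exponential backoff") and the issue ("Instead use exponential backoff") rely on exponential backoff. The Python sketch below is illustrative, not the Erlang code in CouchDB. It computes a capped delay before retry number n as base * 2^n, with optional "full jitter". The function name `backoff_delay`, the 1 second base, the 600 second cap, and the jitter scheme are all assumptions, since the commit does not state the actual parameters.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 600.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper: the delay grows as base * 2**attempt and is capped
    at `cap`. The exponent is clamped so huge attempt counts cannot overflow
    a float. If `rng` is given, "full jitter" is applied: the delay is drawn
    uniformly from [0, delay] so many failing documents do not retry in
    lockstep.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    exponent = min(attempt, 60)
    delay = min(cap, base * 2.0 ** exponent)
    if rng is not None:
        delay = rng.uniform(0.0, delay)
    return delay


if __name__ == "__main__":
    # Deterministic schedule without jitter: 1, 2, 4, 8, 16 seconds ...
    print([backoff_delay(n) for n in range(5)])
    # ... levelling off at the cap for large attempt counts.
    print(backoff_delay(20))
```

With these defaults, retries wait 1, 2, 4, 8, 16 seconds and so on, levelling off at 600 seconds. Unlike a fixed "give up after 10 failures" rule, a capped backoff keeps retrying through a long DNS or network outage while putting little load on the failing endpoint.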
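The commit says the document processor compares the new and previous replication-related fields and updates the job only if those changed, so custom fields can be edited freely. The Python sketch below illustrates that rule. `REPLICATION_FIELDS`, `replication_view`, and `needs_job_update` are hypothetical names, and the field list is an assumption chosen for illustration. It is not the exact set of fields CouchDB compares.

```python
from typing import Any, Dict, Mapping

# Hypothetical set of fields assumed to affect the replication job; the real
# list used by the CouchDB document processor may differ.
REPLICATION_FIELDS = frozenset({
    "source", "target", "filter", "query_params",
    "doc_ids", "selector", "continuous", "create_target",
})


def replication_view(doc: Mapping[str, Any]) -> Dict[str, Any]:
    """Project a replication document onto its replication-related fields."""
    return {k: v for k, v in doc.items() if k in REPLICATION_FIELDS}


def needs_job_update(old_doc: Mapping[str, Any],
                     new_doc: Mapping[str, Any]) -> bool:
    """True only if a replication-related field was added, removed or changed.

    Everything else (custom fields, `_rev`, etc.) is ignored, so editing it
    does not restart the replication job.
    """
    return replication_view(old_doc) != replication_view(new_doc)


if __name__ == "__main__":
    old = {"_id": "rep1", "_rev": "1-a",
           "source": "http://a/db", "target": "http://b/db"}
    custom_edit = dict(old, _rev="2-b", note="nightly backup")
    target_edit = dict(old, _rev="2-c", target="http://c/db")
    print(needs_job_update(old, custom_edit))  # False: only custom fields changed
    print(needs_job_update(old, target_edit))  # True: the target changed
```

Comparing a projection of the document, rather than the whole document, is what removes the old "delete, then re-add" requirement: any edit is accepted, and only edits that alter the projection touch the scheduler.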
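The commit notes that filter code contents are factored into the replication ID, so changing the filter changes the ID. The Python sketch below shows why that holds for any ID derived by hashing the inputs. It uses SHA-256 from `hashlib` over canonically serialized parameters plus the filter's source text. CouchDB's actual ID calculation uses its own inputs, hash, and encoding. Treat `replication_id`, its inputs, and the `app/by_type` filter name as hypothetical.

```python
import hashlib
import json
from typing import Any, Mapping, Optional


def replication_id(params: Mapping[str, Any],
                   filter_code: Optional[str]) -> str:
    """Derive a deterministic ID from replication parameters and filter code.

    Hypothetical sketch: parameters are serialized with sorted keys so key
    order does not affect the ID, and the filter function body fetched from
    the source is mixed in, so editing the filter yields a different ID.
    """
    h = hashlib.sha256()
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    h.update(canonical.encode("utf-8"))
    if filter_code is not None:
        # Separator keeps the parameters and filter code unambiguous.
        h.update(b"\x00filter\x00")
        h.update(filter_code.encode("utf-8"))
    return h.hexdigest()


if __name__ == "__main__":
    params = {"source": "http://a/db", "target": "http://b/db",
              "filter": "app/by_type"}
    v1 = "function(doc, req) { return doc.type === 'a'; }"
    v2 = "function(doc, req) { return doc.type === 'b'; }"
    # Same parameters, edited filter body: the IDs differ.
    print(replication_id(params, v1) != replication_id(params, v2))
```

Because the ID is a pure function of its inputs, periodically re-fetching the filter and recomputing the ID is enough to detect a change: a different ID means the job must be replaced.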
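The issue proposes letting replication requests discover rate-limit capacity with an Additive Increase / Multiplicative Decrease (AIMD) feedback loop. The Python sketch below shows a minimal version. It assumes the controlled quantity is an allowed request rate, that a rate-limit response from the endpoint (for example HTTP 429) is the "decrease" signal, and that every successful request earns one additive step. `AimdRateLimiter` and all constants are hypothetical and are not CouchDB settings.

```python
from dataclasses import dataclass


@dataclass
class AimdRateLimiter:
    """Additive Increase / Multiplicative Decrease control of a request rate.

    Hypothetical sketch; all default values are illustrative assumptions.
    """
    rate: float = 10.0             # current allowed requests per second
    increase: float = 1.0          # added after each successful request
    decrease_factor: float = 0.5   # multiplier applied on a rate-limit signal
    min_rate: float = 1.0
    max_rate: float = 1000.0

    def on_success(self) -> None:
        # Additive increase: probe slowly for more capacity.
        self.rate = min(self.max_rate, self.rate + self.increase)

    def on_rate_limited(self) -> None:
        # Multiplicative decrease: back off quickly when the endpoint
        # signals overload (for example an HTTP 429 response).
        self.rate = max(self.min_rate, self.rate * self.decrease_factor)

    def interval(self) -> float:
        """Seconds to wait between requests at the current rate."""
        return 1.0 / self.rate


if __name__ == "__main__":
    limiter = AimdRateLimiter()
    for _ in range(10):
        limiter.on_success()
    print(limiter.rate)        # grew linearly from 10 to 20
    limiter.on_rate_limited()
    print(limiter.rate)        # halved back to 10
```

The asymmetry is the point of AIMD. Slow linear growth finds the endpoint's capacity without overshooting by much, and fast multiplicative cuts shed load immediately when the endpoint pushes back. This is the same dynamic that lets competing TCP flows converge on a fair share.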