[jira] [Updated] (COUCHDB-2240) The replication manager should be smarter

Robert Newson (JIRA) Sat, 17 May 2014 02:33:08 -0700

     [ 
https://issues.apache.org/jira/browse/COUCHDB-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Robert Newson updated COUCHDB-2240:
-----------------------------------

    Issue Type: New Feature  (was: Bug)
       Summary: The replication manager should be smarter  (was: Many 
continuous replications cause DOS)

The original title and issue type really amount to an acknowledgment that a 
server can be overwhelmed by client load, which is true of many things.

I've adapted the ticket to address the real problem: the code that manages 
the _replicator databases insists on running all the jobs simultaneously. The 
number of concurrent jobs should be configurable, and the replication manager 
should cycle through jobs in some fashion to ensure all replications make 
progress.

When I pondered this before, I figured the smart thing to do for any 
continuous:true document in the _replicator database was to run each of them 
repeatedly without the continuous:true flag.
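
As a rough illustration of that idea (not CouchDB code), the sketch below
cycles through _replicator documents, runs at most max_jobs of them per cycle
as one-shot passes with the continuous flag stripped, and puts continuous jobs
back at the end of the queue. The names run_cycle, max_jobs and run_once are
hypothetical, and a real manager would run the passes of a cycle concurrently
rather than one after another.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def run_cycle(queue: Deque[Dict], max_jobs: int,
              run_once: Callable[[Dict], None]) -> List[str]:
    """Run at most max_jobs one-shot passes, then re-queue continuous jobs.

    run_once is a hypothetical callback that performs a single,
    non-continuous replication for the given document.
    """
    started = []
    for _ in range(min(max_jobs, len(queue))):
        doc = queue.popleft()
        # Strip the continuous flag so the pass terminates on its own.
        one_shot = {k: v for k, v in doc.items() if k != "continuous"}
        run_once(one_shot)
        started.append(doc["_id"])
        if doc.get("continuous"):
            # Back of the line, so every job eventually gets a turn.
            queue.append(doc)
    return started
```

With three continuous documents and max_jobs set to 2, successive calls work
through the queue in rotation, so no job is starved even though only two run
per cycle.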

We might also go further and support different priority levels or ToS flags but 
the first version should simply break the 1-for-1 nature of _replicator.

> The replication manager should be smarter
> -----------------------------------------
>
>                 Key: COUCHDB-2240
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2240
>             Project: CouchDB
>          Issue Type: New Feature
>      Security Level: public(Regular issues) 
>            Reporter: Eli Stevens
>
> Currently, I can configure an arbitrary number of replications between 
> localhost DBs (in my case, they are in the _replicator DB with continuous set 
> to true). However, there is a limit beyond which requests to the DB start to 
> fail.  Trying to do another replication fails with the error:
> ServerError: (500, ('checkpoint_commit_failure', "Target database out of 
> sync. Try to increase max_dbs_open at the target's server."))
> Due to COUCHDB-2239, it's not clear what the actual issue is. 
> I also believe that while the DB was in this state, GET requests to documents 
> were also failing, but the machine that has the logs of this has already had 
> its drives wiped. If need be, I can recreate the situation and provide those 
> logs as well.
> I think that instead of there being a single fixed pool of resources that 
> causes errors when exhausted, the system should have a per-task-type pool of 
> resources that results in performance degradation when exhausted. N 
> replication workers with P DB connections, and if that's not enough they 
> start to round-robin; that sort of thing. When a user has too much to 
> replicate, it gets slow instead of failing.
> As it stands now, I have a potentially large number of continuous 
> replications that produce a fixed rate of data to replicate (because there's 
> a fixed application worker pool that writes the data in the first place). We 
> use a DB+replication per batch of data to process, and if we receive a burst 
> of batches, then couchdb starts failing. The current setup means that I'm 
> always going to be playing chicken between burst size and whatever setting 
> limit we're hitting.  That sucks, and isn't acceptable for a production 
> system, so we're going to have to re-architect how we do replication, and 
> basically implement poor-man's continuous by doing one-off replications at 
> various points of our data processing runs.
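
For readers wondering what the one-off approach described in the report might
look like, here is a minimal sketch that builds a single, non-continuous
replication request for CouchDB's /_replicate endpoint. The function name, the
server URL and the database names are assumptions made for illustration; they
do not come from the ticket. The request is built but not sent.

```python
import json
import urllib.request


def one_off_replication(server: str, source: str,
                        target: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /_replicate for one pass."""
    # No "continuous" key: the replication stops once it has caught up.
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    return urllib.request.Request(
        server.rstrip("/") + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A caller could pass the result to urllib.request.urlopen at each checkpoint of
a processing run, for example
one_off_replication("http://localhost:5984", "batch_1", "batch_1_copy").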



--
This message was sent by Atlassian JIRA
(v6.2#6252)
