nickva commented on issue #1081: Replicator infinite failure loop
URL: https://github.com/apache/couchdb/issues/1081#issuecomment-358012819
 
 
   Hi Avaq,
   
   Thanks for your report.
   
   I noticed that your test script specifies a heartbeat. In 2.x the 
replicator doesn't use heartbeats; it uses timeouts instead:
   
   
https://github.com/apache/couchdb/blob/master/src/couch_replicator/src/couch_replicator_api_wrap.erl#L486
   
   Notice that it uses a timeout for the changes feed and the value of the 
timeout is 1/3 of the `connection_timeout`. By default connection timeout is 
30s so the timeout for the _changes feed ends up being 10s.
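   
   The arithmetic above can be sketched as follows. This is only an illustration of the 1/3 rule described here, not the replicator's actual code; the function name is made up for the example.

```python
def changes_feed_timeout_ms(connection_timeout_ms: int = 30_000) -> int:
    """Return one third of the connection timeout, in milliseconds.

    Hypothetical helper: it mirrors the 1/3 rule described above and is
    not part of CouchDB itself.
    """
    return connection_timeout_ms // 3


# Default 30s connection timeout gives a 10s _changes timeout.
print(changes_feed_timeout_ms())
# A 60s connection timeout would give a 20s _changes timeout.
print(changes_feed_timeout_ms(60_000))
```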
   
   Try re-running the test script with a timeout parameter specified instead 
of a heartbeat. 
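   
   As a rough sketch, a test request that uses `timeout` rather than `heartbeat` could be built like this. The host, database name and filter name are placeholders, since I don't know what your script uses; the 10000 ms value matches the default described above.

```python
from urllib.parse import urlencode


def changes_url(base_url: str, db: str, filter_name: str,
                timeout_ms: int = 10_000) -> str:
    """Build a _changes URL that passes `timeout` and no `heartbeat`.

    All argument values used below are placeholders for illustration.
    """
    query = urlencode({
        "feed": "normal",
        "filter": filter_name,
        "timeout": timeout_ms,  # milliseconds, instead of heartbeat=...
    })
    return f"{base_url}/{db}/_changes?{query}"


# Placeholder host, database and filter names.
url = changes_url("http://localhost:5984", "source_db", "app/by_owner")
print(url)
```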
   
   I tested this a few days ago while investigating a similar issue in 2.1.x 
and noticed that the server responds quickly with `results` and then sends 
periodic newlines, keeping the connection alive. In my case I was looking at a 
continuous changes feed (because the replication was continuous as well). I 
wonder if there is a difference in behavior between a continuous and a normal 
feed with respect to filters. 
   
   Besides the timeout vs heartbeat, and continuous vs normal, a few more 
questions to get a better idea of what's happening:
   
    * To double check, is the replication itself running on a 2.x cluster? What 
are the versions of the targets and source? Are they all 2.x as well?
   
    * Are there any proxies or load balancers involved and do you think they 
could affect the connections?
   
    * How many replication jobs are running? CouchDB 2.x uses a scheduling 
replicator with a default maximum number of jobs set to 500. If there are more 
than 500, some jobs will be stopped and others started periodically. In the 
case of filtered replications with a large source db and a restrictive filter, 
like you have, replications won't checkpoint unless they receive a document 
update via the filter. However, if that takes too long and the job is swapped 
out by the scheduler, it might not have a chance to checkpoint before it is 
stopped. The next time it starts, it will use 0 as the start sequence for the 
changes feed, wait again, not get a document, be stopped again, and so on. In 
this case you can try, for example, increasing max_jobs to a number high enough 
to fit all the replication jobs you have.
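   
   To illustrate the loop described in the last question, here is a toy model. It is not how the scheduler is implemented; the function, its parameters and the numbers are invented for the example. It only shows why a job that never sees a matching document before being stopped keeps restarting from 0.

```python
def simulate_start_sequences(first_match_seq: int, changes_per_run: int,
                             runs: int = 4) -> list:
    """Toy model: return the start sequence used at the start of each run.

    Assumptions (not taken from CouchDB's code): a job scans
    `changes_per_run` changes before the scheduler stops it, and it can
    only checkpoint once it has reached a change that passes the filter.
    """
    checkpoint = 0
    starts = []
    for _ in range(runs):
        starts.append(checkpoint)
        reached = checkpoint + changes_per_run
        if reached >= first_match_seq:
            # A document passed the filter, so the job checkpoints.
            checkpoint = reached
        # Otherwise the job is stopped without a checkpoint and the
        # next run begins from the same sequence again.
    return starts


# The job is swapped out before the first matching change: no progress.
print(simulate_start_sequences(first_match_seq=1000, changes_per_run=300))
# The job runs long enough to reach a matching change: it progresses.
print(simulate_start_sequences(first_match_seq=1000, changes_per_run=1200))
```

   In this model, raising max_jobs corresponds to letting the job run long enough to reach the first matching change.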
   
   
   
