[GitHub] nickva commented on issue #1081: Replicator infinite failure loop

GitBox Tue, 16 Jan 2018 08:08:21 -0800

nickva commented on issue #1081: Replicator infinite failure loop
URL: https://github.com/apache/couchdb/issues/1081#issuecomment-358012819

Hi Avaq,

Thanks for your report.

I noticed that the test behavior script specifies a heartbeat. In 2.x the
replicator doesn't use heartbeats; it uses timeouts instead:

https://github.com/apache/couchdb/blob/master/src/couch_replicator/src/couch_replicator_api_wrap.erl#L486

Notice that it uses a timeout for the changes feed and the value of the
timeout is 1/3 of the `connection_timeout`. By default connection timeout is
30s so the timeout for the _changes feed ends up being 10s.

Try re-running the test script with a timeout parameter specified instead of
a heartbeat.
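
As a rough sketch of what that request could look like, the snippet below
derives the changes feed timeout from `connection_timeout` (one third, as
described above) and builds a `_changes` URL that passes `timeout` rather than
`heartbeat`. The host, database name, and filter name are placeholders I made
up for illustration, and the feed type is just a parameter, so adjust them to
match your test script.

```python
from typing import Optional
from urllib.parse import urlencode

# Default connection_timeout mentioned above (30s), in milliseconds.
CONNECTION_TIMEOUT_MS = 30_000


def changes_timeout_ms(connection_timeout_ms: int) -> int:
    """Return one third of the connection timeout, as described above."""
    return connection_timeout_ms // 3


def changes_url(base_url: str, db: str, timeout_ms: int,
                feed: str = "continuous",
                filter_name: Optional[str] = None) -> str:
    """Build a _changes URL that uses `timeout` and no `heartbeat`."""
    params = {"feed": feed, "timeout": timeout_ms}
    if filter_name:
        params["filter"] = filter_name
    return f"{base_url}/{db}/_changes?{urlencode(params)}"


timeout_ms = changes_timeout_ms(CONNECTION_TIMEOUT_MS)
# Hypothetical host, database, and filter names.
url = changes_url("http://localhost:5984", "source_db", timeout_ms,
                  filter_name="app/by_type")
print(timeout_ms)
print(url)
```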

I tested this a few days ago while investigating a similar issue in 2.1.x and
noticed that the server responds quickly with `results`, and that periodic
newlines are sent, keeping the connection alive. In my case I was also looking
at a continuous changes feed (because the replication was a continuous one as
well). I wonder if there is a difference in behavior between a continuous and a
normal feed with respect to filters.

Besides the timeout vs heartbeat, and continuous vs normal, a few more
questions to get a better idea of what's happening:

* To double check, is the replication itself running on a 2.x cluster? What
are the versions of the targets and source? Are they all 2.x as well?

* Are there any proxies or load balancers involved and do you think they
could affect the connections?

* How many replication jobs are running? CouchDB 2.x uses a scheduling
replicator with a default maximum number of jobs set to 500. If there are more
than 500, some jobs will be stopped and others started periodically. In the
case of filtered replications with a large source db and a restrictive filter,
like you have, replications won't checkpoint unless they receive a document
update via the filter. If that takes too long and the job is swapped out by the
scheduler, it is stopped before it has had a chance to checkpoint. The next
time it starts, it will use 0 as the changes feed start sequence, wait again,
not get a document, be stopped again, and so on. In this case you can try, for
example, increasing `max_jobs` to a number high enough to fit all the
replication jobs you have.
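
To illustrate the loop described in the last point, here is a toy model. It is
not CouchDB's scheduler: the function name, the idea of a fixed number of
changes scanned per scheduling slot, and the numbers are all assumptions made
for illustration. A job only advances its checkpoint when a change passes the
filter; if it is stopped before that happens, it starts from 0 again.

```python
from typing import Iterable, List, Tuple


def simulate(total_changes: int, matching_seqs: Iterable[int],
             changes_per_slot: int, slots: int) -> List[Tuple[int, int, int]]:
    """Toy model of a filtered replication job being stopped and restarted.

    Each slot the job scans up to `changes_per_slot` changes, starting from
    its last checkpoint. It checkpoints only when it sees a change that
    passes the filter. Returns (start, end, checkpoint) for every slot.
    """
    matching = sorted(matching_seqs)
    checkpoint = 0
    history = []
    for _ in range(slots):
        start = checkpoint
        end = min(start + changes_per_slot, total_changes)
        hits = [seq for seq in matching if start < seq <= end]
        if hits:
            # A document got through the filter, so the job can checkpoint.
            checkpoint = max(hits)
        history.append((start, end, checkpoint))
    return history


# The job is swapped out before reaching the only matching change:
# every slot starts from 0 again.
print(simulate(total_changes=1000, matching_seqs=[900],
               changes_per_slot=500, slots=3))

# With enough time per slot the job reaches the match and checkpoints.
print(simulate(total_changes=1000, matching_seqs=[900],
               changes_per_slot=1000, slots=2))
```

In this model, raising `max_jobs` corresponds to the job not being swapped
out at all, so it eventually reaches a matching change and checkpoints.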


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
