[jira] [Created] (COUCHDB-1505) Error on cancelling replication - possibly related to hanging replications

Alex Markham (JIRA) Wed, 27 Jun 2012 02:31:51 -0700

Alex Markham created COUCHDB-1505:
-------------------------------------

             Summary: Error on cancelling replication - possibly related to 
hanging replications
                 Key: COUCHDB-1505
                 URL: https://issues.apache.org/jira/browse/COUCHDB-1505
             Project: CouchDB
          Issue Type: Bug
          Components: Replication
    Affects Versions: 1.2
         Environment: CentOS 5.6 x64. WAN replication (between datacentres). 
Replication is controlled by a cronjob that curls every 5 minutes. Using pull 
replication with a filter.
            Reporter: Alex Markham
         Attachments: couchjs.txt, replicationcancelerror1.log


We run a cronjob every 5 minutes that cancels replication and then starts it 
again. Occasionally, when cancelling replication jobs, a stack trace appears 
in the couchdb log (attached).

Other observations: perhaps unrelated, but over time we slowly gather 
"zombie" couchjs processes. After a month or so (it differs for each server) 
the count approaches our os_process_limit of 200 and we restart couchdb. 
"Zombie" is speculation here, but there seems to be no need for a hundred or 
more couchjs processes when just replicating 10 databases with occasional 
indexing; after a restart the count drops right back down. The start times of 
those processes are also weeks old. This may be normal, not sure.
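
For anyone wanting to check for the same symptom, here is a minimal sketch of 
how long-lived couchjs processes could be listed. It is not part of our setup: 
the function names and the one-week threshold are assumptions, and it relies 
only on the standard `ps` elapsed-time column (format [[dd-]hh:]mm:ss).

```python
import subprocess


def parse_etime(etime):
    """Convert a ps elapsed time '[[dd-]hh:]mm:ss' into seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    # Pad missing leading fields (hours) with zero.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def old_processes(ps_output, name="couchjs", min_age_seconds=7 * 86400):
    """Return (pid, age_seconds) for processes whose command contains `name`
    and that are at least `min_age_seconds` old (threshold is an assumption)."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, etime, command = fields
        if name in command:
            age = parse_etime(etime)
            if age >= min_age_seconds:
                found.append((int(pid), age))
    return found


def list_old_couchjs():
    """Run ps without headers and report old couchjs processes."""
    out = subprocess.check_output(
        ["ps", "-e", "-o", "pid=", "-o", "etime=", "-o", "args="],
        universal_newlines=True,
    )
    return old_processes(out)
```

Comparing len(list_old_couchjs()) against os_process_limit over a few weeks 
would show whether the count really creeps upward between restarts.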

Why do we cancel replication and restart it? We found that if we don't, WAN 
replications can hang: curling /_replicate reports that the continuous 
replication is already running, but the replications stop updating and the 
document counts in the databases diverge. Immediately after we re-enabled the 
"cancel":true /_replicate call before each restart, these stack traces 
re-appeared and the replication caught up.
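
The cancel-then-restart cycle described above can be sketched roughly as 
follows. Our real cronjob uses curl; this Python version is only an 
illustration, and the base URL, database names, filter name and helper names 
are all hypothetical. It assumes the cancel request repeats the original 
replication body with "cancel": true added, and that CouchDB answers 404 when 
no matching replication is running.

```python
import json
import urllib.error
import urllib.request

COUCH_URL = "http://localhost:5984"  # assumption: local CouchDB, no auth


def replication_body(source, target, filter_name, cancel=False):
    """Build the JSON body for a filtered, continuous pull replication."""
    body = {
        "source": source,
        "target": target,
        "continuous": True,
        "filter": filter_name,
    }
    if cancel:
        # Same body as the start request, plus the cancel flag.
        body["cancel"] = True
    return body


def post_replicate(body, base_url=COUCH_URL, timeout=30):
    """POST a replication request to /_replicate and decode the JSON reply."""
    req = urllib.request.Request(
        base_url + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def restart_replication(source, target, filter_name, post=post_replicate):
    """Cancel the replication if it exists, then start it again."""
    cancel_result = None
    try:
        cancel_result = post(
            replication_body(source, target, filter_name, cancel=True)
        )
    except urllib.error.HTTPError as err:
        # Assumed: 404 means nothing was running, which is fine here.
        if err.code != 404:
            raise
    start_result = post(replication_body(source, target, filter_name))
    return cancel_result, start_result
```

A cron entry would call restart_replication once per database, for example 
restart_replication("http://remote.example:5984/db", "db", "ddoc/myfilter"), 
where every argument is a placeholder.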



