[jira] [Updated] (COUCHDB-1505) Error on cancelling replication - possibly related to hanging replications

Alex Markham (JIRA) Wed, 27 Jun 2012 02:31:51 -0700

     [ 
https://issues.apache.org/jira/browse/COUCHDB-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alex Markham updated COUCHDB-1505:
----------------------------------

    Attachment: replicationcancelerror1.log
                couchjs.txt
    
> Error on cancelling replication - possibly related to hanging replications
> --------------------------------------------------------------------------
>
>                 Key: COUCHDB-1505
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1505
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.2
>         Environment: CentOS 5.6 x64. WAN replication (between datacentres). 
> A cron job issues the replication curl requests every 5 minutes. Using pull 
> replication with a filter.
>            Reporter: Alex Markham
>              Labels: cancel, hang, replication
>         Attachments: couchjs.txt, replicationcancelerror1.log
>
>
> We run a cron job that cancels replication and then starts it again every 5 
> minutes. Occasionally, when a replication job is cancelled, a stack trace 
> appears in the CouchDB log (attached).
> Other observations, perhaps unrelated: over time we slowly accumulate 
> "zombie" couchjs processes. After a month or so (it differs for each server) 
> the count approaches our os_process_limit of 200 and we restart CouchDB. 
> "Zombie" is speculation here, but there seems to be no need for a hundred or 
> more couchjs processes when just replicating 10 databases with occasional 
> indexing, and after a restart the count drops right back down. The start 
> times of those processes are also weeks old. This may be normal; we are not 
> sure.
> Why do we cancel replication and restart it? We found that if we don't, WAN 
> replications can hang: curling /_replicate reports that the continuous 
> replication is already running, but the replications are not updating and 
> the document counts in the databases diverge. Immediately after we 
> re-enabled the "cancel":true /_replicate call ahead of each start, these 
> stack traces reappeared and the replication caught up.
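
For readers who want to reproduce the cancel-then-restart cycle described above, here is a minimal sketch of what such a cron job might do. It is not the reporter's actual script: the host names, database names, and filter name are hypothetical, and the helper functions are made up for illustration. The sketch assumes that a continuous replication is cancelled by POSTing the same source, target, and filter to /_replicate with "cancel": true added, and that a 404 on the cancel request means no matching replication was running.

```python
import json
import urllib.error
import urllib.request


def replication_body(source, target, filter_name=None, cancel=False):
    """Build the JSON body for POST /_replicate (continuous replication)."""
    body = {"source": source, "target": target, "continuous": True}
    if filter_name is not None:
        body["filter"] = filter_name
    if cancel:
        # The cancel request repeats the original fields and adds "cancel".
        body["cancel"] = True
    return body


def post_replicate(couch_url, body, timeout=30):
    """POST a replication request to the local CouchDB and decode the reply."""
    req = urllib.request.Request(
        couch_url.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def restart_replication(couch_url, source, target, filter_name=None,
                        post=post_replicate):
    """Cancel a continuous replication, then start it again."""
    try:
        post(couch_url, replication_body(source, target, filter_name, cancel=True))
    except urllib.error.HTTPError as err:
        # Assumption: 404 means there was nothing to cancel; anything else is fatal.
        if err.code != 404:
            raise
    return post(couch_url, replication_body(source, target, filter_name))


if __name__ == "__main__":
    # Hypothetical values; a pull replication runs on the target server.
    restart_replication(
        "http://localhost:5984",
        "http://remote.example.com:5984/mydb",
        "mydb",
        filter_name="app/by_site",
    )
```

The `post` parameter exists only so the HTTP call can be swapped out when testing; a cron entry would simply run the script every 5 minutes.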

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
