Replication hanging/failing on docs with lots of revisions
----------------------------------------------------------

                 Key: COUCHDB-1364
                 URL: https://issues.apache.org/jira/browse/COUCHDB-1364
             Project: CouchDB
          Issue Type: Bug
          Components: Replication
    Affects Versions: 1.1.1, 1.0.3
         Environment: CentOS 5.6/x64, SpiderMonkey 1.8.5, CouchDB 1.1.1 patched 
for COUCHDB-1340 and COUCHDB-1333
            Reporter: Alex Markham


We have a setup where replication from a 1.1.1 couch is hanging. This is WAN 
replication which previously worked 1.0.3 <-> 1.0.3.

Replicating from the 1.1.1 -> 1.0.3 showed an error very similar to 
COUCHDB-1340, which I presumed meant the URL was too long, so I upgraded the 
1.0.3 couch to our 1.1.1 build, which has this patched.

However, the replication between the two 1.1.1 couches hangs at a certain 
point when doing continuous pull replication: it doesn't checkpoint, just 
stays on "starting". When cancelled and restarted, though, it does pick up 
the latest documents (so doc counts are equal). The last calls I see to the 
source db when it hangs are multiple long GETs for a document with 2051 open 
revisions on the source and 498 on the target.
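
For reference, the pull replication is started with a call along these lines 
(a Python sketch using the requests library; the hostnames and db names are 
placeholders, not our real ones):

    import json
    import requests

    # Continuous pull replication: the local couch pulls changes from
    # the remote source over the WAN. Hostnames/db names are placeholders.
    resp = requests.post(
        "http://local-couch:5984/_replicate",
        data=json.dumps({
            "source": "http://remote-couch:5984/master_db",
            "target": "master_db",
            "continuous": True,
        }),
        headers={"Content-Type": "application/json"},
    )
    print(resp.status_code, resp.text)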

When doing a push replication, the _replicate call just returns a 500 error 
(at about the same seq id as the pull replication hangs at), saying:

[Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died with 
reason {noproc,
                                                       {gen_server,call,
                                                        [<0.6382.115>,
                                                         {pread_iolist,
                                                          79043596434},
                                                         infinity]}}

The last call I see on the target of the push replication is:

[Thu, 15 Dec 2011 10:09:17 GMT] [info] [<0.580.50>] 10.35.9.79 - - 'POST' 
/master_db/_missing_revs 200

There is no stack trace.
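
The push direction is started the same way with source and target swapped, 
something like this sketch (again, placeholder hostnames):

    import json
    import requests

    # Push replication: the local db is the source, the remote couch is
    # the target. This is the call that comes back with the 500.
    resp = requests.post(
        "http://local-couch:5984/_replicate",
        data=json.dumps({
            "source": "master_db",
            "target": "http://remote-couch:5984/master_db",
        }),
        headers={"Content-Type": "application/json"},
    )
    print(resp.status_code, resp.text)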


Comparing the open_revs=all count on the documents with many open revs shows 
differing numbers on each side of the WAN replication, and between different 
couches in the same datacentre. Some of these documents have not been updated 
for months. Is it possible that 1.0.3 just skipped over this issue and carried 
on replicating, but 1.1.1 does not?
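
The comparison is roughly this (a Python sketch; the host list and doc id are 
placeholders). With Accept: application/json, the open_revs=all response comes 
back as a JSON array with one entry per open revision:

    import requests

    # Count open revisions of one of the problem documents on several
    # couches. Hosts and the doc id are placeholders.
    hosts = [
        "http://dc1-couch-a:5984",
        "http://dc1-couch-b:5984",
        "http://dc2-couch-a:5984",
    ]
    doc_id = "some-problem-doc"

    for host in hosts:
        resp = requests.get(
            "%s/master_db/%s" % (host, doc_id),
            params={"open_revs": "all"},
            headers={"Accept": "application/json"},
        )
        # Each array entry is {"ok": <doc>} for an available revision.
        print(host, len(resp.json()))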

I know I can hack the replication to work by updating the checkpoint seq past 
this point in the _local document, but I think there is a real bug here 
somewhere.
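
By that hack I mean something like the sketch below. The _local doc id is the 
replication id the replicator generated (elided here), and "source_last_seq" 
is the field name as I see it in the checkpoint docs on our build, so treat 
both as assumptions:

    import json
    import requests

    # Bump the replication checkpoint past the problem seq. The _local
    # doc id below is a placeholder for the real replication id, and
    # "source_last_seq" is the field as it appears in our checkpoint
    # docs; both are assumptions.
    url = "http://local-couch:5984/master_db/_local/<replication-id>"

    doc = requests.get(url).json()   # keeps the current _rev
    doc["source_last_seq"] = 123456  # a seq just past the hanging point
    resp = requests.put(
        url,
        data=json.dumps(doc),
        headers={"Content-Type": "application/json"},
    )
    print(resp.status_code, resp.text)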

If Wireshark captures or other debug data are required, please say so.


