[ https://issues.apache.org/jira/browse/COUCHDB-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Vatamaniuc closed COUCHDB-2959.
------------------------------------

> Deadlock condition in replicator with remote source and configured 1 http connection
> ------------------------------------------------------------------------------------
>
>                 Key: COUCHDB-2959
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2959
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Nick Vatamaniuc
>         Attachments: rep.py
>
>
> A deadlock can cause newly started replications to get stuck
> (and never update their state to triggered). This happens with a remote
> source when using a single http connection and a single worker.
> The deadlock occurs in this case:
>  - The replication process starts, and it starts the changes reader:
> https://github.com/apache/couchdb-couch-replicator/blob/master/src/couch_replicator.erl#L276
>  - The changes reader consumes the worker from the httpc pool. At some point it
> will make a call back to the replication process to report how much work it
> has done, using the gen_server call {{report_seq_done}}.
>  - In the meantime, the main replication process calls {{get_pending_changes}}
> to get changes from the source. If the source is remote, it will attempt to
> consume a worker from the httpc pool. However, the worker is in use by the
> changes feed process, so {{get_pending_changes}} is blocked waiting for a
> worker to be released.
>  - So the changes feed is waiting for the {{report_seq_done}} call to the
> replication process to return while holding a worker, and the main
> replication process is waiting for the httpc pool to release the worker, so
> it never responds to {{report_seq_done}}.
> Attached is a Python script (rep.py) to reproduce the issue. The script
> creates n databases (tested with n=1000), then replicates those databases to
> 1 single database. It also needs the Python CouchDB module from pip (or
> package repos).
>  1. It can be run from ipython by importing {{rep}}.
>  2. Start the dev cluster: {{./dev/run --admin=adm:pass}}
>  3. {{rep.replicate_1_to_n(1000)}}
> wait....
>  4. {{rep.check_untriggered()}}
> When it fails, the result might look like this:
> {code}
> {
>   'rdyno_00001_00006': None,
>   'rdyno_00001_00158': None
> }
> {code}
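
The circular wait described in the issue can be sketched in a few lines of Python. This is only an illustrative model, not the CouchDB code: the semaphore stands in for the httpc pool, the queue stands in for the replication process's gen_server mailbox, and the names `simulate`, `changes_reader` and `replication_process` are hypothetical. Timeouts replace the real (unbounded) waits so that the script terminates and can report the deadlock instead of hanging.

```python
import queue
import threading


def simulate(pool_size, timeout=0.5):
    """Return True if the reader and the main process end up waiting on each other."""
    pool = threading.Semaphore(pool_size)  # stand-in for the httpc worker pool
    mailbox = queue.Queue()                # stand-in for the gen_server mailbox
    reader_has_worker = threading.Event()
    outcome = {"reader_ok": False, "main_ok": False}

    def changes_reader():
        pool.acquire()                     # changes feed takes a worker
        reader_has_worker.set()
        reply = threading.Event()
        # Synchronous call: post the request, then block until it is answered,
        # still holding the worker (models the report_seq_done call).
        mailbox.put(("report_seq_done", reply))
        outcome["reader_ok"] = reply.wait(2 * timeout)
        pool.release()

    def replication_process():
        reader_has_worker.wait()
        # Models get_pending_changes against a remote source: it needs a
        # worker before the process can go back to serving its mailbox.
        got_worker = pool.acquire(timeout=timeout)
        outcome["main_ok"] = got_worker
        if not got_worker:
            return                         # in the real system this wait never ends
        pool.release()
        _, reply = mailbox.get(timeout=timeout)
        reply.set()                        # answer the reader's call

    threads = [threading.Thread(target=changes_reader),
               threading.Thread(target=replication_process)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not outcome["reader_ok"] and not outcome["main_ok"]


if __name__ == "__main__":
    print("pool of 1 deadlocks:", simulate(1))
    print("pool of 2 deadlocks:", simulate(2))
```

In this model a pool of one worker reproduces the mutual wait, while a second worker lets the main process finish its request and answer the reader. Whether a larger pool is an appropriate fix for the replicator itself is not something this sketch can show.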