[ 
https://issues.apache.org/jira/browse/COUCHDB-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Vatamaniuc closed COUCHDB-2959.
------------------------------------

> Deadlock condition in replicator with remote source and configured 1 http 
> connection
> ------------------------------------------------------------------------------------
>
>                 Key: COUCHDB-2959
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2959
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Nick Vatamaniuc
>         Attachments: rep.py
>
>
> A deadlock can cause newly started replications to get stuck (and never 
> update their state to triggered). This happens with a remote source when 
> using a single http connection and a single worker.
>  The deadlock occurs in this case:
>  - Replication process starts, it starts the changes reader: 
> https://github.com/apache/couchdb-couch-replicator/blob/master/src/couch_replicator.erl#L276
>  - Changes reader consumes the worker from httpc pool. At some point it will 
> make a call back to the replication process to report how much work it has 
> done using gen_server call {{report_seq_done}}
>  - In the meantime, main replication process calls {{get_pending_changes}} to 
> get changes from the source. If the source is remote it will attempt to 
> consume a worker from httpc pool. However, the worker is in use by the changes 
> feed process, so get_pending_changes is blocked waiting for a worker to be 
> released.
>  - So the changes feed is waiting for its report_seq_done call to the 
> replication process to return while holding a worker, and the main replication 
> process is waiting for the httpc pool to release the worker, so it never 
> responds to report_seq_done.
> Attached a python script (rep.py) to reproduce the issue. The script creates n 
> databases (tested with n=1000), then replicates those databases to 1 single 
> database. It also needs the Python CouchDB module from pip (or package repos).
> 1. It can be run from ipython by importing {{rep}}.
> 2. start dev cluster {{./dev/run --admin=adm:pass}}
> 3. {{rep.replicate_1_to_n(1000)}}
> wait....
> 4. {{rep.check_untriggered()}}
> When it fails, result might look like this:
> {code}
> {
>  'rdyno_00001_00006': None,
>  'rdyno_00001_00158': None
> }
> {code}
>  
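
The circular wait described in the issue can be sketched with a small threading model. This is only an illustration, not CouchDB code: the semaphore stands in for the httpc pool, the queue stands in for the replication process mailbox, and the name `simulate` plus all timeouts are assumptions added so the demo terminates (the real calls would block indefinitely).

```python
import queue
import threading


def simulate(pool_size, timeout=0.2):
    """Return True if the modeled replication job deadlocks.

    Hypothetical model: `pool_size` plays the role of the number of
    http connections available to the job.
    """
    pool = threading.Semaphore(pool_size)   # stand-in for the httpc pool
    mailbox = queue.Queue()                 # stand-in for the gen_server mailbox
    reader_has_worker = threading.Event()
    result = {"main_got_worker": False, "reader_got_reply": False}

    def changes_reader():
        pool.acquire()                      # consume a worker for the changes feed
        reader_has_worker.set()
        reply = threading.Event()
        mailbox.put(reply)                  # synchronous report_seq_done call
        # The worker stays checked out while waiting for the reply.
        result["reader_got_reply"] = reply.wait(2 * timeout)
        pool.release()

    def replication_process():
        reader_has_worker.wait()
        # get_pending_changes needs a worker when the source is remote.
        if not pool.acquire(timeout=timeout):
            return                          # the real process would block here forever
        result["main_got_worker"] = True
        pool.release()
        # Only after get_pending_changes returns can the mailbox be served.
        try:
            mailbox.get(timeout=2 * timeout).set()
        except queue.Empty:
            pass

    threads = [
        threading.Thread(target=changes_reader),
        threading.Thread(target=replication_process),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not result["main_got_worker"] and not result["reader_got_reply"]
```

In this model `simulate(1)` reports a deadlock, while `simulate(2)` does not, because the replication process can obtain its own worker and then answer the pending call. That only shows why the single-connection case is special; it says nothing about how the replicator itself should be fixed.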



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
