GitHub user nickva opened a pull request:

    https://github.com/apache/couchdb-couch-replicator/pull/21

    Fix race condition in worker release on connection_closing state.

    This is exposed in the replicator large attachments tests case,
    replicating from local to remote. In the current test configuration
    it appears about once in 20-40 times. The failure manifests as
    up as an {error, req_timedout} exception in the logs from one of the
    PUT methods, during push replication. Then database comparison fails
    because not all documents made it to the target.
    
    Gory details:
    
    After ibrowse receives Connection: Close header it will go into
    shutdown 'connection_closing' state.
    
    couch_replicator_httpc handles that state by trying to close
    the socket and retrying, hoping that it would pick up a new worker from
    the pool on next retry in couch_replicator_httpc.erl:
    
    ```
    process_response({error, connection_closing}, Worker, HttpDb, Params, _Cb) 
->
        ...
    ```
    
    But it did not directly have a way to ensure socket is really closed,
    instead it called ibrowse_http_client:stop(Worker). That didn't wait for
    worker to die, also worker was returned back to the pool asynchronously,
    in the 'after' clause in couch_replicator_httpc:send_req/3.
    
    This worker which could still be alive but in a dying process,
    could have been picked up immediately during the retry.
    ibrowse in ibrowse:do_send_req/7 will handle a dead workers
    process as {error, req_timedout}, which is what the intermitend
    test failure showed in the log:
    
    The fix:
    
     * Make sure worker is really stopped after calling stop.
    
     * Make sure worker is returned to the pool synchronously. So that
       on retry, a worker in a known good state is picked up.
    
    COUCHDB-2833

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloudant/couchdb-couch-replicator 
2833-fix-race-condition-during-worker-termination

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/couchdb-couch-replicator/pull/21.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21
    
----
commit 307ae6d92fd300d31d6eb96bd5aca5b9a558a66d
Author: Nick Vatamaniuc <[email protected]>
Date:   2015-10-15T17:54:10Z

    Fix race condition in worker release on connection_closing state.
    
    This is exposed in the replicator large attachments tests case,
    replicating from local to remote. In the current test configuration
    it appears about once in 20-40 times. The failure manifests as
    up as an {error, req_timedout} exception in the logs from one of the
    PUT methods, during push replication. Then database comparison fails
    because not all documents made it to the target.
    
    Gory details:
    
    After ibrowse receives Connection: Close header it will go into
    shutdown 'connection_closing' state.
    
    couch_replicator_httpc handles that state by trying to close
    the socket and retrying, hoping that it would pick up a new worker from
    the pool on next retry in couch_replicator_httpc.erl:
    
    ```
    process_response({error, connection_closing}, Worker, HttpDb, Params, _Cb) 
->
        ...
    ```
    
    But it did not directly have a way to ensure socket is really closed,
    instead it called ibrowse_http_client:stop(Worker). That didn't wait for
    worker to die, also worker was returned back to the pool asynchronously,
    in the 'after' clause in couch_replicator_httpc:send_req/3.
    
    This worker which could still be alive but in a dying process,
    could have been picked up immediately during the retry.
    ibrowse in ibrowse:do_send_req/7 will handle a dead workers
    process as {error, req_timedout}, which is what the intermitend
    test failure showed in the log:
    
    The fix:
    
     * Make sure worker is really stopped after calling stop.
    
     * Make sure worker is returned to the pool synchronously. So that
       on retry, a worker in a known good state is picked up.
    
    COUCHDB-2833

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to