GitHub user nickva opened a pull request:
https://github.com/apache/couchdb-couch-replicator/pull/21
Fix race condition in worker release on connection_closing state.
This is exposed by the replicator large attachments test case,
replicating from local to remote. In the current test configuration
it appears about once in 20-40 runs. The failure manifests as
an {error, req_timedout} exception in the logs from one of the
PUT requests during push replication; the database comparison then
fails because not all documents made it to the target.
Gory details:
After ibrowse receives a Connection: Close header, it goes into
the 'connection_closing' shutdown state.
couch_replicator_httpc handles that state by trying to close
the socket and retrying, hoping to pick up a new worker from
the pool on the next retry in couch_replicator_httpc.erl:
```
process_response({error, connection_closing}, Worker, HttpDb, Params, _Cb) ->
    ...
```
But it had no direct way to ensure the socket was really closed;
instead it called ibrowse_http_client:stop(Worker). That did not wait
for the worker to die, and the worker was also returned to the pool
asynchronously, in the 'after' clause of couch_replicator_httpc:send_req/3.
That worker, possibly still alive but in a dying process,
could be picked up immediately during the retry.
ibrowse, in ibrowse:do_send_req/7, handles a dead worker
process as {error, req_timedout}, which is what the intermittent
test failure showed in the log.
The fix:
* Make sure the worker is really stopped after calling stop.
* Make sure the worker is returned to the pool synchronously, so that
on retry, a worker in a known good state is picked up.
COUCHDB-2833
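The two parts of the fix can be sketched in isolation. This is a hedged illustration, not the actual patch: stop_worker_sync/1 and release_worker_sync/2 are hypothetical helper names, the 5000 ms timeout is an assumption, and the pool is assumed to be a gen_server. The idea is to monitor the worker so the caller blocks until the process is really dead, and to return the worker with gen_server:call (rather than a cast) so the pool has fully processed the release before the retry checks out a worker.

```erlang
%% Hypothetical helpers illustrating the fix; the function names and
%% timeout are assumptions, not the actual couch_replicator_httpc code.

%% Stop the worker and block until its process is really gone.
stop_worker_sync(Worker) ->
    Ref = erlang:monitor(process, Worker),
    ibrowse_http_client:stop(Worker),
    receive
        {'DOWN', Ref, process, Worker, _Reason} ->
            ok
    after 5000 ->
        erlang:demonitor(Ref, [flush]),
        {error, timeout}
    end.

%% Return the worker to the pool synchronously: gen_server:call does
%% not return until the pool has handled the message, so the next
%% checkout cannot race with a half-completed release.
release_worker_sync(Pool, Worker) ->
    gen_server:call(Pool, {release_worker_sync, Worker}).
```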
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloudant/couchdb-couch-replicator
2833-fix-race-condition-during-worker-termination
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/couchdb-couch-replicator/pull/21.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21
----
commit 307ae6d92fd300d31d6eb96bd5aca5b9a558a66d
Author: Nick Vatamaniuc <[email protected]>
Date: 2015-10-15T17:54:10Z
Fix race condition in worker release on connection_closing state.
This is exposed by the replicator large attachments test case,
replicating from local to remote. In the current test configuration
it appears about once in 20-40 runs. The failure manifests as
an {error, req_timedout} exception in the logs from one of the
PUT requests during push replication; the database comparison then
fails because not all documents made it to the target.
Gory details:
After ibrowse receives a Connection: Close header, it goes into
the 'connection_closing' shutdown state.
couch_replicator_httpc handles that state by trying to close
the socket and retrying, hoping to pick up a new worker from
the pool on the next retry in couch_replicator_httpc.erl:
```
process_response({error, connection_closing}, Worker, HttpDb, Params, _Cb) ->
    ...
```
But it had no direct way to ensure the socket was really closed;
instead it called ibrowse_http_client:stop(Worker). That did not wait
for the worker to die, and the worker was also returned to the pool
asynchronously, in the 'after' clause of couch_replicator_httpc:send_req/3.
That worker, possibly still alive but in a dying process,
could be picked up immediately during the retry.
ibrowse, in ibrowse:do_send_req/7, handles a dead worker
process as {error, req_timedout}, which is what the intermittent
test failure showed in the log.
The fix:
* Make sure the worker is really stopped after calling stop.
* Make sure the worker is returned to the pool synchronously, so that
on retry, a worker in a known good state is picked up.
COUCHDB-2833
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---