[
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Randall Leeds updated COUCHDB-597:
----------------------------------
Attachment: couchdb_597.patch
I believe this patch fixes most of the problems we're seeing here.
The solution, as discussed, is to remove inactivity_timeout from the options
passed to ibrowse and handle timeouts manually (here using the timer module).
In my testing, most of the timeouts I could reproduce were caused by not
reading data from ibrowse fast enough. In other words, replication from a
remote database was terminating because processing the changes took a long
time, and the socket sat inactive while couch_rep_changes_feed had a full
queue of rows. Therefore, the patch does not set a timeout unless the missing
revs server is waiting for more changes.
Timeouts should still occur if the socket is idle and the local queue of
received changes is empty. Errors should be caught appropriately so that real
problems still bubble up.
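
As a rough illustration of that rule (a sketch in Python, not the Erlang in
the patch), the consumer below applies an inactivity timeout only when its
local buffer is empty. The names next_row, IdleTimeout, and idle_timeout are
hypothetical and do not appear in CouchDB or ibrowse.

```python
import queue


class IdleTimeout(Exception):
    """Raised when no rows are buffered and the source stays silent."""


def next_row(rows, idle_timeout):
    """Return the next buffered row, waiting at most idle_timeout seconds.

    rows is a queue.Queue filled by whatever reads from the socket
    (hypothetical; it stands in for the local queue of received changes).
    """
    # Buffered rows are handed out immediately. No timer is involved here,
    # so a slow consumer with a full queue can never trigger a timeout.
    try:
        return rows.get_nowait()
    except queue.Empty:
        pass

    # The queue is empty, so we are genuinely waiting on the source.
    # Only in this state does the inactivity timeout apply.
    try:
        return rows.get(timeout=idle_timeout)
    except queue.Empty:
        raise IdleTimeout("no data for %s seconds" % idle_timeout) from None
```

With rows buffered, next_row returns at once no matter how long the caller
took to come back for them. Only a wait on an empty queue can raise
IdleTimeout.
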
I implemented retry logic for attachments in a manner similar to
couch_rep_httpc. I also had to add some after clauses now that the
inactivity_timeout is no longer set.
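
The general shape of such retry logic can be sketched as follows, in Python
rather than Erlang. The doubling pause, the retry count, and the name
fetch_with_retry are assumptions for illustration. They are not taken from
couch_rep_httpc or from the patch.

```python
import time


def fetch_with_retry(fetch, retries=3, pause=0.5, sleep=time.sleep):
    """Call fetch(); on a transient error wait, double the pause, try again.

    fetch is any zero-argument callable (hypothetical stand-in for an
    attachment request). OSError stands in for transient network errors.
    Any other exception propagates immediately, so real problems are not
    hidden by the retry loop.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == retries:
                raise  # out of retries: let the error reach the caller
            sleep(pause)
            pause *= 2
```

Passing sleep as a parameter keeps the sketch easy to exercise without real
delays.
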
The patch applies cleanly to trunk and 0.11.x, so please review! I think this
would be a very good patch to get into 0.11, so long as Noah hasn't built the
artifacts yet.
> Replication tasks crash.
> ------------------------
>
> Key: COUCHDB-597
> URL: https://issues.apache.org/jira/browse/COUCHDB-597
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.11
> Reporter: Robert Newson
> Attachments: couchdb_597.patch
>
>
> If I kick off 10 replication tasks in quick succession, occasionally one or
> two of the replication tasks will die and not be resumed. It seems that the
> stat tracking is a little buggy, and under stress can eventually cause a
> permanent failure of the supervised replication task;
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [<0.80.0>] {error_report,<0.30.0>,
> {<0.80.0>,supervisor_report,
> [{supervisor,{local,couch_rep_sup}},
> {errorContext,shutdown_error},
> {reason,killed},
> {offender,
> [{pid,<0.6700.11>},
> {name,"fcbb13200a1618cf983b347f4d2c9835+create_target"},
> {mfa,
> {gen_server,start_link,
> [couch_rep,
> ["fcbb13200a1618cf983b347f4d2c9835",
> {[{<<"create_target">>,true},
> {<<"source">>,<<"http://node:5984/perf-p2">>},
> {<<"target">>,<<"perf-p2">>}]},
> {user_ctx,null,[<<"_admin">>]}],
> []]}},
> {restart_type,temporary},
> {shutdown,1},
> {child_type,worker}]}]}}
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process
> <0.6705.11> with exit value:
> {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}