[ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-597:
----------------------------------

    Attachment: couchdb_597.patch

I believe this patch fixes most of the problems we're seeing here.

The solution, as discussed, is to remove the inactivity_timeout from options 
passed to ibrowse and handle timeouts manually (here using the timer module).

In my testing, most of the timeouts I could reproduce were caused by not 
reading data from ibrowse fast enough. In other words, replication from a 
remote database was terminating because processing the changes took a long 
time, and the socket sat inactive while couch_rep_changes_feed had a full 
queue of rows. With this patch, therefore, a timeout is not set unless the 
missing revs server is waiting for more changes.
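
To make the rule above concrete, here is a rough sketch of the arming logic. 
It is written in Python purely for illustration and is not the Erlang in the 
patch; ChangesBuffer, its method names, and the explicit `now` argument are 
all hypothetical. The idea is only that the idle deadline exists while the 
consumer is starved, and is cleared whenever rows are buffered.

```python
from collections import deque


class ChangesBuffer:
    """Toy stand-in for a changes-feed buffer with a manually managed idle timeout."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # allowed silence while the consumer waits
        self.rows = deque()               # rows received but not yet consumed
        self.consumer_waiting = False     # True while the consumer is blocked on us
        self.deadline = None              # None means no timeout is armed

    def on_row(self, row, now):
        """A row arrived from the socket."""
        self.rows.append(row)
        self._rearm(now)

    def request_row(self, now):
        """The consumer asks for a row; returns it, or None if it has to wait."""
        if self.rows:
            self.consumer_waiting = False
            row = self.rows.popleft()
        else:
            self.consumer_waiting = True
            row = None
        self._rearm(now)
        return row

    def _rearm(self, now):
        if self.consumer_waiting and not self.rows:
            # Consumer is starved: arm the timeout once, do not extend it on re-polls.
            if self.deadline is None:
                self.deadline = now + self.idle_timeout
        else:
            # Rows are buffered or nobody is waiting: a quiet socket is not an error.
            self.deadline = None

    def timed_out(self, now):
        return self.deadline is not None and now >= self.deadline
```

With a 30 second idle_timeout, a buffer holding one row never times out no 
matter how slowly it is drained; the deadline only starts counting once 
request_row finds the queue empty.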

Timeouts should still occur if the socket is idle and the local queue of 
received changes is empty. Errors should be caught appropriately so that real 
problems still bubble up.

I implemented retry logic for attachments in a manner similar to 
couch_rep_httpc. I had to add some after clauses now that the 
inactivity_timeout is no longer set.
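
As a rough illustration of the retry shape described above (again in Python, 
not the Erlang from the patch), the sketch below retries a fetch only when it 
times out and lets every other error propagate. The function name, the retry 
count, and the doubling pause are assumptions made for the example and are 
not taken from couch_rep_httpc.

```python
import time


def fetch_with_retries(fetch, retries=3, pause=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds, retrying only on TimeoutError.

    Any other exception is treated as a real problem and propagates at once.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except TimeoutError:
            if attempt == retries:
                raise        # out of retries: let the timeout surface
            sleep(pause)     # wait before the next attempt
            pause *= 2       # back off a little more each time
```

The `sleep` parameter exists only so the pause can be replaced in tests; 
callers would normally leave it at its default.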

The patch applies cleanly to trunk and 0.11.x, so please review. I think this 
would be a very good patch to get into 0.11, as long as Noah hasn't built the 
artifacts yet.

> Replication tasks crash.
> ------------------------
>
>                 Key: COUCHDB-597
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-597
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.11
>            Reporter: Robert Newson
>         Attachments: couchdb_597.patch
>
>
> If I kick off 10 replication tasks in quick succession, occasionally one or 
> two of the replication tasks will die and not be resumed. It seems that the 
> stat tracking is a little buggy, and under stress can eventually cause a 
> permanent failure of the supervised replication task;
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [<0.80.0>] {error_report,<0.30.0>,
>     {<0.80.0>,supervisor_report,
>      [{supervisor,{local,couch_rep_sup}},
>       {errorContext,shutdown_error},
>       {reason,killed},
>       {offender,
>           [{pid,<0.6700.11>},
>            {name,"fcbb13200a1618cf983b347f4d2c9835+create_target"},
>            {mfa,
>                {gen_server,start_link,
>                    [couch_rep,
>                     ["fcbb13200a1618cf983b347f4d2c9835",
>                      {[{<<"create_target">>,true},
>                        {<<"source">>,<<"http://node:5984/perf-p2">>},
>                        {<<"target">>,<<"perf-p2">>}]},
>                      {user_ctx,null,[<<"_admin">>]}],
>                     []]}},
>            {restart_type,temporary},
>            {shutdown,1},
>            {child_type,worker}]}]}}
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process 
> <0.6705.11> with exit value: 
> {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
