[jira] Updated: (COUCHDB-597) Replication tasks crash.

Adam Kocoloski (JIRA) Sat, 19 Dec 2009 19:00:54 -0800

     [ 
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Adam Kocoloski updated COUCHDB-597:
-----------------------------------


Hi Robert, I can reproduce the crashes locally and I've discovered why they 
happen independently of the {ref(), integer()} problem.  The basic issue is 
that attachment downloads do not employ the same retry checks that we do for 
regular document GETs.  For instance, the attachment receiver process 
associated with a replication would wait indefinitely for response headers 
even though it had an error message in its mailbox informing it that the 
request had failed.  Eventually the changes feed times out and the 
replication crashes.
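
The failure mode above can be illustrated with a small sketch.  It is written 
in Python rather than Erlang and is not the CouchDB code: the queue stands in 
for the receiver's mailbox, and the message tags, RequestFailed, and 
wait_for_headers are all hypothetical names.  The point is only that the wait 
has a finite timeout and that an error message is acted on instead of being 
left unread.

```python
import queue


class RequestFailed(Exception):
    """Raised when the mailbox holds an error for the pending request."""


def wait_for_headers(mailbox, timeout_s):
    """Wait for response headers without blocking forever.

    `mailbox` is a queue.Queue of (tag, payload) tuples, a stand-in for an
    Erlang process mailbox. The tags "headers" and "error" are assumptions
    made for this sketch.
    """
    while True:
        try:
            # The timeout applies to each individual wait for a message.
            tag, payload = mailbox.get(timeout=timeout_s)
        except queue.Empty:
            raise TimeoutError("no response headers within %s s" % timeout_s)
        if tag == "headers":
            return payload
        if tag == "error":
            # Surface the failure so the caller can decide to retry.
            raise RequestFailed(payload)
        # Anything else is unrelated to this request; keep waiting.
```

A caller would catch RequestFailed and reissue the request, which is the 
retry check that regular document GETs already get.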

If I apply http://friendpaste.com/5IA5MlRx0OZhKmsLNPMeJe, crank up the changes 
feed timeout, and add the catch-all handle_info clauses we've talked about 
before, I can successfully run the script you posted here.  We have more work 
to do, though, namely:

1) Reworking the changes feed timeout.  Currently it will trigger if there is 
no activity for X milliseconds on the connection handling the _changes feed.  
There are situations where this is actually normal, since the changes feed 
consumer is responsible for controlling the socket, and if the target is 
_really_ slow (or the documents are huge) it's quite possible that the changes 
feed will not be consulted for a long time.  I think the solution is to handle 
inactivity timeouts in couch_rep_changes_feed.erl instead of in the underlying 
ibrowse system.

2a) Attachment retry logic that handles redirects and limits the number of 
retries.  Basically, the same code as we have in couch_rep_httpc, but only 
applied until we receive the response headers.  My friendpaste above is a 
primitive form of what I'd ultimately like to see here.

2b) When an attachment body download has started and then fails, we can't 
simply retry it.  We need to do a Range request or find another way to skip the 
first N bytes of the retry.  Currently we just give up on the entire 
replication if an attachment request ever fails mid-download.
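
Items 2a and 2b could look roughly like the sketch below.  It is Python, not 
the couch_rep_httpc code, and every name in it is an assumption: `send(url)` 
is a hypothetical callable returning (status, headers) or raising OSError, 
and `fetch(offset)` is a hypothetical callable that sends 
"Range: bytes=<offset>-" when offset > 0 and returns (status, chunks).  The 
retry and redirect limits are arbitrary.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307}


def request_headers(send, url, max_retries=3, max_redirects=5):
    """2a: retry and follow redirects, but only until headers arrive."""
    failures = redirects = 0
    while True:
        try:
            status, headers = send(url)
        except OSError:
            failures += 1
            if failures > max_retries:
                raise
            continue
        if status in REDIRECT_CODES and "Location" in headers:
            redirects += 1
            if redirects > max_redirects:
                raise OSError("too many redirects")
            # Location may be relative, so resolve it against the old URL.
            url = urljoin(url, headers["Location"])
            continue
        return url, status, headers


def download_with_resume(fetch, max_retries=3):
    """2b: after a mid-body failure, resume instead of starting over."""
    received = bytearray()
    failures = 0
    while True:
        offset = len(received)
        try:
            status, chunks = fetch(offset)
            # 206 means the server honored the Range header. On a plain 200
            # the full body is sent again, so drop the bytes we already have.
            skip = 0 if status == 206 else offset
            for chunk in chunks:
                if skip:
                    drop = min(skip, len(chunk))
                    chunk = chunk[drop:]
                    skip -= drop
                received.extend(chunk)
            return bytes(received)
        except OSError:
            failures += 1
            if failures > max_retries:
                raise
```

The fallback of discarding the first N bytes covers sources that ignore 
Range, at the cost of transferring those bytes again.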

> Replication tasks crash.
> ------------------------
>
>                 Key: COUCHDB-597
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-597
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.11
>            Reporter: Robert Newson
>
> If I kick off 10 replication tasks in quick succession, occasionally one or 
> two of the replication tasks will die and not be resumed. It seems that the 
> stat tracking is a little buggy, and under stress can eventually cause a 
> permanent failure of the supervised replication task;
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [<0.80.0>] {error_report,<0.30.0>,
>     {<0.80.0>,supervisor_report,
>      [{supervisor,{local,couch_rep_sup}},
>       {errorContext,shutdown_error},
>       {reason,killed},
>       {offender,
>           [{pid,<0.6700.11>},
>            {name,"fcbb13200a1618cf983b347f4d2c9835+create_target"},
>            {mfa,
>                {gen_server,start_link,
>                    [couch_rep,
>                     ["fcbb13200a1618cf983b347f4d2c9835",
>                      {[{<<"create_target">>,true},
>                        {<<"source">>,<<"http://node:5984/perf-p2">>},
>                        {<<"target">>,<<"perf-p2">>}]},
>                      {user_ctx,null,[<<"_admin">>]}],
>                     []]}},
>            {restart_type,temporary},
>            {shutdown,1},
>            {child_type,worker}]}]}}
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process 
> <0.6705.11> with exit value: 
> {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
