[
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adam Kocoloski updated COUCHDB-597:
-----------------------------------
Hi Robert, I can reproduce the crashes locally and I've discovered why they
happen independently of the {ref(), integer()} problem. The basic issue is
that attachment downloads do not employ the same retry checks that we do for
regular document GETs. For instance, the attachment receiver process
associated with a replication would wait indefinitely for response headers
when in fact it had an error message in its mailbox informing it that
the request had failed. Eventually the changes feed times out and the
replication crashes.
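
To illustrate the failure mode, here is a minimal sketch in Python (not the actual CouchDB Erlang code). It assumes a hypothetical mailbox of (kind, payload) messages; the names await_headers, RequestFailed and HeadersTimeout are invented for this example. The point is only that the receiver should react to an error message, and bound its wait, rather than block forever on headers that will never arrive.

```python
import queue


class RequestFailed(Exception):
    """Raised when the mailbox holds an error instead of headers."""


class HeadersTimeout(Exception):
    """Raised when nothing arrives within the allowed time."""


def await_headers(mailbox, timeout_s):
    """Wait for response headers on a hypothetical (kind, payload) mailbox.

    An "error" message is surfaced immediately instead of being left
    unread, and the wait is bounded by timeout_s seconds.
    """
    try:
        kind, payload = mailbox.get(timeout=timeout_s)
    except queue.Empty:
        raise HeadersTimeout(f"no headers after {timeout_s}s")
    if kind == "headers":
        return payload
    if kind == "error":
        raise RequestFailed(str(payload))
    raise RequestFailed(f"unexpected message: {kind!r}")
```

A caller that catches RequestFailed or HeadersTimeout can then decide whether to retry the request, which is what the regular document GET path already does.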
If I apply http://friendpaste.com/5IA5MlRx0OZhKmsLNPMeJe, crank up the changes
feed timeout, and add the catchall handle_infos we've talked about before, I can
successfully run the script you posted here. We have more work to do, though,
namely:
1) Reworking the changes feed timeout. Currently it will trigger if there is
no activity for X milliseconds on the connection handling the _changes feed.
There are situations where this is actually normal, since the changes feed
consumer is responsible for controlling the socket, and if the target is
_really_ slow (or the documents are huge) it's quite possible that the changes
feed will not be consulted for a long time. I think the solution is to handle
inactivity timeouts in couch_rep_changes_feed.erl instead of in the underlying
ibrowse system.
2a) Attachment retry logic that handles redirects and limits the number of
retries. Basically, the same code as we have in couch_rep_httpc, but only
applied until we receive the response headers. My friendpaste above is a
primitive form of what I'd ultimately like to see here.
2b) When an attachment body download has started and then fails, we can't
simply retry it. We need to do a Range request or find another way to skip the
first N bytes of the retry. Currently we just give up on the entire
replication if an attachment request ever fails mid-download.
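
Items 2a and 2b could look roughly like the following Python sketch. It is not the couch_rep_httpc code or the friendpaste patch: fetch is a hypothetical callable returning (status, headers, chunk iterator), and the retry and redirect limits are made-up defaults. It follows redirects, caps the number of retries, and on a mid-download failure asks for the remaining bytes with a Range header instead of starting over.

```python
def range_header(bytes_received):
    """Build a Range header that skips the bytes we already have."""
    if bytes_received > 0:
        return {"Range": f"bytes={bytes_received}-"}
    return {}


def download_attachment(fetch, url, max_retries=3, max_redirects=5):
    """Download a body with bounded retries, redirects, and Range resume.

    fetch(url, headers) is a hypothetical function returning
    (status, response_headers, iterator_of_byte_chunks); the iterator
    may raise ConnectionError if the connection drops mid-body.
    """
    received = bytearray()
    retries = 0
    redirects = 0
    while True:
        try:
            status, headers, chunks = fetch(url, range_header(len(received)))
            if status in (301, 302, 303, 307):
                redirects += 1
                if redirects > max_redirects:
                    raise RuntimeError("too many redirects")
                url = headers["Location"]
                continue
            if status == 200:
                # Full body (Range not sent or ignored): drop partial data.
                received.clear()
            elif status != 206:
                raise RuntimeError(f"unexpected status {status}")
            for chunk in chunks:
                received.extend(chunk)
            return bytes(received)
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
```

Falling back to a full re-download when the server answers 200 keeps the sketch correct against sources that ignore Range; whether that is acceptable for very large attachments is a separate design question.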
> Replication tasks crash.
> ------------------------
>
> Key: COUCHDB-597
> URL: https://issues.apache.org/jira/browse/COUCHDB-597
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.11
> Reporter: Robert Newson
>
> If I kick off 10 replication tasks in quick succession, occasionally one or
> two of the replication tasks will die and not be resumed. It seems that the
> stat tracking is a little buggy, and under stress can eventually cause a
> permanent failure of the supervised replication task:
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [<0.80.0>] {error_report,<0.30.0>,
>     {<0.80.0>,supervisor_report,
>      [{supervisor,{local,couch_rep_sup}},
>       {errorContext,shutdown_error},
>       {reason,killed},
>       {offender,
>        [{pid,<0.6700.11>},
>         {name,"fcbb13200a1618cf983b347f4d2c9835+create_target"},
>         {mfa,
>          {gen_server,start_link,
>           [couch_rep,
>            ["fcbb13200a1618cf983b347f4d2c9835",
>             {[{<<"create_target">>,true},
>               {<<"source">>,<<"http://node:5984/perf-p2">>},
>               {<<"target">>,<<"perf-p2">>}]},
>             {user_ctx,null,[<<"_admin">>]}],
>            []]}},
>         {restart_type,temporary},
>         {shutdown,1},
>         {child_type,worker}]}]}}
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process
> <0.6705.11> with exit value:
> {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}