[
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850335#action_12850335
]
Randall Leeds commented on COUCHDB-597:
---------------------------------------
Re-opening. Still happening on 0.11, but for different reasons.
Germain reports this from his log on the user@ list:
[Fri, 26 Mar 2010 09:55:01 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 16.0 seconds due to {error, req_timedout}
[Fri, 26 Mar 2010 09:56:13 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 32.0 seconds due to {error, req_timedout}
[Fri, 26 Mar 2010 09:57:42 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 64.0 seconds due to {error, req_timedout}
[Fri, 26 Mar 2010 09:59:40 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc post request in 128.0 seconds due to {error, req_timedout}
In my experience with this in production, I've seen PUT requests stall
here while writing the checkpoint document. I'm guessing the log above shows
the _ensure_full_commit failing (I think that's the only POST in
replication). In my logs I see 409 conflicts writing the remote checkpoint
document, but only timeouts on the receiving side of those conflicting PUTs.
I'm not sure why the conflict doesn't bubble up to couch_rep. My first guess
is that in some code path we're not asking ibrowse to stream the next chunk,
so the remote side has sent a response that we never retrieve.
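
For readers following the log excerpt above: the retry intervals (16.0, 32.0, 64.0, 128.0 seconds) double after each timeout, which is the usual exponential backoff pattern. The sketch below only illustrates that pattern. It is not the couch_rep_httpc code; the names backoff_delays and retry_with_backoff are hypothetical, and the starting delay of 16.0 simply mirrors where the excerpt begins (earlier retries may exist outside it).

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(first_delay: float = 16.0,
                   factor: float = 2.0,
                   retries: int = 4) -> Iterator[float]:
    """Yield retry delays that grow by `factor` each time.

    The defaults are assumptions chosen to match the log excerpt.
    """
    delay = first_delay
    for _ in range(retries):
        yield delay
        delay *= factor


def retry_with_backoff(request: Callable[[], T],
                       delays: Iterable[float],
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `request`; on TimeoutError, wait for the next delay and retry.

    Gives up with RuntimeError once the delays are exhausted.
    """
    try:
        return request()
    except TimeoutError as exc:
        last_error = exc
    for delay in delays:
        sleep(delay)  # wait before the next attempt
        try:
            return request()
        except TimeoutError as exc:
            last_error = exc
    raise RuntimeError("request kept timing out") from last_error
```

Note that a loop like this only helps if each attempt can actually complete. If the response is never read off the socket, as guessed above, every attempt times out no matter how long the wait.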
> Replication tasks crash.
> ------------------------
>
> Key: COUCHDB-597
> URL: https://issues.apache.org/jira/browse/COUCHDB-597
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.11
> Reporter: Robert Newson
> Fix For: 0.11
>
> Attachments:
> 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch,
> 0001-Cleanup-597-fixes.patch, 597_fixes.patch, couchdb_597.patch
>
>
> If I kick off 10 replication tasks in quick succession, occasionally one or
> two of the replication tasks will die and not be resumed. It seems that the
> stat tracking is a little buggy, and under stress can eventually cause a
> permanent failure of the supervised replication task:
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [<0.80.0>] {error_report,<0.30.0>,
>     {<0.80.0>,supervisor_report,
>      [{supervisor,{local,couch_rep_sup}},
>       {errorContext,shutdown_error},
>       {reason,killed},
>       {offender,
>        [{pid,<0.6700.11>},
>         {name,"fcbb13200a1618cf983b347f4d2c9835+create_target"},
>         {mfa,
>          {gen_server,start_link,
>           [couch_rep,
>            ["fcbb13200a1618cf983b347f4d2c9835",
>             {[{<<"create_target">>,true},
>               {<<"source">>,<<"http://node:5984/perf-p2">>},
>               {<<"target">>,<<"perf-p2">>}]},
>             {user_ctx,null,[<<"_admin">>]}],
>            []]}},
>         {restart_type,temporary},
>         {shutdown,1},
>         {child_type,worker}]}]}}
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process <0.6705.11> with exit value:
> {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}