[jira] Commented: (COUCHDB-597) Replication tasks crash.

Randall Leeds (JIRA) Fri, 26 Mar 2010 13:24:50 -0700

    [ 
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850335#action_12850335
 ]


Randall Leeds commented on COUCHDB-597:
---------------------------------------

Re-opening. Still happening on 0.11, but for different reasons.

Germain reports this from his log on the user@ list:

[Fri, 26 Mar 2010 09:55:01 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc 
post request in 16.0 seconds due to {error, req_timedout}
[Fri, 26 Mar 2010 09:56:13 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc 
post request in 32.0 seconds due to {error, req_timedout}
[Fri, 26 Mar 2010 09:57:42 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc 
post request in 64.0 seconds due to {error, req_timedout}
[Fri, 26 Mar 2010 09:59:40 GMT] [debug] [<0.2466.0>] retrying couch_rep_httpc 
post request in 128.0 seconds due to {error, req_timedout}
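
As a rough reading of the log above, the following sketch checks that the announced delay doubles on each retry and estimates how long each retried request ran before timing out. The timestamps and delays are copied from the log; the assumption that the delay starts counting at the moment the line is logged is mine, and nothing here reflects how couch_rep_httpc computes its delays internally.

```python
from datetime import datetime

# (time of the log line, announced retry delay in seconds), copied from the log.
entries = [
    ("09:55:01", 16.0),
    ("09:56:13", 32.0),
    ("09:57:42", 64.0),
    ("09:59:40", 128.0),
]


def attempt_durations(entries):
    """Seconds each retried request ran before the next timeout was logged.

    Assumption: the announced delay starts when the line is logged, so the
    request time is (gap to the next log line) minus (announced delay).
    """
    times = [datetime.strptime(t, "%H:%M:%S") for t, _ in entries]
    durations = []
    for i in range(len(entries) - 1):
        gap = (times[i + 1] - times[i]).total_seconds()
        durations.append(gap - entries[i][1])
    return durations


delays = [delay for _, delay in entries]
# Ratio between consecutive delays; 2.0 everywhere means the delay doubles.
ratios = [later / earlier for earlier, later in zip(delays, delays[1:])]

print(ratios)                      # [2.0, 2.0, 2.0]
print(attempt_durations(entries))  # [56.0, 57.0, 54.0]
```

Under that assumption, each retried POST appears to run for a little under a minute before req_timedout is reported.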

In production I have seen PUT requests stall at this point while writing the 
checkpoint document. I'm guessing the log above shows the _ensure_full_commit 
failing, since I think that is the only POST in replication. In my logs I see 
409 conflicts when writing the remote checkpoint document, but only timeouts 
on the receiving side of those conflicting PUTs. I'm not sure why the conflict 
doesn't bubble up to couch_rep. My first guess is that some code path never 
asks ibrowse to stream the next chunk, so the remote side has sent a response 
that we never retrieve.
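
The guess above can be pictured with a toy model. This is only a sketch of the hypothesis, not ibrowse or couch_rep code: the class OnceStream, its stream_next method, and read_response are hypothetical names invented for illustration. It models a pull-based stream where a chunk is delivered only when the caller asks for it, so a code path that forgets to ask sees a timeout even though the remote side already answered.

```python
from collections import deque


class OnceStream:
    """Toy stand-in for a pull-based response stream (hypothetical, not ibrowse)."""

    def __init__(self, chunks):
        self._pending = deque(chunks)  # what the remote side has already sent
        self.delivered = []            # what the caller has actually seen

    def stream_next(self):
        """Deliver exactly one pending chunk, if there is one."""
        if self._pending:
            self.delivered.append(self._pending.popleft())


def read_response(stream, max_polls, ask_for_next=True):
    """Return the first delivered chunk, or 'req_timedout' if none arrives."""
    for _ in range(max_polls):
        if ask_for_next:
            stream.stream_next()
        if stream.delivered:
            return stream.delivered[0]
    return "req_timedout"


# Same remote response in both cases; only the caller's behaviour differs.
good = read_response(OnceStream(["409 Conflict"]), max_polls=3)
stalled = read_response(OnceStream(["409 Conflict"]), max_polls=3,
                        ask_for_next=False)
print(good)     # 409 Conflict
print(stalled)  # req_timedout
```

If something like this is happening, the 409 would be visible on the remote side while the caller only ever logs a timeout, which matches what I described above.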

> Replication tasks crash.
> ------------------------
>
>                 Key: COUCHDB-597
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-597
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.11
>            Reporter: Robert Newson
>             Fix For: 0.11
>
>         Attachments: 
> 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch, 
> 0001-Cleanup-597-fixes.patch, 597_fixes.patch, couchdb_597.patch
>
>
> If I kick off 10 replication tasks in quick succession, occasionally one or 
> two of the replication tasks will die and not be resumed. It seems that the 
> stat tracking is a little buggy, and under stress can eventually cause a 
> permanent failure of the supervised replication task;
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [<0.80.0>] {error_report,<0.30.0>,
>     {<0.80.0>,supervisor_report,
>      [{supervisor,{local,couch_rep_sup}},
>       {errorContext,shutdown_error},
>       {reason,killed},
>       {offender,
>           [{pid,<0.6700.11>},
>            {name,"fcbb13200a1618cf983b347f4d2c9835+create_target"},
>            {mfa,
>                {gen_server,start_link,
>                    [couch_rep,
>                     ["fcbb13200a1618cf983b347f4d2c9835",
>                      {[{<<"create_target">>,true},
>                        {<<"source">>,<<"http://node:5984/perf-p2">>},
>                        {<<"target">>,<<"perf-p2">>}]},
>                      {user_ctx,null,[<<"_admin">>]}],
>                     []]}},
>            {restart_type,temporary},
>            {shutdown,1},
>            {child_type,worker}]}]}}
> [Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process 
> <0.6705.11> with exit value: 
> {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
