[
https://issues.apache.org/jira/browse/COUCHDB-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941124#comment-14941124
]
Nick Vatamaniuc commented on COUCHDB-2833:
------------------------------------------
This is the output log from the failing test case, with debug
statements added throughout the replicator and the test.
{code}
%% couch_replicator_large_atts_tests.erl changes
%%
%% -define(ATT_SIZE_1, 65537).
%% -define(ATT_SIZE_2, 1).
%% -define(DOCS_COUNT, 2).
%% ...
%% large_atts_test_() ->
%% Pairs = [
%% %{local, local},
%% {local, remote}
%% %{remote, local},
%% %{remote, remote}
%% ],
$ rebar -r eunit apps=couch_replicator tests=large_atts_test_
skip_deps=couch_epi
==> couch_replicator (eunit)
Compiled test/couch_replicator_test_helper.erl
/Users/nvatama/asf/couchdb/src/couch_replicator/test/couch_replicator_large_atts_tests.erl:124:
Warning: variable 'Res' is unused
Compiled test/couch_replicator_large_atts_tests.erl
Compiled test/couch_replicator_missing_stubs_tests.erl
Running test function(s):
couch_replicator_large_atts_tests:large_atts_test_/0
======================== EUnit ========================
01:12:38.699 [notice] config: [attachments] compressible_types set to text/*
for reason nil
01:12:38.699 [info] Application couch_replicator started on node nonode@nohost
test/couch_replicator_large_atts_tests.erl:81:<0.166.0>: $xxxxxx From local
<<"eunit-test-db-1443762758699299">> ==> remote {remote,
<<"eunit-test-db-1443762758720227">>}
test/couch_replicator_large_atts_tests.erl:98:<0.166.0>:
@@@@@@@ replicating <<"eunit-test-db-1443762758699299">> >
<<"http://127.0.0.1:58985/eunit-test-db-1443762758720227">>
couch_replicator_large_atts_tests:91:
should_populate_source...test/couch_replicator_test_helper.erl:98:<0.446.0>:
@@@@@@@@@ Replicate <<"eunit-test-db-1443762758699299">> =>
<<"http://127.0.0.1:58985/eunit-test-db-1443762758720227">>
test/couch_replicator_test_helper.erl:106:<0.446.0>: @@@@ Rep
{rep,{"9232f5f1b3ccb9fe9d7b781873a147cf",[]},
<<"eunit-test-db-1443762758699299">>,
{httpdb,"http://127.0.0.1:58985/eunit-test-db-1443762758720227/",
nil,
[{"Accept","application/json"},
{"User-Agent","CouchDB-Replicator/4ca9e41"}],
30000,
[{socket_options,[{keepalive,true},{nodelay,false}]}],
10,250,nil,20},
[{checkpoint_interval,5000},
{connection_timeout,30000},
{http_connections,20},
{retries,10},
{socket_options,[{keepalive,true},{nodelay,false}]},
{use_checkpoints,true},
{worker_batch_size,500},
{worker_processes,4}],
{user_ctx,null,[<<"_admin">>],undefined},
db,nil,null,null}
01:13:08.813 [notice] !!!!!! async_replicate "9232f5f1b3ccb9fe9d7b781873a147cf"
[] <<"eunit-test-db-1443762758699299">> ->
{httpdb,"http://127.0.0.1:58985/eunit-test-db-1443762758720227/",nil,[{"Accept","application/json"},{"User-Agent","CouchDB-Replicator/4ca9e41"}],30000,[{socket_options,[{keepalive,true},{nodelay,false}]}],10,250,nil,20}
tout: 30000
01:13:08.813 [notice] starting new replication
`9232f5f1b3ccb9fe9d7b781873a147cf` at <0.693.0>
(`eunit-test-db-1443762758699299` ->
`http://127.0.0.1:58985/eunit-test-db-1443762758720227/`)
=PROGRESS REPORT==== 2-Oct-2015::01:13:08 ===
supervisor: {local,couch_replicator_job_sup}
started: [{pid,<0.693.0>},
{name,"9232f5f1b3ccb9fe9d7b781873a147cf"},
{mfargs,
{gen_server,start_link,
[couch_replicator,
{rep,
{"9232f5f1b3ccb9fe9d7b781873a147cf",[]},
<<"eunit-test-db-1443762758699299">>,
{httpdb,
"http://127.0.0.1:58985/eunit-test-db-1443762758720227/",
nil,
[{"Accept","application/json"},
{"User-Agent","CouchDB-Replicator/4ca9e41"}],
30000,
[{socket_options,
[{keepalive,true},{nodelay,false}]}],
10,250,nil,20},
[{checkpoint_interval,5000},
{connection_timeout,30000},
{http_connections,20},
{retries,10},
{socket_options,
[{keepalive,true},{nodelay,false}]},
{use_checkpoints,true},
{worker_batch_size,500},
{worker_processes,4}],
{user_ctx,null,[<<"_admin">>],undefined},
db,nil,null,null},
[{timeout,30000}]]}},
{restart_type,temporary},
{shutdown,250},
{child_type,worker}]
=PROGRESS REPORT==== 2-Oct-2015::01:13:08 ===
supervisor: {local,inet_gethost_native_sup}
started: [{pid,<0.697.0>},{mfa,{inet_gethost_native,init,[[]]}}]
=PROGRESS REPORT==== 2-Oct-2015::01:13:08 ===
supervisor: {local,kernel_safe_sup}
started: [{pid,<0.696.0>},
{name,inet_gethost_native_sup},
{mfargs,{inet_gethost_native,start_link,[]}},
{restart_type,temporary},
{shutdown,1000},
{child_type,worker}]
01:13:08.879 [notice] 127.0.0.1 - - GET /eunit-test-db-1443762758720227/ 200
01:13:08.881 [notice] 127.0.0.1 - - GET /eunit-test-db-1443762758720227/ 200
01:13:08.882 [notice] 127.0.0.1 - - GET
/eunit-test-db-1443762758720227/_local/9232f5f1b3ccb9fe9d7b781873a147cf 404
01:13:08.882 [notice] 127.0.0.1 - - GET
/eunit-test-db-1443762758720227/_local/c1b4431ba39c56ce3667d7e4e24deb24 404
01:13:08.883 [notice] 127.0.0.1 - - GET
/eunit-test-db-1443762758720227/_local/ce13e7d20a5667809215382ecddd4e6a 404
01:13:08.891 [notice] Replication `"9232f5f1b3ccb9fe9d7b781873a147cf"` is using:
4 worker processes
a worker batch size of 500
20 HTTP connections
a connection timeout of 30000 milliseconds
10 retries per request
socket options are: [{keepalive,true},{nodelay,false}]
01:13:08.891 [notice] Worker pids are: [<0.704.0>,<0.705.0>,<0.706.0>,<0.707.0>]
01:13:08.892 [notice] 127.0.0.1 - - POST
/eunit-test-db-1443762758720227/_revs_diff 200
01:13:08.892 [notice] Worker flushing doc with attachments local_doc_handler
<0.704.0>
{doc,<<"doc1">>,{1,[<<213,87,17,46,228,19,61,125,81,66,181,76,27,94,144,45>>]},{[]},[{att,<<"att1">>,<<"text/plain">>,65574,65536,<<234,108,210,5,238,127,62,203,190,26,30,150,49,168,204,53>>,1,{<0.440.0>,[{4240,4000},{8245,12000},{20251,4000},{24256,12000},{36263,4000},{40268,12000},{52275,17574}]},gzip},{att,<<"att2">>,<<"app/binary">>,65537,65537,<<203,17,75,208,5,141,155,169,198,203,140,229,153,12,127,210>>,1,{<0.440.0>,[{69858,8192},{78056,8192},{86254,8192},{94452,8192},{102650,8192},{110848,8192},{119046,8192},{127244,8192},{135442,1}]},identity}],false,[]}
[]
01:13:08.927 [notice] 127.0.0.1 - - PUT
/eunit-test-db-1443762758720227/doc1?new_edits=false 201
01:13:08.928 [notice] Worker flushing doc with attachments local_doc_handler
<0.704.0>
{doc,<<"doc2">>,{1,[<<205,189,248,163,192,229,245,166,243,179,143,177,110,124,227,33>>]},{[]},[{att,<<"att1">>,<<"text/plain">>,65574,65536,<<98,82,170,248,169,62,233,38,88,5,93,115,255,138,151,22>>,1,{<0.440.0>,[{135447,4000},{139452,12000},{151458,4000},{155463,12000},{167470,4000},{171475,12000},{183482,17574}]},gzip},{att,<<"att2">>,<<"app/binary">>,65537,65537,<<64,190,238,174,102,16,103,67,66,184,107,184,25,130,126,183>>,1,{<0.440.0>,[{201065,8192},{209263,8192},{217461,8192},{225659,8192},{233857,8192},{242055,8192},{250253,8192},{258451,8192},{266649,1}]},identity}],false,[]}
[]
01:13:08.929 [error] Replicator, request PUT to
"http://127.0.0.1:58985/eunit-test-db-1443762758720227/doc2?new_edits=false"
failed due to error {error,connection_closing}
01:13:08.929 [error] Worker <0.704.0> died with reason:
{http_request_failed,"PUT","http://127.0.0.1:58985/eunit-test-db-1443762758720227/doc2?new_edits=false",{error,{error,connection_closing}}}
test/couch_replicator_test_helper.erl:116:<0.446.0>: @@@@@@ Replication
{worker_died,<0.704.0>,
{http_request_failed,
"PUT",
"http://127.0.0.1:58985/eunit-test-db-1443762758720227/doc2?new_edits=false",
{error,
{error,
connection_closing}}}}
01:13:08.929 [error] Replication `9232f5f1b3ccb9fe9d7b781873a147cf`
(`eunit-test-db-1443762758699299` ->
`http://127.0.0.1:58985/eunit-test-db-1443762758720227/`) failed:
{worker_died,<0.704.0>,
{http_request_failed,"PUT",
"http://127.0.0.1:58985/eunit-test-db-1443762758720227/doc2?new_edits=false",
{error,{error,connection_closing}}}}
01:13:08.931 [error] gen_server <0.693.0> terminated with reason:
{worker_died,<0.704.0>,{http_request_failed,"PUT","http://127.0.0.1:58985/eunit-test-db-1443762758720227/doc2?new_edits=false",{error,{error,connection_closing}}}}
[30.135 s] ok
couch_replicator_large_atts_tests:106:
should_compare_databases...test/couch_replicator_test_helper.erl:15:<0.446.0>:
=CRASH REPORT==== 2-Oct-2015::01:13:08 ===
crasher:
initial call: couch_replicator:init/1
pid: <0.693.0>
registered_name: []
exception exit: {worker_died,<0.704.0>,
{http_request_failed,"PUT",
"http://127.0.0.1:58985/eunit-test-db-1443762758720227/doc2?new_edits=false",
{error,{error,connection_closing}}}}
in function gen_server:terminate/7 (gen_server.erl, line 804)
ancestors: [couch_replicator_job_sup,couch_replicator_sup,<0.424.0>]
messages: [{'EXIT',<0.700.0>,normal}]
links: [<0.438.0>,<0.703.0>]
dictionary: [{task_status_props,
[{changes_pending,2},
{checkpoint_interval,5000},
{checkpointed_source_seq,0},
{continuous,false},
{database,null},
{doc_id,null},
{doc_write_failures,0},
{docs_read,0},
{docs_written,0},
{missing_revisions_found,0},
{replication_id,<<"9232f5f1b3ccb9fe9d7b781873a147cf">>},
{revisions_checked,0},
{source,<<"eunit-test-db-1443762758699299">>},
{source_seq,0},
{started_on,1443762788},
{target,
<<"http://127.0.0.1:58985/eunit-test-db-1443762758720227/">>},
{through_seq,0},
{type,replication},
{updated_on,1443762788},
{user,null}]},
{ibrowse_stream_status,init},
{pending_count_state,{{1443,762788,891640},2}},
{task_status_update,{{0,0,0},1000000}}]
trap_exit: true
status: running
heap_size: 4185
stack_size: 27
reductions: 10372
neighbours:
neighbour: [{pid,<0.703.0>},
{registered_name,[]},
{initial_call,{erlang,apply,2}},
{current_function,
{couch_replicator,changes_manager_loop_open,4}},
{ancestors,[]},
{messages,[]},
{links,[<0.693.0>]},
{dictionary,[]},
{trap_exit,false},
{status,waiting},
{heap_size,233},
{stack_size,8},
{reductions,55}]
=SUPERVISOR REPORT==== 2-Oct-2015::01:13:08 ===
Supervisor: {local,couch_replicator_job_sup}
Context: child_terminated
Reason: {worker_died,<0.704.0>,
{http_request_failed,"PUT",
"http://127.0.0.1:58985/eunit-test-db-1443762758720227/doc2?new_edits=false",
{error,{error,connection_closing}}}}
Offender: [{pid,<0.693.0>},
{name,"9232f5f1b3ccb9fe9d7b781873a147cf"},
{mfargs,{gen_server,start_link,undefined}},
{restart_type,temporary},
{shutdown,250},
{child_type,worker}]
*failed*
in function couch_replicator_test_helper:'-compare_dbs/2-fun-17-'/5
(test/couch_replicator_test_helper.erl, line 16)
in call from couch_btree:stream_kv_node2/8 (src/couch_btree.erl, line 783)
in call from couch_btree:fold/4 (src/couch_btree.erl, line 217)
in call from couch_db:enum_docs/5 (src/couch_db.erl, line 1251)
in call from couch_replicator_test_helper:compare_dbs/2
(test/couch_replicator_test_helper.erl, line 61)
**error:{badmatch,{not_found,missing}}
01:13:08.943 [error] CRASH REPORT Process <0.693.0> with 1 neighbours exited
with reason:
{worker_died,<0.704.0>,{http_request_failed,"PUT","http://127.0.0.1:58985/eunit-test-db-1443762758720227/doc2?new_edits=false",{error,{error,connection_closing}}}}
in gen_server:terminate/7 line 804
[done in 30.198 s]
01:13:08.943 [error] Supervisor couch_replicator_job_sup had child
"9232f5f1b3ccb9fe9d7b781873a147cf" started with
{gen_server,start_link,undefined} at <0.693.0> exit with reason
{worker_died,<0.704.0>,{http_request_failed,"PUT","http://127.0.0.1:58985/eunit-test-db-1443762758720227/doc2?new_edits=false",{error,{error,connection_closing}}}}
in context child_terminated
01:13:08.944 [info] Application couch_replicator exited with reason: stopped
01:13:08.945 [info] Application mem3 exited with reason: stopped
01:13:08.945 [info] Application rexi exited with reason: stopped
01:13:08.946 [info] Application couch exited with reason: stopped
01:13:08.946 [info] Application couch_stats exited with reason: stopped
01:13:08.947 [info] Application folsom exited with reason: stopped
01:13:08.947 [info] Application ioq exited with reason: stopped
01:13:08.947 [info] Application b64url exited with reason: stopped
01:13:08.947 [info] Application oauth exited with reason: stopped
01:13:08.947 [info] Application mochiweb exited with reason: stopped
01:13:08.947 [info] Application xmerl exited with reason: stopped
01:13:08.948 [info] Application os_mon exited with reason: stopped
01:13:08.948 [info] Application sasl exited with reason: stopped
01:13:08.948 [info] Application couch_event exited with reason: stopped
01:13:08.948 [info] Application couch_log exited with reason: stopped
01:13:08.948 [info] Application couch_log_lager exited with reason: stopped
01:13:08.948 [info] Application khash exited with reason: stopped
01:13:08.948 [info] Application couch_epi exited with reason: stopped
01:13:08.949 [info] Application config exited with reason: stopped
01:13:08.949 [info] Application ssl exited with reason: stopped
01:13:08.949 [info] Application public_key exited with reason: stopped
01:13:08.949 [info] Application asn1 exited with reason: stopped
01:13:08.949 [info] Application ibrowse exited with reason: stopped
01:13:08.949 [info] Application inets exited with reason: stopped
=INFO REPORT==== 2-Oct-2015::01:13:08 ===
application: lager
exited: stopped
type: temporary
=INFO REPORT==== 2-Oct-2015::01:13:08 ===
application: goldrush
exited: stopped
type: temporary
[done in 30.198 s]
=======================================================
=INFO REPORT==== 2-Oct-2015::01:13:08 ===
application: compiler
exited: stopped
type: temporary
Failed: 1. Skipped: 0. Passed: 2.
=INFO REPORT==== 2-Oct-2015::01:13:08 ===
application: syntax_tools
exited: stopped
type: temporary
ERROR: One or more eunit tests failed.
ERROR: eunit failed while processing
/Users/nvatama/asf/couchdb/src/couch_replicator: rebar_abort
{code}
> Replicator client doesn't handle unexpectedly closed pipelined connections
> well
> --------------------------------------------------------------------------------
>
> Key: COUCHDB-2833
> URL: https://issues.apache.org/jira/browse/COUCHDB-2833
> Project: CouchDB
> Issue Type: Bug
> Security Level: public(Regular issues)
> Components: Database Core
> Reporter: Nick Vatamaniuc
>
> This was found while investigating the failure of the replication tests,
> specifically the {local, remote} sub-case of
> couch_replicator_large_atts_tests.
> The test sets up a push replication from a local source to a remote target.
> The replication workers have more than one document larger than
> MAX_BULK_ATT_SIZE=64K. They start pushing them to the target over a
> keep-alive connection (the default for HTTP/1.1). The first few pipelined
> requests go through on the same connection; the server accepts the first
> PUT to .../docid?new_edits=false, then returns Connection: close and closes
> the connection after the 201 Created response. The workers don't expect
> that and attempt another PUT on the same connection, then crash on
> ibrowse's connection_closing error, which they don't handle. That causes
> the whole async replication process to exit.
> There are potentially two issues here.
> First, the couch_replicator_httpc layer needs to handle this case better:
> on a connection_closing error it should shut down the socket immediately
> and then retry. (Retrying without shutting the socket down means retrying
> for roughly five seconds or more until something cleans up the stale
> connection state.)
> Adding this clause to couch_replicator_httpc.erl seems to do the trick:
> {code}
> process_response({error, connection_closing}, Worker, HttpDb, Params, _Cb) ->
>     couch_log:notice("Connection closed by server. Closing the socket "
>                      "and trying again", []),
>     ibrowse_http_client:stop(Worker),
>     throw({retry, HttpDb, Params});
> {code}
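> The retry path above can also be sketched generically (illustrative only,
> not CouchDB code; request_with_retry and its Fun argument are hypothetical
> names): the point is that connection_closing is treated as immediately
> retryable rather than as a fatal worker error.
> {code}
> %% Illustrative sketch: retry a request function at once when the
> %% server closed a pipelined connection, instead of letting the
> %% caller crash or waiting for stale connection state to be cleaned up.
> request_with_retry(_Fun, 0) ->
>     {error, retries_exhausted};
> request_with_retry(Fun, Retries) when Retries > 0 ->
>     case Fun() of
>         {error, connection_closing} ->
>             %% Server closed the connection; retry on a fresh one.
>             request_with_retry(Fun, Retries - 1);
>         Other ->
>             Other
>     end.
> {code}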
> The second issue is that the server should not close the connection after
> the first PUT to .../db/docid?new_edits=false when pipelined connections
> are in use.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)