[
https://issues.apache.org/jira/browse/COUCHDB-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231206#comment-14231206
]
Gunther Gruber commented on COUCHDB-2484:
-----------------------------------------
No, some of the replication tasks (there are about 80) do not restart. The
current fix on the backup system is to restart CouchDB every hour. An earlier
attempt that proved successful was to set up the replication of each database
one by one, waiting until each finished before starting the next.
On the new system we only had the problem with the second server, whose
hardware is less powerful than the other one's. One thing I noticed is I/O wait
going up to 50-80%, which I suppose is the bottleneck but not the root cause.
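
The one-by-one approach described above could be scripted roughly as follows.
This is only a sketch, not what was actually run: it assumes one-shot
(non-continuous) replications triggered through CouchDB's POST /_replicate
endpoint, which returns once the replication has finished. The server URLs,
database names, and the helper names post_json and replicate_sequentially are
hypothetical.

```python
import json
import urllib.request


def post_json(url, payload, timeout=None):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )
    # timeout=None blocks until the server answers, i.e. until the
    # one-shot replication is done.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def replicate_sequentially(local, source, target, databases, post=post_json):
    """Replicate each database in turn; start the next only after the
    previous request has returned (or failed)."""
    results = {}
    for db in databases:
        body = {"source": f"{source}/{db}", "target": f"{target}/{db}"}
        try:
            results[db] = post(f"{local}/_replicate", body)
        except Exception as exc:
            # Record the failure and move on to the next database.
            results[db] = {"ok": False, "error": str(exc)}
    return results


# Hypothetical usage (placeholder hosts and database names):
# replicate_sequentially("http://localhost:5984",
#                        "http://source.example:5984",
#                        "http://localhost:5984",
#                        ["db_one", "db_two"])
```

Running the databases strictly in sequence keeps only one replication active
at a time, which may help on a server that is already limited by I/O wait.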
> replication crashes
> -------------------
>
> Key: COUCHDB-2484
> URL: https://issues.apache.org/jira/browse/COUCHDB-2484
> Project: CouchDB
> Issue Type: Bug
> Security Level: public(Regular issues)
> Components: Database Core
> Affects Versions: 1.x.x
> Reporter: Gunther Gruber
>
> We are using CouchDB version 1.2.0 with 8.3T of data; the biggest database is
> 2.1T. At the moment we are switching to new hardware with more storage space.
> We copied the files with rsync and started the replication.
> One system is already in sync, the other is doing the replication.
> I appreciate that besides the errors in the log, the first system is now in
> sync.
> The log looks like the following:
> Retrying POST request to http://replication:XXXX/database/_revs_diff in 0.5
> seconds due to error req_timedout
> and then
> [Mon, 01 Dec 2014 13:00:28 GMT] [error] [<0.27044.1>] ** Generic server
> <0.27044.1> terminating
> ** Last message in was {'EXIT',<0.26965.1>,killed}
> ** When Server state == {state,<0.26965.1>,<0.27045.1>,40,
> {httpdb,
> "http://replication:[email protected]/sm_chemie/",
> nil,
> [{"Accept","application/json"},
> {"User-Agent","CouchDB/1.2.0"}],
> 30000,
> [{socket_options,
> [{recbuf,262144},
> {sndbuf,262144},
> {nodelay,true},
> {keepalive,true}]}],
> 10,250,<0.26966.1>,40},
> {httpdb,
> "http://replication:XXX@XXX:5984/sm_chemie/",
> nil,
> [{"Accept","application/json"},
> {"User-Agent","CouchDB/1.2.0"}],
> 30000,
> [{socket_options,
> [{recbuf,262144},
> {sndbuf,262144},
> {nodelay,true},
> {keepalive,true}]}],
> 10,250,<0.26968.1>,40},
> [],nil,nil,nil,
> {rep_stats,0,0,0,0,0},
> nil,nil,
> {batch,[],0}}
> ** Reason for termination ==
> ** killed
> [Mon, 01 Dec 2014 13:00:28 GMT] [error] [<0.27042.1>] {error_report,<0.31.0>,
> {<0.27042.1>,crash_report,
> [[{initial_call,
> {couch_replicator_worker,init,['Argument__1']}},
> {pid,<0.27042.1>},
> {registered_name,[]},
> {error_info,
> {exit,killed,
> [{gen_server,terminate,6,
> [{file,"gen_server.erl"},{line,747}]},
> {proc_lib,init_p_do_apply,3,
> [{file,"proc_lib.erl"},{line,227}]}]}},
> {ancestors,
> [<0.26965.1>,couch_rep_sup,couch_primary_services,
> couch_server_sup,<0.32.0>]},
> {messages,[]},
> {links,[<0.27043.1>]},
> {dictionary,
> [{last_stats_report,{1417,438797,704976}}]},
> {trap_exit,true},
> {status,running},
> {heap_size,377},
> {stack_size,24},
> {reductions,372}],
> []]}}
> It seems to me like a timeout, after which the replication task exits. I
> already played around with the configuration settings with no success. I can
> provide more information if needed.
> /etc/couchdb/local.d/001-user_config.ini
> [couchdb]
> file_compression = snappy
> max_dbs_open = 400
> [httpd]
> bind_address = ::
> server_options = [{backlog, 128}, {acceptor_pool_size, 16}]
> socket_options = [{recbuf, 262144}, {sndbuf, 262144}, {nodelay, true},
> {keepalive, true}]
> [couch_httpd_auth]
> secret =
> [log_level_by_module]
> couch_httpd = warning
> couch_replicator = debug
> couch_query_servers = warning
> [daemons]
> httpsd = {couch_httpd, start_link, [https]}
> [ssl]
> cert_file = /etc/couchdb/ssl/certs/couchdb-couch1.prime.adns.de.pem
> key_file = /etc/couchdb/ssl/private/couchdb-couch1.prime.adns.de.pem
> verify_ssl_certificates = false
> [replicator]
> worker_batch_size = 2000
> worker_processes = 40
> http_connections = 40
> socket_options = [{recbuf, 262144}, {sndbuf, 262144}, {nodelay, true},
> {keepalive, true}]
> /etc/default/couchdb
> # Sourced by init script for configuration.
> COUCHDB_USER=couchdb
> COUCHDB_STDOUT_FILE=/dev/null
> COUCHDB_STDERR_FILE=/dev/null
> COUCHDB_RESPAWN_TIMEOUT=5
> COUCHDB_OPTIONS=
> # 32 Threads to handle I/O
> export ERL_FLAGS="+A 32"
> # 8192 open files
> export ERL_MAX_PORTS=8192
> ulimit -n 8192
> The current solution is to restart CouchDB every other hour.