[
https://issues.apache.org/jira/browse/COUCHDB-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678231#action_12678231
]
Adam Kocoloski commented on COUCHDB-270:
----------------------------------------
Hi Jeff, thanks for the test case -- it proved quite useful. I have a
short-term fix and a longer-term solution. I'll attach the quick fix now (still
waiting on SVN commit access). It merely sets the ibrowse timeout to infinity
and does a better check of the memory utilization when deciding whether to
flush the save_docs_buffer.
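The idea behind the quick fix -- flushing the buffer based on estimated memory use instead of a fixed document count -- can be sketched roughly as follows. This is a hypothetical illustration in Python, not the couch_rep code; the names (`DocBuffer`, `MAX_BUFFER_BYTES`) and the 10 MB threshold are made up for the example.

```python
# Sketch: flush a document buffer when its estimated memory footprint
# crosses a threshold, so a handful of huge attachments cannot exhaust
# the VM the way a purely count-based flush can. All names hypothetical.

MAX_BUFFER_BYTES = 10 * 1024 * 1024  # assumed 10 MB flush threshold


class DocBuffer:
    def __init__(self, flush_fn, max_bytes=MAX_BUFFER_BYTES):
        self.docs = []
        self.bytes = 0
        self.flush_fn = flush_fn
        self.max_bytes = max_bytes

    def add(self, doc_body: bytes):
        self.docs.append(doc_body)
        self.bytes += len(doc_body)
        # Decide on estimated size, not len(self.docs) alone.
        if self.bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.docs:
            self.flush_fn(list(self.docs))
            self.docs = []
            self.bytes = 0
```

With a count-based policy, 100 buffered docs of 20 MB each would mean ~2 GB resident; a byte-based threshold keeps the worst case bounded regardless of attachment size.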
The longer term solution will likely require scrapping the parallelized portion
of the replicator. Processing 100 documents in parallel doesn't give us enough
control over the memory utilization on the servers. I'm confident I can rewrite
the module so that it loads documents into memory "just-in-time" and maintains
the same throughput with a much lower memory footprint.
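One way to picture the "just-in-time" approach is a sliding window: rather than fetching 100 documents in parallel unconditionally, cap the number of fetched-but-unstored documents so peak memory is proportional to the cap, not the document count. The sketch below is only an illustration of that idea in Python; the cap of 4 and all function names are assumptions, not the eventual couch_rep design.

```python
# Sketch: bounded replication pipeline. At most MAX_IN_FLIGHT fetched
# documents are held in memory at any moment. Purely illustrative.
from collections import deque
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4  # assumed cap on concurrently held documents


def replicate(doc_ids, fetch, store):
    """Fetch each doc and store it, never holding more than
    MAX_IN_FLIGHT fetched documents at once."""
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        for doc_id in doc_ids:
            in_flight.append(pool.submit(fetch, doc_id))
            if len(in_flight) >= MAX_IN_FLIGHT:
                # Block on the oldest fetch before starting more,
                # so memory stays bounded by the window size.
                store(in_flight.popleft().result())
        while in_flight:
            store(in_flight.popleft().result())
```

The window preserves pipelining (throughput) while trading the unbounded buffer for a fixed one, which is the property the long-term fix needs.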
I ran your test suite between two m1.large instances on EC2 and got the
following results:
200x512k push OK
200x512k pull OK
200x1m push OK
200x1m pull OK
200x20m push OK
200x20m pull FAIL
The last one failed in "head shot" fashion -- I believe the target server
Erlang VM ran out of memory. Fixing that one will require the long-term
solution. Will keep you posted,
Adam
> Replication w/ Large Attachments Fails
> --------------------------------------
>
> Key: COUCHDB-270
> URL: https://issues.apache.org/jira/browse/COUCHDB-270
> Project: CouchDB
> Issue Type: Bug
> Components: Database Core
> Affects Versions: 0.9
> Environment: Apache CouchDB 0.9.0a748379
> Reporter: Jeff Hinrichs
> Attachments: couchdb270_Test.py
>
>
> Attempting to replicate a database with largish attachments (<= ~18MB of
> attachments in a doc, less than 200 docs) from one machine to another fails
> consistently and at the same point.
> Scenario:
> Both servers are running from HEAD, which I've been tracking for some time.
> This problem has been around as long as I've been using couch.
> Machine A holds the original database, Machine B is the server that is doing
> a PULL replication
> During the replication, Machine A starts showing the following sporadically
> in the log:
> [Fri, 27 Feb 2009 14:02:48 GMT] [debug] [<0.5902.3>] 'GET'
> /delasco-invoices/INV00652429?revs=true&attachments=true&latest=true&open_revs=["425644723"] {1,1}
> Headers: [{'Host',"192.168.2.52:5984"}]
> [Fri, 27 Feb 2009 14:02:48 GMT] [error] [<0.5901.3>] Uncaught error in
> HTTP request: {exit,normal}
> [Fri, 27 Feb 2009 14:02:48 GMT] [debug] [<0.5901.3>] Stacktrace:
> [{mochiweb_request,send,2},
> {couch_httpd,send_chunk,2},
> {couch_httpd_db,db_doc_req,3},
> {couch_httpd_db,do_db_req,2},
> {couch_httpd,handle_request,3},
> {mochiweb_http,headers,5},
> {proc_lib,init_p,5}]
> [Fri, 27 Feb 2009 14:02:48 GMT] [debug] [<0.5901.3>] HTTPd 500 error response:
> {"error":"error","reason":"normal"}
> As the replication continues, the frequency of these "Uncaught error in
> HTTP request: {exit,normal}" errors increases, until the error is being
> constantly repeated. Then Machine B stops sending requests, no more log
> output, no errors; the last thing in Machine B's log file is:
> [Fri, 27 Feb 2009 14:03:24 GMT] [info] [<0.20893.1>] retrying
> couch_rep HTTP get request due to {error, req_timedout}:
> http://192.168.2.52:5984/delasco-invoices/INV00652138?revs=true&attachments=true&latest=true&open_revs=["3070455362"]
> (URL shown decoded; the log printed it as a list of character codes)
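Erlang's default rendering of an iolist is a list of integer character codes, which is why the request URL in the log can appear as a long column of numbers. A quick way to make such output readable, sketched in Python with a shortened example list:

```python
# Decode an Erlang character-code list (as printed in the log)
# back into a readable string. The list here is a truncated example,
# not the full URL from the log.
codes = [104, 116, 116, 112, 58, 47, 47]
decoded = ''.join(chr(c) for c in codes)  # "http://"
```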
> A request for status from the couchdb init.d script returns nothing and
> checking the processes returns:
> (demo-couchdb)j...@mars:~/projects/venvs/demo-couchdb/src$ ps ax|grep cou
> 29281 pts/2 S+ 0:00 grep cou
> (demo-couchdb)j...@mars:~/projects/venvs/demo-couchdb/src$ ps ax|grep beam
> 29305 pts/2 R+ 0:00 grep beam
> In fact, couch has gone away completely on Machine B; its death is so
> quick it can't even say why.
> Attempts to incrementally replicate after the first failure die at exactly
> the same place.
> I can replicate this same database on the same machine from one database to
> another without issue. I can dump and reload the database with no problems.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.