Hi devs, I spent a good bit of time over the last two days on
attachment replication. I started with pull replication since I had a
pretty clear idea of what I wanted to do there:
a) stop inlining attachments in the JSON document body
b) map over the attachment stubs in the source document, submitting an
async HTTP request for each,
c) replace the stub with a function that's streaming-API-compatible
(meaning it looks like F() -> binary() and can be called repeatedly
until all data has been returned). In this case the function is just
a wrapper around a receive statement (rough sketch below). Damien's
streaming attachment API takes it from there.
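
To make (c) concrete, the wrapper is basically the following -- a
minimal sketch where the {chunk, Bin} / stream_done messages are
placeholders for whatever the HTTP client actually delivers:

    %% Hypothetical sketch: turn a stream of {chunk, Bin} messages
    %% into an F() -> binary() that the attachment writer can call
    %% repeatedly. Each call blocks until the next chunk arrives.
    StreamFun = fun() ->
        receive
            {chunk, Bin} when is_binary(Bin) ->
                Bin;
            stream_done ->
                <<>>   %% assuming the writer treats <<>> as EOF
        end
    end,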
Well, I got all that written and working, but I ran into some trouble
with ibrowse's async HTTP requests:
* it doesn't look like ibrowse supports any flow control. You tell it
to stream a response to a process, and it just opens up the firehose
and sends messages to that process until the response is complete
(sketch after this list).
* ibrowse sends a message for each received packet. I tried this code
out with a 32 MB attachment and got 20k messages in my mailbox.
Combine that with the lack of flow control and the writer mailbox
blows up pretty quickly.
* Less important, but ibrowse sends the data as lists of bytes rather
than binaries. Seems like a lot of unnecessary copying to me.
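
For anyone who wants to reproduce this, the setup was roughly the
following (from memory, so double-check the option and message names
against the ibrowse source):

    %% Ask ibrowse to stream the response to this process. After
    %% this call there's no way I can find to throttle it; it just
    %% fires one message per received packet.
    {ibrowse_req_id, ReqId} =
        ibrowse:send_req(Url, [], get, [], [{stream_to, self()}]),
    receive
        {ibrowse_async_headers, ReqId, _Status, _Headers} -> ok
    end,
    %% ...and now the mailbox fills with
    %%   {ibrowse_async_response, ReqId, Data}
    %% messages (Data arriving as a list of bytes) until
    %% {ibrowse_async_response_end, ReqId} finally shows up.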
Anyone know of a mailing list for ibrowse, or do we just email
Chandrashekhar directly? It'd be good to get some confirmation from
him on this.
I also took a look at inets' async support and found that it worked
quite a bit better -- it has flow control via the {stream, {self, once}}
option, it sends one message per chunk (CouchDB's default chunk size
is 1 MB), and it sends that message as a binary (so no copying).
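
The setup there looks something like this (function and message names
from memory, so treat it as a sketch and check the inets docs;
write_chunk is a stand-in for the real consumer):

    %% Flow-controlled fetch with inets. With {stream, {self, once}}
    %% the client waits for stream_next/1 before delivering each
    %% chunk, so the consumer sets the pace.
    {ok, ReqId} = http:request(get, {Url, []}, [],
                               [{sync, false}, {stream, {self, once}}]),
    Loop = fun(Self, Handler) ->
        ok = http:stream_next(Handler),
        receive
            {http, {ReqId, stream, BinChunk}} ->
                write_chunk(BinChunk),    %% hypothetical consumer
                Self(Self, Handler);
            {http, {ReqId, stream_end, _Headers}} ->
                done
        end
    end,
    receive
        {http, {ReqId, stream_start, _Headers, Handler}} ->
            Loop(Loop, Handler)
    end,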
However, inets also had some problems. I saw that the VM memory usage
still climbed pretty quickly when replicating a big attachment, and
etop told me that it was all in binaries. I tried process_info(Pid,
binary) and found that the httpc_handler process spawned for that
attachment request was keeping a reference to each binary chunk. At
least, that's what it looked like to me -- I didn't find any
documentation on the BinInfo tuples returned by process_info(), so I
took a guess that they were {UniqueID, Size, NRefs}.
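
In case it's useful, the probe was just this (HandlerPid being the
httpc_handler pid, and the tuple layout being my guess as noted):

    %% Sum the refc binaries a process holds, assuming BinInfo
    %% entries are {UniqueID, Size, NRefs} -- undocumented, so this
    %% interpretation is a guess.
    {binary, Bins} = erlang:process_info(HandlerPid, binary),
    Total = lists:sum([Size || {_Id, Size, _Refs} <- Bins]),
    io:format("~p: ~p binaries, ~p bytes~n",
              [HandlerPid, length(Bins), Total]),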
I was able to replicate GB-sized attachments with the inets async
code. Unfortunately, the Erlang VM took all my free memory and had a
VSIZE of ~500 MB when it finished. I tried tossing explicit
garbage_collect() calls into the couch_db, couch_stream, and
couch_file processes, but it seems
the problem is really in the inets httpc_handler. Nothing else was
keeping a reference to the old binaries. Anybody know of additional
tricks for debugging Erlang memory utilization in general and binary
reference counting in particular?
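
For reference, here's the blunt instrument I've been using to hunt
for the culprit -- it leans on the same guessed {_, Size, _} tuple
layout as above:

    %% Force a GC everywhere, then rank processes by how many bytes
    %% of refc binaries they still reference; the top entry is the
    %% likely leaker.
    Rank = fun() ->
        Tagged = [begin
                      erlang:garbage_collect(P),
                      case erlang:process_info(P, binary) of
                          {binary, Bins} ->
                              {lists:sum([S || {_, S, _} <- Bins]), P};
                          undefined ->
                              {0, P}    %% process exited mid-scan
                      end
                  end || P <- erlang:processes()],
        lists:sublist(lists:reverse(lists:sort(Tagged)), 10)
    end,
    Rank().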
Sorry for the long post.

Best, Adam