On Feb 12, 2009, at 7:01 PM, Adam Kocoloski wrote:
Hi devs,

I spent a good bit of time over the last two days on attachment
replication. I started with pull replication since I had a pretty
clear idea of what I wanted to do there:
a) stop inlining attachments in the JSON document body
b) map over the attachment stubs in the source document, submitting
an async HTTP request for each,
c) replace the stub with a function that's streaming-API-compatible
(meaning it looks like F() -> binary() and can be called repeatedly
until all the data has been returned). In this case the function is
just a wrapper around a receive statement (sketched below). Damien's
streaming attachment API takes it from there.
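To make (c) concrete, here's roughly the shape I have in mind -- just
a sketch; the message names and the 'done' end-of-stream marker are
made up for illustration:

    %% Returns a fun matching the streaming API: each call yields
    %% the next chunk as a binary; 'done' signals end of attachment.
    make_attachment_fun(ReqId) ->
        fun() ->
            receive
                {attachment_chunk, ReqId, Bin} when is_binary(Bin) ->
                    Bin;
                {attachment_done, ReqId} ->
                    done
            end
        end.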
Well, I got all that written and working, but I ran into some
trouble with ibrowse's async HTTP requests:
* it doesn't look like ibrowse supports any flow control. You tell
it to stream a response to a process, and it just opens up the
firehose and sends messages to that process until the response is
complete.
* ibrowse sends a message for every received packet. I tried this
code out with a 32 MB attachment and got ~20k messages in my mailbox.
Combine that with the lack of flow control, and the writer's mailbox
blows up pretty quickly.
* Less important, but ibrowse sends the data as lists of bytes
rather than binaries. Seems like a lot of unnecessary copying to me.
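For reference, here's roughly what the receiving side looks like with
ibrowse's {stream_to, Pid} option, from my reading of the source
(handle_chunk/1 is a placeholder):

    fetch(Url) ->
        {ibrowse_req_id, ReqId} =
            ibrowse:send_req(Url, [], get, [], [{stream_to, self()}]),
        %% from here on ibrowse pushes messages as fast as packets
        %% arrive; there's no way to tell it to pause
        loop(ReqId).

    loop(ReqId) ->
        receive
            {ibrowse_async_headers, ReqId, _Status, _Headers} ->
                loop(ReqId);
            {ibrowse_async_response, ReqId, Body} ->
                handle_chunk(Body),  %% Body is a list of bytes,
                loop(ReqId);         %% one message per packet
            {ibrowse_async_response_end, ReqId} ->
                ok
        end.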
Anyone know of a mailing list for ibrowse, or do we just email
Chandrashekhar directly? It'd be good to get some confirmation from
him on this.
I also took a look at inets' async support and found that it worked
quite a bit better -- it has flow control via the {self, once}
option, it sends one message per chunk (CouchDB's default chunk size
is 1 MB), and it delivers that message as a binary (so no copying).
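From memory of the inets docs, the {self, once} flow works roughly
like this (a sketch; handle_chunk/1 is again a placeholder, and in
newer OTP releases the http module is renamed httpc):

    fetch(Url) ->
        {ok, ReqId} = http:request(get, {Url, []}, [],
                                   [{sync, false},
                                    {stream, {self, once}}]),
        receive
            {http, {ReqId, stream_start, _Headers, HandlerPid}} ->
                stream_loop(ReqId, HandlerPid)
        end.

    stream_loop(ReqId, Pid) ->
        ok = http:stream_next(Pid),  %% explicitly ask for the next
        receive                      %% chunk -- the flow control
            {http, {ReqId, stream, BinChunk}} ->
                handle_chunk(BinChunk),  %% one binary per chunk
                stream_loop(ReqId, Pid);
            {http, {ReqId, stream_end, _Headers}} ->
                ok
        end.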
However, inets also had some problems. I saw that the VM memory
usage still climbed pretty quickly when replicating a big
attachment, and etop told me that it was all in binaries. I tried
process_info(Pid, binary) and found that the httpc_handler process
spawned for that attachment request was keeping a reference to each
binary chunk. At least, that's what it looked like to me -- I didn't
find any documentation on the BinInfo tuples returned by
process_info(), so I took a guess that they were {UniqueID, Size,
NRefs}.
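If that guess is right, something along these lines will total up the
binary bytes each process is holding on to (a throwaway sketch):

    %% Sum sizes of off-heap binaries a process references, assuming
    %% the undocumented BinInfo format is {UniqueID, Size, NRefs}.
    binary_bytes(Pid) ->
        case process_info(Pid, binary) of
            {binary, Bins} ->
                lists:sum([Sz || {_Id, Sz, _Refs} <- Bins]);
            undefined ->
                0  %% process already exited
        end.

    %% All processes sorted by binary bytes referenced, worst first.
    top_binary_holders() ->
        lists:reverse(lists:keysort(2,
            [{P, binary_bytes(P)} || P <- erlang:processes()])).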
I was able to replicate GB-sized attachments with the inets async
code. Unfortunately, the Erlang VM took all my free memory and had
a VSIZE of ~500 MB when it finished. I tried tossing
garbage_collect() in the couch_db, couch_stream, and couch_file
processes, but it seems the problem is really in the inets
httpc_handler. Nothing else was keeping a reference to the old
binaries. Anybody know of additional tricks for debugging Erlang
memory utilization in general and binary reference counting in
particular?
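In the meantime, the bluntest check I've come up with is to force a
GC on every process and watch erlang:memory(binary) -- if the number
stays high afterwards, something really is still holding references
(shell snippet, not production code):

    Before = erlang:memory(binary),
    [erlang:garbage_collect(P) || P <- erlang:processes()],
    After = erlang:memory(binary),
    io:format("binary memory: ~p -> ~p~n", [Before, After])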
Sorry for the long post.

Best, Adam
Woot! This is awesome, Adam. Sorry, I don't have any answers on the
http client stuff. Maybe we should check on the Erlang list for
available options.
-Damien