On Feb 12, 2009, at 7:01 PM, Adam Kocoloski wrote:
Hi devs,

I spent a good bit of time over the last two days on attachment
replication. I started with pull replication since I had a pretty
clear idea of what I wanted to do there:
a) stop inlining attachments in the JSON document body
b) map over the attachment stubs in the source document, submitting
an async HTTP request for each,
c) replace the stub with a function that's streaming-API-compatible
(meaning it looks like F() -> binary() and can be called repeatedly
until all the data has been returned). In this case the function is
just a wrapper around a receive statement (sketched below). Damien's
streaming attachment API takes it from there.
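To make (c) concrete, here's roughly the shape I have in mind -- just
a sketch; the message names and the 'done' end-of-stream marker are
made up for illustration:

    %% Returns a fun matching the streaming API: each call yields
    %% the next chunk as a binary; 'done' signals end of attachment.
    make_attachment_fun(ReqId) ->
        fun() ->
            receive
                {attachment_chunk, ReqId, Bin} when is_binary(Bin) ->
                    Bin;
                {attachment_done, ReqId} ->
                    done
            end
        end.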
Well, I got all that written and working, but I ran into some
trouble with ibrowse's async HTTP requests:
* it doesn't look like ibrowse supports any flow control. You tell
it to stream a response to a process, and it just opens up the
firehose and sends messages to that process until the response is
complete.
* ibrowse sends a message for every received packet. I tried this
code out with a 32 MB attachment and got ~20k messages in my mailbox.
Combine that with the lack of flow control, and the writer's mailbox
blows up pretty quickly.
* Less important, but ibrowse sends the data as lists of bytes
rather than binaries. Seems like a lot of unnecessary copying to me.
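For reference, here's roughly what the receiving side looks like with
ibrowse's {stream_to, Pid} option, from my reading of the source
(handle_chunk/1 is a placeholder):

    fetch(Url) ->
        {ibrowse_req_id, ReqId} =
            ibrowse:send_req(Url, [], get, [], [{stream_to, self()}]),
        %% from here on ibrowse pushes messages as fast as packets
        %% arrive; there's no way to tell it to pause
        loop(ReqId).

    loop(ReqId) ->
        receive
            {ibrowse_async_headers, ReqId, _Status, _Headers} ->
                loop(ReqId);
            {ibrowse_async_response, ReqId, Body} ->
                handle_chunk(Body),  %% Body is a list of bytes,
                loop(ReqId);         %% one message per packet
            {ibrowse_async_response_end, ReqId} ->
                ok
        end.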
Anyone know of a mailing list for ibrowse, or do we just email
Chandrashekhar directly? It'd be good to get some confirmation from
him on this.
I also took a look at inets' async support and found that it worked
quite a bit better -- it has flow control via the {self, once}
option, it sends one message per chunk (CouchDB's default chunk size
is 1 MB), and it delivers that message as a binary (so no copying).
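From memory of the inets docs, the {self, once} flow works roughly
like this (a sketch; handle_chunk/1 is again a placeholder, and in
newer OTP releases the http module is renamed httpc):

    fetch(Url) ->
        {ok, ReqId} = http:request(get, {Url, []}, [],
                                   [{sync, false},
                                    {stream, {self, once}}]),
        receive
            {http, {ReqId, stream_start, _Headers, HandlerPid}} ->
                stream_loop(ReqId, HandlerPid)
        end.

    stream_loop(ReqId, Pid) ->
        ok = http:stream_next(Pid),  %% explicitly ask for the next
        receive                      %% chunk -- the flow control
            {http, {ReqId, stream, BinChunk}} ->
                handle_chunk(BinChunk),  %% one binary per chunk
                stream_loop(ReqId, Pid);
            {http, {ReqId, stream_end, _Headers}} ->
                ok
        end.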
However, inets also had some problems. I saw that the VM memory
usage still climbed pretty quickly when replicating a big
attachment, and etop told me that it was all in binaries. I tried
process_info(Pid, binary) and found that the httpc_handler process
spawned for that attachment request was keeping a reference to each
binary chunk. At least, that's what it looked like to me -- I didn't
find any documentation on the BinInfo tuples returned by
process_info(), so I took a guess that they were {UniqueID, Size,
NRefs}.
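If that guess is right, something along these lines will total up the
binary bytes each process is holding on to (a throwaway sketch):

    %% Sum sizes of off-heap binaries a process references, assuming
    %% the undocumented BinInfo format is {UniqueID, Size, NRefs}.
    binary_bytes(Pid) ->
        case process_info(Pid, binary) of
            {binary, Bins} ->
                lists:sum([Sz || {_Id, Sz, _Refs} <- Bins]);
            undefined ->
                0  %% process already exited
        end.

    %% All processes sorted by binary bytes referenced, worst first.
    top_binary_holders() ->
        lists:reverse(lists:keysort(2,
            [{P, binary_bytes(P)} || P <- erlang:processes()])).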
I was able to replicate GB-sized attachments with the inets async
code. Unfortunately, the Erlang VM took all my free memory and had
a VSIZE of ~500 MB when it finished. I tried tossing
garbage_collect() in the couch_db, couch_stream, and couch_file
processes, but it seems the problem is really in the inets
httpc_handler. Nothing else was keeping a reference to the old
binaries. Anybody know of additional tricks for debugging Erlang
memory utilization in general and binary reference counting in
particular?
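In the meantime, the bluntest check I've come up with is to force a
GC on every process and watch erlang:memory(binary) -- if the number
stays high afterwards, something really is still holding references
(shell snippet, not production code):

    Before = erlang:memory(binary),
    [erlang:garbage_collect(P) || P <- erlang:processes()],
    After = erlang:memory(binary),
    io:format("binary memory: ~p -> ~p~n", [Before, After])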
Sorry for the long post.

Best, Adam
Woot! This is awesome, Adam. Sorry, I don't have any answers on the
http client stuff. Maybe we should check on the Erlang list for
available options.
-Damien