Hi devs, I spent a good bit of time over the last two days on
attachment replication. I started with pull replication since I had a
pretty clear idea of what I wanted to do there:
a) stop inlining attachments in the JSON document body
b) map over the attachment stubs in the source document, submitting an
async HTTP request for each,
c) replace the stub with a function that's streaming-API-compatible
(meaning it looks like F() -> binary() and can be called repeatedly
until all data has been returned). In this case the function is just
a wrapper around a receive statement (rough sketch below). Damien's
streaming attachment API takes it from there.
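
To make (c) concrete, the wrapper is basically the following -- a
minimal sketch where the {chunk, Bin} / stream_done messages are
placeholders for whatever the HTTP client actually delivers:

    %% Hypothetical sketch: turn a stream of {chunk, Bin} messages
    %% into an F() -> binary() that the attachment writer can call
    %% repeatedly. Each call blocks until the next chunk arrives.
    StreamFun = fun() ->
        receive
            {chunk, Bin} when is_binary(Bin) ->
                Bin;
            stream_done ->
                <<>>   %% assuming the writer treats <<>> as EOF
        end
    end,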
Well, I got all that written and working, but I ran into some trouble
with ibrowse's async HTTP requests:
* it doesn't look like ibrowse supports any flow control. You tell it
to stream a response to a process, and it just opens up the firehose
and sends messages to that process until the response is complete
(sketch after this list).
* ibrowse sends a message for each received packet. I tried this code
out with a 32 MB attachment and got 20k messages in my mailbox.
Combine that with the lack of flow control and the writer mailbox
blows up pretty quickly.
* Less important, but ibrowse sends the data as lists of bytes rather
than binaries. Seems like a lot of unnecessary copying to me.
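
For anyone who wants to reproduce this, the setup was roughly the
following (from memory, so double-check the option and message names
against the ibrowse source):

    %% Ask ibrowse to stream the response to this process. After
    %% this call there's no way I can find to throttle it; it just
    %% fires one message per received packet.
    {ibrowse_req_id, ReqId} =
        ibrowse:send_req(Url, [], get, [], [{stream_to, self()}]),
    receive
        {ibrowse_async_headers, ReqId, _Status, _Headers} -> ok
    end,
    %% ...and now the mailbox fills with
    %%   {ibrowse_async_response, ReqId, Data}
    %% messages (Data arriving as a list of bytes) until
    %% {ibrowse_async_response_end, ReqId} finally shows up.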
Anyone know of a mailing list for ibrowse, or do we just email
Chandrashekhar directly? It'd be good to get some confirmation from
him on this.
I also took a look at inets' async support and found that it worked
quite a bit better -- it has flow control via the {stream, {self, once}}
option, it sends one message per chunk (CouchDB's default chunk size
is 1 MB), and it sends that message as a binary (so no copying).
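
The setup there looks something like this (function and message names
from memory, so treat it as a sketch and check the inets docs;
write_chunk is a stand-in for the real consumer):

    %% Flow-controlled fetch with inets. With {stream, {self, once}}
    %% the client waits for stream_next/1 before delivering each
    %% chunk, so the consumer sets the pace.
    {ok, ReqId} = http:request(get, {Url, []}, [],
                               [{sync, false}, {stream, {self, once}}]),
    Loop = fun(Self, Handler) ->
        ok = http:stream_next(Handler),
        receive
            {http, {ReqId, stream, BinChunk}} ->
                write_chunk(BinChunk),    %% hypothetical consumer
                Self(Self, Handler);
            {http, {ReqId, stream_end, _Headers}} ->
                done
        end
    end,
    receive
        {http, {ReqId, stream_start, _Headers, Handler}} ->
            Loop(Loop, Handler)
    end,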
However, inets also had some problems. I saw that the VM memory usage
still climbed pretty quickly when replicating a big attachment, and
etop told me that it was all in binaries. I tried process_info(Pid,
binary) and found that the httpc_handler process spawned for that
attachment request was keeping a reference to each binary chunk. At
least, that's what it looked like to me -- I didn't find any
documentation on the BinInfo tuples returned by process_info(), so I
took a guess that they were {UniqueID, Size, NRefs}.
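
In case it's useful, the probe was just this (HandlerPid being the
httpc_handler pid, and the tuple layout being my guess as noted):

    %% Sum the refc binaries a process holds, assuming BinInfo
    %% entries are {UniqueID, Size, NRefs} -- undocumented, so this
    %% interpretation is a guess.
    {binary, Bins} = erlang:process_info(HandlerPid, binary),
    Total = lists:sum([Size || {_Id, Size, _Refs} <- Bins]),
    io:format("~p: ~p binaries, ~p bytes~n",
              [HandlerPid, length(Bins), Total]),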
I was able to replicate GB-sized attachments with the inets async
code. Unfortunately, the Erlang VM took all my free memory and had a
VSIZE of ~500 MB when it finished. I tried tossing explicit
garbage_collect() calls into the couch_db, couch_stream, and
couch_file processes, but it seems
the problem is really in the inets httpc_handler. Nothing else was
keeping a reference to the old binaries. Anybody know of additional
tricks for debugging Erlang memory utilization in general and binary
reference counting in particular?
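
For reference, here's the blunt instrument I've been using to hunt
for the culprit -- it leans on the same guessed {_, Size, _} tuple
layout as above:

    %% Force a GC everywhere, then rank processes by how many bytes
    %% of refc binaries they still reference; the top entry is the
    %% likely leaker.
    Rank = fun() ->
        Tagged = [begin
                      erlang:garbage_collect(P),
                      case erlang:process_info(P, binary) of
                          {binary, Bins} ->
                              {lists:sum([S || {_, S, _} <- Bins]), P};
                          undefined ->
                              {0, P}    %% process exited mid-scan
                      end
                  end || P <- erlang:processes()],
        lists:sublist(lists:reverse(lists:sort(Tagged)), 10)
    end,
    Rank().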
Sorry for the long post.

Best, Adam