Hi devs, I spent a good bit of time over the last two days on attachment replication. I started with pull replication since I had a pretty clear idea of what I wanted to do there:

a) stop inlining attachments in the JSON document body

b) map over the attachment stubs in the source document, submitting an async HTTP request for each,

c) replace the stub with a function that's streaming-API-compatible (meaning it looks like F() -> binary() and can be called repeatedly until all data has been returned). In this case the function is just a wrapper around a receive statement. Damien's streaming attachment API takes it from there.
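For concreteness, here's roughly what that wrapper looks like (the message names are invented for this sketch, not the actual patch):

    %% The writer gets a zero-arity fun and calls it repeatedly; each call
    %% blocks in a receive until the process fetching the attachment mails
    %% us another piece of the body.  {chunk, Bin} / done are made-up names.
    Next = fun() ->
               receive
                   {chunk, Bin} when is_binary(Bin) -> Bin;
                   done -> <<>>   %% empty binary to signal end-of-data
               end
           end.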

Well, I got all that written and working, but I ran into some trouble with ibrowse's async HTTP requests:

* It doesn't look like ibrowse supports any flow control. You tell it to stream a response to a process, and it just opens up the firehose and sends messages to that process until the response is complete.

* ibrowse sends a message for each received packet. I tried this code out with a 32 MB attachment and got 20k messages in my mailbox. Combine that with the lack of flow control and the writer's mailbox blows up pretty quickly. (The receive pattern I'm talking about is sketched just after this list.)

* Less important, but ibrowse sends the data as lists of bytes rather than binaries. Seems like a lot of unnecessary copying to me.
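For reference, the pattern I'm using looks something like this -- handle_chunk/1 is a stand-in for the writer, and the message shapes are what I see coming out of the {stream_to, Pid} option:

    %% Once the request is issued there's nothing I can find to throttle it;
    %% ibrowse mails Pid one message per received packet until the response
    %% is complete.
    fetch(Url) ->
        {ibrowse_req_id, ReqId} =
            ibrowse:send_req(Url, [], get, [], [{stream_to, self()}]),
        collect(ReqId).

    collect(ReqId) ->
        receive
            {ibrowse_async_headers, ReqId, _Status, _Headers} ->
                collect(ReqId);
            {ibrowse_async_response, ReqId, Data} ->
                handle_chunk(Data),  %% Data is a list of bytes, not a binary
                collect(ReqId);
            {ibrowse_async_response_end, ReqId} ->
                done
        end.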

Anyone know of a mailing list for ibrowse, or do we just email Chandrashekhar directly? It'd be good to get some confirmation from him on this.

I also took a look at inets' async support and found that it worked quite a bit better -- it has flow control via the {self, once} option, it sends one message per chunk (CouchDB's default chunk size is 1 MB), and it delivers that message as a binary (so no copying).
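Here's the shape of that code for comparison (module names may differ by inets release -- I'm writing httpc: below; handle_chunk/1 is again a stand-in for the writer):

    %% {self, once} is the flow control: inets sends nothing further until
    %% we call stream_next/1, so the mailbox holds at most one chunk.
    fetch(Url) ->
        {ok, ReqId} =
            httpc:request(get, {Url, []}, [],
                          [{sync, false}, {stream, {self, once}},
                           {body_format, binary}]),
        receive
            {http, {ReqId, stream_start, _Headers, HandlerPid}} ->
                ok = httpc:stream_next(HandlerPid),
                stream_loop(ReqId, HandlerPid)
        end.

    stream_loop(ReqId, HandlerPid) ->
        receive
            {http, {ReqId, stream, BinChunk}} ->
                handle_chunk(BinChunk),              %% arrives as a binary
                ok = httpc:stream_next(HandlerPid),  %% request the next one
                stream_loop(ReqId, HandlerPid);
            {http, {ReqId, stream_end, _Headers}} ->
                done
        end.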

However, inets also had some problems. I saw that the VM memory usage still climbed pretty quickly when replicating a big attachment, and etop told me that it was all in binaries. I tried process_info(Pid, binary) and found that the httpc_handler process spawned for that attachment request was keeping a reference to each binary chunk. At least, that's what it looked like to me -- I didn't find any documentation on the BinInfo tuples returned by process_info() so I took a guess that they were {UniqueID, Size, NRefs}.
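In case it's useful to someone, this is how I was summing things up; the {_Id, Size, _NRefs} reading of the tuples is just my guess from above:

    %% Count and total the refc binaries a process is holding on to.
    %% process_info(Pid, binary) is undocumented, so the tuple layout
    %% here is an assumption.
    report_binaries(Pid) ->
        {binary, BinInfo} = erlang:process_info(Pid, binary),
        Total = lists:sum([Size || {_Id, Size, _NRefs} <- BinInfo]),
        io:format("~p holds ~p binaries totalling ~p bytes~n",
                  [Pid, length(BinInfo), Total]).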

I was able to replicate GB-sized attachments with the inets async code. Unfortunately, the Erlang VM took all my free memory and had a VSIZE of ~500 MB when it finished. I tried tossing garbage_collect() in the couch_db, couch_stream, and couch_file processes, but it seems the problem is really in the inets httpc_handler. Nothing else was keeping a reference to the old binaries. Anybody know of additional tricks for debugging Erlang memory utilization in general and binary reference counting in particular?
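The closest thing I have to a trick right now is a brute-force sweep -- force a GC in every process, then rank who still holds binary references (same BinInfo guess as above):

    %% Garbage collect everything, then list processes by how many bytes
    %% of refc binaries they still reference, biggest holder first.
    [erlang:garbage_collect(P) || P <- processes()],
    lists:reverse(lists:keysort(2,
        [{P, lists:sum([Sz || {_, Sz, _} <- Bins])}
         || P <- processes(),
            {binary, Bins} <- [erlang:process_info(P, binary)],
            Bins =/= []])).

That's roughly how I convinced myself the httpc_handler was the one holding everything.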

Sorry for the long post.  Best, Adam