To answer my own question on this: option 3 turns out to be simple to implement and apparently efficient, so I'm testing some code now and will put together a pull request when I'm happy with it. The code routes both chunked and unchunked transfers through the Mochiweb stream_body function, so I'm trying out a patched installation on the current hot topic: replication of the NPM registry.
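For anyone interested, here is a minimal sketch of the shape of option 3, assuming the ChunkFun contract documented in couch_httpd:recv_chunked/4 (ChunkFun({Length, Binary}, State), called with Length == 0 on the final block). The function names and message tags below are made up for illustration, not the names in the patch:

    %% Spawn a process that runs the push-based stream_body and forwards
    %% each block to the calling process, blocking until the block is
    %% acknowledged, so at most one block is in flight at a time.
    spawn_streamer(MochiReq) ->
        Parent = self(),
        spawn_link(fun() ->
            ChunkFun = fun({0, _Footers}, Acc) ->
                               Parent ! {body_done, self()},
                               Acc;
                          ({_Len, Bin}, Acc) ->
                               Parent ! {body_bytes, self(), Bin},
                               receive {ack, Parent} -> Acc end
                       end,
            MochiReq:stream_body(4096, ChunkFun, ok)
        end).

    %% A pull-based DataFun for parse_multipart_request: each call hands
    %% back the next block plus the DataFun to use for the block after it.
    data_fun(Streamer) ->
        fun() ->
            receive
                {body_bytes, Streamer, Bin} ->
                    Streamer ! {ack, self()},
                    {Bin, data_fun(Streamer)};
                {body_done, Streamer} ->
                    %% End of body; the real code would follow whatever
                    %% end-of-data convention parse_multipart_request uses.
                    {<<>>, fun() -> throw(request_body_exhausted) end}
            end
        end.

The receive inside ChunkFun blocks the streaming process until the previous block has been acknowledged, so only one 4KB block is ever in transit between the two processes. The cost is one inter-process copy per block, which is the copying question raised in the quoted message below.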
Nick

On 4 December 2013 10:06, Nick North <[email protected]> wrote:

> As mentioned the other day, I'm hoping to add CouchDb support for chunked
> HTTP requests that contain a document and attachments as a single
> multipart/related MIME request, and I'm hoping the group can advise me on
> the best coding direction. Apologies in advance for the length and detail
> of the email, but there doesn't seem to be a shorter way to ask the
> question with a sensible amount of background.
>
> Parsing multipart requests happens in
> couch_httpd:parse_multipart_request/3. This function scans the request
> for the MIME boundary string, reading 4KB blocks of data as needed. The
> pieces of data between boundary strings are passed to callback functions
> for further processing. The function to read the next block of data is an
> argument to parse_multipart_request called DataFun; it returns the data
> block plus the function to be used as the next DataFun. I think of this
> as a pull-based approach: data is pulled from the request as needed, with
> each pull returning some data and a new pull function.
>
> The natural extension to handle chunked requests would be to provide an
> improved DataFun that can grab the next 4KB block from either a chunked
> or an unchunked request. So I looked for existing support for chunked
> requests that could be reused. The chunked equivalent of the
> couch_httpd:recv/2 function that's used to pull 4KB blocks is the
> couch_httpd:recv_chunked/4 function. This calls the Mochiweb
> stream_body/3 function which, it transpires, was created for use in
> CouchDb. However, this differs in philosophy from the recv function:
> while recv just hands back a block of data, stream_body reads the whole
> of the request and calls a ChunkFun parameter on each block of data that
> it reads. I think of this as a push-based approach: the entire stream is
> read and pushed into a callback function, one block at a time.
>
> I can think of three ways to fix the mismatch between the pull and
> push-based approaches and provide chunked multipart support:
>
> 1. Rework parse_multipart_request to be push-based. This would allow
> reuse of stream_body, but at the cost of turning existing code inside
> out to fit with its push approach.
> 2. Create a pull-based version of stream_body and probably try to get it
> incorporated into Mochiweb. But having two similar versions of the same
> code like this doesn't feel right.
> 3. Convert stream_body from push-based to pull-based by spawning it in a
> new process that sends each block of data back to the
> parse_multipart_request DataFun and then blocks until the message is
> acknowledged. The DataFun receives the data when it needs to fetch the
> next block, and then sends an acknowledgement.
>
> The third option feels neatest and is my preferred route. But my
> ignorance of Erlang means that I don't know whether this is potentially
> expensive. While a new process is very cheap, it would mean that all the
> request data is copied from that process to parse_multipart_request, and
> I don't know how costly that is. That sort of copying already goes on in
> couch_doc:doc_from_multi_part_stream, where the parser is spawned off and
> copies each document and attachment back to the parent process, but I
> don't know whether that means the copying is cheap, or if it's an
> unavoidable evil that shouldn't be reproduced elsewhere.
>
> I'd really appreciate any advice that the group can give me on the best
> option to follow, and why, or suggestions for options that I've missed
> altogether. Thanks in advance for your help,
>
> Nick
