As mentioned the other day, I'm hoping to add CouchDB support for chunked HTTP requests that contain a document and attachments as a single multipart/related MIME request, and I'm hoping the group can advise me on the best coding direction. Apologies in advance for the length and detail of the email, but there doesn't seem to be a shorter way to ask the question with a sensible amount of background.
Parsing multipart requests happens in couch_httpd:parse_multipart_request/3. This function scans the request for the MIME boundary string, reading 4KB blocks of data as needed, and passes the pieces of data between boundary strings to callback functions for further processing. The function that reads the next block of data is an argument to parse_multipart_request called DataFun; it returns the data block plus the function to be used as the next DataFun. I think of this as a pull-based approach: data is pulled from the request as needed, with each pull returning some data and a new pull function.

The natural extension to handle chunked requests would be an improved DataFun that can grab the next 4KB block from either a chunked or an unchunked request, so I looked for existing support for chunked requests that could be reused. The chunked equivalent of the couch_httpd:recv/2 function that's used to pull 4KB blocks is couch_httpd:recv_chunked/4. This calls the Mochiweb stream_body/3 function which, it transpires, was created for use in CouchDB. However, stream_body differs in philosophy from recv: while recv just hands back a block of data, stream_body reads the whole of the request and calls a ChunkFun parameter on each block of data that it reads. I think of this as a push-based approach: the entire stream is read and pushed into a callback function, one block at a time.

I can think of three ways to fix the mismatch between the pull- and push-based approaches and provide chunked multipart support:

1. Rework parse_multipart_request to be push-based. This would allow reuse of stream_body, but at the cost of turning existing code inside out to fit its push approach.

2. Create a pull-based version of stream_body and probably try to get it incorporated into Mochiweb. But having two similar versions of the same code like this doesn't feel right.

3. Convert stream_body from push-based to pull-based by spawning it in a new process that sends each block of data back to the parse_multipart_request DataFun and then blocks until the message is acknowledged. The DataFun receives the data when it needs to fetch the next block, and then sends an acknowledgement.

The third option feels neatest and is my preferred route, but my ignorance of Erlang means I don't know whether it is potentially expensive. While a new process is very cheap, this approach means that all the request data is copied from that process to parse_multipart_request, and I don't know how costly that is. That sort of copying already goes on in couch_doc:doc_from_multi_part_stream, where the parser is spawned off and copies each document and attachment back to the parent process, but I don't know if that means the copying is cheap, or if it's an unavoidable evil that shouldn't be reproduced elsewhere.

I'd really appreciate any advice the group can give me on the best option to follow, and why, or suggestions for options that I've missed altogether.

Thanks in advance for your help,

Nick
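P.S. In case it helps to make option 3 concrete, here's a rough sketch of the push-to-pull adapter I have in mind. The module and function names here are placeholders of my own invention (not existing CouchDB or Mochiweb API), and the toy producer in demo/0 just stands in for stream_body/3:

```erlang
%% Sketch of option 3: wrap a push-based producer in a spawned process so
%% that a consumer can pull blocks on demand. All names are placeholders.
-module(push_to_pull).
-export([data_fun/1, demo/0]).

%% Spawn the push-based producer. PushFun takes a callback that is invoked
%% once per block; our callback sends each block to the consumer and then
%% blocks until the consumer acknowledges it, so the producer never runs
%% ahead of the parser.
data_fun(PushFun) ->
    Parent = self(),
    Pid = spawn_link(fun() ->
        PushFun(fun(Block) ->
            Parent ! {block, self(), Block},
            receive {ack, Parent} -> ok end
        end),
        Parent ! {done, self()}
    end),
    %% Assumes the returned fun is called from the same process that
    %% called data_fun/1.
    fun() -> pull(Pid) end.

%% The pull side: receive one block, acknowledge it, and return the block
%% together with the next pull function -- the same shape as the DataFun
%% that parse_multipart_request expects.
pull(Pid) ->
    receive
        {block, Pid, Block} ->
            Pid ! {ack, self()},
            {Block, fun() -> pull(Pid) end};
        {done, Pid} ->
            {eof, done}
    end.

demo() ->
    %% A toy push-based producer standing in for stream_body/3.
    Producer = fun(ChunkFun) ->
        lists:foreach(ChunkFun, [<<"abc">>, <<"def">>])
    end,
    Pull0 = data_fun(Producer),
    {<<"abc">>, Pull1} = Pull0(),
    {<<"def">>, Pull2} = Pull1(),
    {eof, done} = Pull2(),
    ok.
```

The property I'm after is that the spawned producer blocks after every block until the consumer acknowledges it, so only one block is ever in flight between the two processes at a time.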
