Thanks for that - I agree that careful review is necessary as the impact is potentially wide-ranging. The tricky cases, where the MIME boundary overlaps data block boundaries, or where the data block is only a byte long, are quite hard to test, though I have some ideas for small unit tests that might do the trick. But I believe the code is correct as it stands, so I wanted to make it available for as many people to break as possible.
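To give an idea of the kind of unit test I mean, something like the DataFun below would feed the parser one byte at a time, so every boundary is forced to straddle a block edge. This is only a sketch: the {Block, NextDataFun} shape follows the description in the quoted mail below, but the empty-binary end-of-data convention and the exact parse_multipart_request arguments are assumptions I'd still need to check against couch_httpd.

    %% Sketch of a test helper, not existing CouchDb code: a DataFun that
    %% yields the request body one byte per call, so every MIME boundary
    %% straddles a block edge. Each call returns {Block, NextDataFun};
    %% "empty binary means end of data" is an assumed convention.
    one_byte_datafun(Body) ->
        fun() ->
            case Body of
                <<>> ->
                    {<<>>, one_byte_datafun(<<>>)};
                <<B:1/binary, Rest/binary>> ->
                    {B, one_byte_datafun(Rest)}
            end
        end.

    %% Assumed usage inside a test (exact arguments to be confirmed):
    %%   couch_httpd:parse_multipart_request(ContentType,
    %%                                       one_byte_datafun(Body),
    %%                                       CallbackFun)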
Nick

On 4 December 2013 11:41, Robert Newson <[email protected]> wrote:
> specifically your "speedup" patch, sorry for thread jumping.
>
> On 4 December 2013 11:41, Robert Newson <[email protected]> wrote:
> > I will try to review this this week if I can. I've worked in the
> > attachment code area before. The idea behind the speedup makes sense
> > to me but given the potential for data loss / corruption it'll need
> > careful review before it can go in. One thing to note is that the
> > replicator will verify the md5 of attachments as they're transferred,
> > if that helps with your testing.
> >
> > B.
> >
> > On 4 December 2013 10:06, Nick North <[email protected]> wrote:
> >> As mentioned the other day, I'm hoping to add CouchDb support for
> >> chunked HTTP requests that contain a document and attachments as a
> >> single multipart/related MIME request, and I'm hoping the group can
> >> advise me on the best coding direction. Apologies in advance for the
> >> length and detail of the email, but there doesn't seem to be a
> >> shorter way to ask the question with a sensible amount of background.
> >>
> >> Parsing multipart requests happens in
> >> couch_httpd:parse_multipart_request/3. This function scans the
> >> request for the MIME boundary string, reading 4KB blocks of data as
> >> needed. The pieces of data between boundary strings are passed to
> >> callback functions for further processing. The function to read the
> >> next block of data is an argument to parse_multipart_request called
> >> DataFun; it returns the data block plus the function to be used as
> >> the next DataFun. I think of this as a pull-based approach: data is
> >> pulled from the request as needed, with the pull returning some data
> >> and a new pull function.
> >>
> >> The natural extension to handle chunked requests would be to provide
> >> an improved DataFun that can grab the next 4KB block from either a
> >> chunked or an unchunked request. So I looked for existing support
> >> for chunked requests that could be reused. The chunked equivalent of
> >> the couch_httpd:recv/2 function that's used to pull 4KB blocks is
> >> the couch_httpd:recv_chunked/4 function. This calls the Mochiweb
> >> stream_body/3 function which, it transpires, was created for use in
> >> CouchDb. However, this differs in philosophy from the recv function:
> >> while recv just hands back a block of data, stream_body reads the
> >> whole of the request and calls a ChunkFun parameter on each block of
> >> data that it reads. I think of this as a push-based approach: the
> >> entire stream is read and pushed into a callback function, one block
> >> at a time.
> >>
> >> I can think of three ways to fix the mismatch between the pull- and
> >> push-based approaches and provide chunked multipart support:
> >>
> >> 1. Rework parse_multipart_request to be push-based. This would allow
> >>    reuse of stream_body, but at the cost of turning existing code
> >>    inside out to fit with its push approach.
> >> 2. Create a pull-based version of stream_body and probably try to
> >>    get it incorporated into Mochiweb. But having two similar
> >>    versions of the same code like this doesn't feel right.
> >> 3. Convert stream_body from push-based to pull-based by spawning it
> >>    in a new process that sends each block of data back to the
> >>    parse_multipart_request DataFun and then blocks until the message
> >>    is acknowledged. The DataFun receives the data when it needs to
> >>    fetch the next block, and then sends an acknowledgement.
> >>
> >> The third option feels neatest and is my preferred route. But my
> >> ignorance of Erlang means that I don't know whether this is
> >> potentially expensive. While a new process is very cheap, it would
> >> mean that all the request data is copied from that process to
> >> parse_multipart_request, and I don't know if that is very costly.
> >> That sort of copying already goes on in
> >> couch_doc:doc_from_multi_part_stream, where the parser is spawned
> >> off and copies each document and attachment back to the parent
> >> process, but I don't know if that means the copying is cheap, or if
> >> it's an unavoidable evil that shouldn't be reproduced elsewhere.
> >>
> >> I'd really appreciate any advice that the group can give me on the
> >> best option to follow, and why, or suggestions for options that I've
> >> missed altogether. Thanks in advance for your help,
> >>
> >> Nick
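For concreteness, the adapter I have in mind for option 3 above looks roughly like the sketch below. It is not working code: the stream_body/3 arguments and the ({Length, Binary}, State) shape of the chunks handed to ChunkFun are from memory of Mochiweb and would need checking, and the message protocol between the two processes is my own invention rather than anything already in CouchDb.

    %% Sketch of option 3: run Mochiweb's push-based stream_body in a
    %% separate process and turn it back into a pull by making the
    %% ChunkFun send each block to the parent and wait for an
    %% acknowledgement before reading the next one.
    chunked_datafun(Req) ->
        Parent = self(),
        Streamer = spawn_link(fun() ->
            ChunkFun = fun
                ({0, _Footers}, _State) ->
                    Parent ! {body_done, self()};
                ({_Length, Bin}, _State) ->
                    Parent ! {body_chunk, self(), Bin},
                    %% Block until the parent has consumed this block,
                    %% so only one block is in flight at a time.
                    receive ack -> ok end
            end,
            Req:stream_body(4096, ChunkFun, ok)  %% assumed arguments
        end),
        datafun(Streamer).

    %% The pull side: the DataFun handed to parse_multipart_request.
    %% Each call takes one block from the streamer, acknowledges it, and
    %% returns {Block, NextDataFun}. Assumes the DataFun is called from
    %% the process that created it; the empty-binary end-of-data
    %% convention is again an assumption.
    datafun(Streamer) ->
        fun() ->
            receive
                {body_chunk, Streamer, Bin} ->
                    Streamer ! ack,
                    {Bin, datafun(Streamer)};
                {body_done, Streamer} ->
                    {<<>>, datafun(Streamer)}
            end
        end.

The copy of each block from the streaming process back to the caller is exactly the cost I can't judge, which is why I'd welcome views on whether this pattern is reasonable.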
