Thanks for that - I agree that careful review is necessary as the impact is potentially wide-ranging. The tricky cases, where the MIME boundary overlaps data block boundaries, or where the data block is only a byte long, are quite hard to test, though I have some ideas for small unit tests that might do the trick. But I believe the code is correct as it stands, so I wanted to make it available for as many people to break as possible.
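To give an idea of the kind of unit test I mean, something like the DataFun below would feed the parser one byte at a time, so every boundary is forced to straddle a block edge. This is only a sketch: the {Block, NextDataFun} shape follows the description in the quoted mail below, but the empty-binary end-of-data convention and the exact parse_multipart_request arguments are assumptions I'd still need to check against couch_httpd.

    %% Sketch of a test helper, not existing CouchDb code: a DataFun that
    %% yields the request body one byte per call, so every MIME boundary
    %% straddles a block edge. Each call returns {Block, NextDataFun};
    %% "empty binary means end of data" is an assumed convention.
    one_byte_datafun(Body) ->
        fun() ->
            case Body of
                <<>> ->
                    {<<>>, one_byte_datafun(<<>>)};
                <<B:1/binary, Rest/binary>> ->
                    {B, one_byte_datafun(Rest)}
            end
        end.

    %% Assumed usage inside a test (exact arguments to be confirmed):
    %%   couch_httpd:parse_multipart_request(ContentType,
    %%                                       one_byte_datafun(Body),
    %%                                       CallbackFun)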
Nick

On 4 December 2013 11:41, Robert Newson <[email protected]> wrote:
> specifically your "speedup" patch, sorry for thread jumping.
>
> On 4 December 2013 11:41, Robert Newson <[email protected]> wrote:
> > I will try to review this this week if I can. I've worked in the
> > attachment code area before. The idea behind the speedup makes sense
> > to me but given the potential for data loss / corruption it'll need
> > careful review before it can go in. One thing to note is that the
> > replicator will verify the md5 of attachments as they're transferred,
> > if that helps with your testing.
> >
> > B.
> >
> > On 4 December 2013 10:06, Nick North <[email protected]> wrote:
> >> As mentioned the other day, I'm hoping to add CouchDb support for
> >> chunked HTTP requests that contain a document and attachments as a
> >> single multipart/related MIME request, and I'm hoping the group can
> >> advise me on the best coding direction. Apologies in advance for the
> >> length and detail of the email, but there doesn't seem to be a
> >> shorter way to ask the question with a sensible amount of background.
> >>
> >> Parsing multipart requests happens in
> >> couch_httpd:parse_multipart_request/3. This function scans the
> >> request for the MIME boundary string, reading 4KB blocks of data as
> >> needed. The pieces of data between boundary strings are passed to
> >> callback functions for further processing. The function to read the
> >> next block of data is an argument to parse_multipart_request called
> >> DataFun; it returns the data block plus the function to be used as
> >> the next DataFun. I think of this as a pull-based approach: data is
> >> pulled from the request as needed, with the pull returning some data
> >> and a new pull function.
> >>
> >> The natural extension to handle chunked requests would be to provide
> >> an improved DataFun that can grab the next 4KB block from either a
> >> chunked or an unchunked request. So I looked for existing support
> >> for chunked requests that could be reused. The chunked equivalent of
> >> the couch_httpd:recv/2 function that's used to pull 4KB blocks is
> >> the couch_httpd:recv_chunked/4 function. This calls the Mochiweb
> >> stream_body/3 function which, it transpires, was created for use in
> >> CouchDb. However, this differs in philosophy from the recv function:
> >> while recv just hands back a block of data, stream_body reads the
> >> whole of the request and calls a ChunkFun parameter on each block of
> >> data that it reads. I think of this as a push-based approach: the
> >> entire stream is read and pushed into a callback function, one block
> >> at a time.
> >>
> >> I can think of three ways to fix the mismatch between the pull- and
> >> push-based approaches and provide chunked multipart support:
> >>
> >> 1. Rework parse_multipart_request to be push-based. This would allow
> >>    reuse of stream_body, but at the cost of turning existing code
> >>    inside out to fit with its push approach.
> >> 2. Create a pull-based version of stream_body and probably try to
> >>    get it incorporated into Mochiweb. But having two similar
> >>    versions of the same code like this doesn't feel right.
> >> 3. Convert stream_body from push-based to pull-based by spawning it
> >>    in a new process that sends each block of data back to the
> >>    parse_multipart_request DataFun and then blocks until the message
> >>    is acknowledged. The DataFun receives the data when it needs to
> >>    fetch the next block, and then sends an acknowledgement.
> >>
> >> The third option feels neatest and is my preferred route. But my
> >> ignorance of Erlang means that I don't know whether this is
> >> potentially expensive. While a new process is very cheap, it would
> >> mean that all the request data is copied from that process to
> >> parse_multipart_request, and I don't know if that is very costly.
> >> That sort of copying already goes on in
> >> couch_doc:doc_from_multi_part_stream, where the parser is spawned
> >> off and copies each document and attachment back to the parent
> >> process, but I don't know if that means the copying is cheap, or if
> >> it's an unavoidable evil that shouldn't be reproduced elsewhere.
> >>
> >> I'd really appreciate any advice that the group can give me on the
> >> best option to follow, and why, or suggestions for options that I've
> >> missed altogether. Thanks in advance for your help,
> >>
> >> Nick
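For concreteness, the adapter I have in mind for option 3 above looks roughly like the sketch below. It is not working code: the stream_body/3 arguments and the ({Length, Binary}, State) shape of the chunks handed to ChunkFun are from memory of Mochiweb and would need checking, and the message protocol between the two processes is my own invention rather than anything already in CouchDb.

    %% Sketch of option 3: run Mochiweb's push-based stream_body in a
    %% separate process and turn it back into a pull by making the
    %% ChunkFun send each block to the parent and wait for an
    %% acknowledgement before reading the next one.
    chunked_datafun(Req) ->
        Parent = self(),
        Streamer = spawn_link(fun() ->
            ChunkFun = fun
                ({0, _Footers}, _State) ->
                    Parent ! {body_done, self()};
                ({_Length, Bin}, _State) ->
                    Parent ! {body_chunk, self(), Bin},
                    %% Block until the parent has consumed this block,
                    %% so only one block is in flight at a time.
                    receive ack -> ok end
            end,
            Req:stream_body(4096, ChunkFun, ok)  %% assumed arguments
        end),
        datafun(Streamer).

    %% The pull side: the DataFun handed to parse_multipart_request.
    %% Each call takes one block from the streamer, acknowledges it, and
    %% returns {Block, NextDataFun}. Assumes the DataFun is called from
    %% the process that created it; the empty-binary end-of-data
    %% convention is again an assumption.
    datafun(Streamer) ->
        fun() ->
            receive
                {body_chunk, Streamer, Bin} ->
                    Streamer ! ack,
                    {Bin, datafun(Streamer)};
                {body_done, Streamer} ->
                    {<<>>, datafun(Streamer)}
            end
        end.

The copy of each block from the streaming process back to the caller is exactly the cost I can't judge, which is why I'd welcome views on whether this pattern is reasonable.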
