Re: Attachment Replication Problem

Antony Blakey Fri, 15 May 2009 17:16:55 -0700


On 15/05/2009, at 2:44 PM, Antony Blakey wrote:

I have a 3.5G CouchDB database, consisting of 1000 small documents, each with many attachments (0-30 per document), each attachment varying wildly in size (1K..10M).
To test replication I am running a server on my MBPro and another under Ubuntu in VMWare on the same machine. I'm testing using a pure trunk.
Doing a pull-replicate from OSX to Linux fails to complete. The point at which it fails is constant. I've added some debug logs into couch_rep/attachment_loop like this: http://gist.github.com/112070 and made the suggested "couch_util:should_flush(1000)" mod to try and guarantee progress (but to no avail). The debug output shows this: http://gist.github.com/112069 and the document it seems to fail on is this: http://gist.github.com/112074 . I'm only just starting to look at this - any pointers would be appreciated.


I put some more logging in attachment_loop, specifically this:

        {ibrowse_async_response, ReqId, Data} ->
            ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response DataA ~p", [Url]),
            receive {From, gimme_data} -> From ! {self(), Data} end,
            ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response DataB ~p", [Url]),
            attachment_loop(ReqId);

The result of this is an enormous number of 'DataA' logs without a corresponding 'DataB'. This happens because make_attachment_stub_receiver uses a promise to read the data, created like this:


        ResponseCode >= 200, ResponseCode < 300 ->
            % the normal case
            Pid ! {self(), continue},
            %% this function goes into the streaming attachment code.
            %% It gets executed by the replication gen_server, so it can't
            %% be the one to actually receive the ibrowse data.
            {ok, fun() ->
                Pid ! {self(), gimme_data},
                receive {Pid, Data} -> Data end
            end};
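
For anyone who wants to see the blocking behaviour in isolation, here is a minimal sketch of the same promise pattern outside CouchDB. The module name promise_sketch and the functions start_producer/1, make_promise/1 and demo/0 are made up for illustration; only the gimme_data message shape is taken from the code above.

```erlang
-module(promise_sketch).
-export([start_producer/1, make_promise/1, demo/0]).

%% Hypothetical stand-in for the process that receives ibrowse chunks:
%% it holds Data until some process asks for it with gimme_data.
start_producer(Data) ->
    spawn(fun() ->
        receive {From, gimme_data} -> From ! {self(), Data} end
    end).

%% Returns a zero-argument fun (the "promise"). Nothing is requested
%% from Pid until the fun is actually called.
make_promise(Pid) ->
    fun() ->
        Pid ! {self(), gimme_data},
        receive {Pid, Data} -> Data end
    end.

demo() ->
    Pid = start_producer(<<"chunk">>),
    Promise = make_promise(Pid),
    %% The producer is alive here, waiting in its receive, and it stays
    %% that way for as long as nobody forces the promise.
    Blocked = is_process_alive(Pid),
    Data = Promise(),
    {Blocked, Data}.
```

Calling promise_sketch:demo() returns {true, <<"chunk">>}: the producer stays parked until the fun is called, which is the state each attachment process is in while its 'DataB' log is missing.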

It seems that the promise is forced (i.e. the data is read) only when the documents are checkpointed. If, as in my case, you have lots of small documents with many attachments, this results in massive numbers of open connections to download the attachments, each blocked reading the first bit of data from the first chunk, because the checkpointing occurs by default after 10MB of document data has been read, excluding attachments. In any case, purely using size as a trigger won't work if you have lots of small documents with lots of small attachments. It would seem that the checkpointing, and hence the forcing of the http-reading promises, needs to also account for the number of promises.
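
Roughly what I mean by accounting for the number of promises, as a sketch only: the module flush_sketch, the two-argument should_flush/2 and both thresholds are hypothetical and are not the existing couch_util:should_flush. The caller is assumed to track how many attachment promises it has handed out but not yet forced.

```erlang
-module(flush_sketch).
-export([should_flush/2]).

%% Hypothetical limits, chosen only for illustration.
-define(MAX_BYTES, (10 * 1024 * 1024)).
-define(MAX_PENDING, 100).

%% Flush when either the buffered document bytes or the number of
%% unforced attachment promises reaches its limit.
should_flush(BufferedBytes, PendingPromises) ->
    BufferedBytes >= ?MAX_BYTES orelse PendingPromises >= ?MAX_PENDING.
```

With these limits, should_flush(1024, 3) is false, while should_flush(0, 100) is true even though almost no document data has been buffered, so many small documents with many small attachments would still trigger a checkpoint.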

To overcome this problem I used couch_util:should_flush(1) to ensure that each document would be checkpointed, but that simply demonstrated that this isn't the cause of the 100% repeatable replication hang that I have. Now I get a log trace like this: http://gist.github.com/112512 (ignoring the crap at the end of each log statement, which is my incomplete attempt to link each log to the associated url).


Anyone with any thoughts?

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

What can be done with fewer [assumptions] is done in vain with more
  -- William of Ockham (ca. 1285-1349)
