On 15/05/2009, at 2:44 PM, Antony Blakey wrote:

I have a 3.5GB CouchDB database, consisting of 1000 small documents, each with many attachments (0-30 per document), with attachments varying wildly in size (1K..10M).

To test replication I am running a server on my MBPro and another under Ubuntu in VMWare on the same machine. I'm testing using a pure trunk.

Doing a pull-replicate from OSX to Linux fails to complete. The point at which it fails is constant. I've added some debug logs into couch_rep/attachment_loop like this: http://gist.github.com/112070 and made the suggested "couch_util:should_flush(1000)" mod to try and guarantee progress (but to no avail). The debug output shows this: http://gist.github.com/112069 and the document it seems to fail on is this: http://gist.github.com/112074 . I'm only just starting to look at this - any pointers would be appreciated.

I put some more logging in attachment_loop, specifically this:

        {ibrowse_async_response, ReqId, Data} ->
            ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response Data A ~p", [Url]),
            receive {From, gimme_data} -> From ! {self(), Data} end,
            ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response Data B ~p", [Url]),
            attachment_loop(ReqId);

The result is an enormous number of 'Data A' log entries without a corresponding 'Data B'. This happens because make_attachment_stub_receiver uses a promise to read the data, created like this:

        ResponseCode >= 200, ResponseCode < 300 ->
            % the normal case
            Pid ! {self(), continue},
            %% this function goes into the streaming attachment code.
            %% It gets executed by the replication gen_server, so it can't
            %% be the one to actually receive the ibrowse data.
            {ok, fun() ->
                Pid ! {self(), gimme_data},
                receive {Pid, Data} -> Data end
            end};
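
The blocking behaviour can be reproduced in isolation. What follows is a minimal sketch, not the couch_rep code: the module promise_demo and the functions make_promise/1 and demo/0 are hypothetical names, and a plain spawned process stands in for the ibrowse-driven attachment_loop. It shows that the producer stays parked in its receive until something calls the fun.

```erlang
-module(promise_demo).
-export([make_promise/1, demo/0]).

%% Hypothetical stand-in for attachment_loop: the spawned producer holds
%% Data and waits for a gimme_data request. The returned fun is the
%% "promise": it sends the request and waits for the reply.
make_promise(Data) ->
    Pid = spawn(fun() ->
        receive {From, gimme_data} -> From ! {self(), Data} end
    end),
    Promise = fun() ->
        Pid ! {self(), gimme_data},
        receive {Pid, Reply} -> Reply end
    end,
    {Pid, Promise}.

demo() ->
    {Pid, Promise} = make_promise(<<"chunk">>),
    %% Nothing has asked for the data yet, so the producer is still
    %% alive and blocked in its receive.
    StillWaiting = is_process_alive(Pid),
    %% Forcing the promise releases the producer and yields the data.
    Data = Promise(),
    {StillWaiting, Data}.
```

Calling promise_demo:demo() returns {true, <<"chunk">>}. With one such producer per attachment, and nothing forcing the promises, every producer (and in the real code its HTTP connection) stays open.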

It seems that the promise is forced (i.e. the data is read) only when the documents are checkpointed. If, as in my case, you have lots of small documents with many attachments, this results in massive numbers of open connections to download the attachments, each blocked after receiving its first chunk of data, because by default checkpointing occurs only after 10MB of document data has been read, excluding attachments. In any case, using size alone as a trigger won't work if you have lots of small documents with lots of small attachments. It would seem that the checkpointing, and hence the forcing of the HTTP-reading promises, also needs to account for the number of outstanding promises.
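
As a rough illustration of that suggestion, here is a sketch of a flush decision that trips on either accumulated document bytes or the number of unforced attachment promises. Everything in it is an assumption made for illustration: the module flush_policy, its functions, and the thresholds are hypothetical and are not part of couch_util or couch_rep.

```erlang
-module(flush_policy).
-export([new/2, add_doc/3, should_flush/1, reset/1]).

%% Hypothetical bookkeeping for one replication batch.
-record(policy, {
    max_bytes,      % flush once this many document bytes are buffered
    max_pending,    % ...or once this many promises are unforced
    bytes = 0,
    pending = 0
}).

new(MaxBytes, MaxPending) when MaxBytes > 0, MaxPending > 0 ->
    #policy{max_bytes = MaxBytes, max_pending = MaxPending}.

%% Record one buffered document: its body size and the number of
%% attachment promises it carries.
add_doc(#policy{bytes = B, pending = P} = Pol, DocBytes, NumAttachments) ->
    Pol#policy{bytes = B + DocBytes, pending = P + NumAttachments}.

%% Flush when either limit is reached, so many small documents with many
%% small attachments cannot accumulate promises without bound.
should_flush(#policy{bytes = B, max_bytes = MaxB,
                     pending = P, max_pending = MaxP}) ->
    B >= MaxB orelse P >= MaxP.

%% Call after a checkpoint has forced the buffered promises.
reset(Pol) ->
    Pol#policy{bytes = 0, pending = 0}.
```

With new(10485760, 20), a single 500-byte document carrying 30 attachments already makes should_flush/1 return true, even though the byte limit is nowhere near reached.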

To overcome this problem I used couch_util:should_flush(1) to ensure that each document would be checkpointed, but that simply demonstrated that the unforced promises aren't the cause of the 100% repeatable replication hang that I have. Now I get a log trace like this: http://gist.github.com/112512 (ignoring the crap at the end of each log statement, which is my incomplete attempt to link each log to the associated URL).

Anyone with any thoughts?

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

What can be done with fewer [assumptions] is done in vain with more
  -- William of Ockham (ca. 1285-1349)


