On 15/05/2009, at 2:44 PM, Antony Blakey wrote:
I have a 3.5G CouchDB database, consisting of 1000 small documents,
each with many attachments (0-30 per document), each attachment
varying wildly in size (1K..10M).
To test replication I am running a server on my MBPro and another
under Ubuntu in VMWare on the same machine. I'm testing using a pure
trunk.
Doing a pull-replicate from OSX to Linux fails to complete. The
point at which it fails is constant. I've added some debug logs into
couch_rep/attachment_loop like this: http://gist.github.com/112070
and made the suggested "couch_util:should_flush(1000)" mod to try
and guarantee progress (but to no avail). The debug output shows
this: http://gist.github.com/112069 and the document it seems to
fail on is this: http://gist.github.com/112074 . I'm only just
starting to look at this - any pointers would be appreciated.
I put some more logging in attachment_loop, specifically this:
    {ibrowse_async_response, ReqId, Data} ->
        ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response Data A ~p", [Url]),
        receive {From, gimme_data} -> From ! {self(), Data} end,
        ?LOG_DEBUG("ATTACHMENT_LOOP: ibrowse_async_response Data B ~p", [Url]),
        attachment_loop(ReqId);
With this in place I see an enormous number of 'Data A' log entries
without a corresponding 'Data B'. This happens because
make_attachment_stub_receiver uses a promise to read the data, created
like this:
    ResponseCode >= 200, ResponseCode < 300 ->
        % the normal case
        Pid ! {self(), continue},
        %% this function goes into the streaming attachment code.
        %% It gets executed by the replication gen_server, so it can't
        %% be the one to actually receive the ibrowse data.
        {ok, fun() ->
            Pid ! {self(), gimme_data},
            receive {Pid, Data} -> Data end
        end};
It seems that the promise is forced (i.e. the data is read) only when the
documents are checkpointed. If, as in my case, you have lots of small
documents with many attachments, this results in a massive number of
open connections to download the attachments, each blocked on the
first bit of data from the first chunk, because checkpointing
occurs by default only after 10MB of document data has been read,
excluding attachments. In any case, using size alone as a trigger won't
work if you have lots of small documents with lots of small attachments.
It would seem that the checkpointing, and hence the forcing of the
HTTP-reading promises, also needs to account for the number of promises.
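
To make the idea concrete, here is a minimal sketch of a flush test that
looks at both the buffered size and the number of unforced promises. It is
only an illustration: the module flush_sketch, the four-argument
should_flush/4 and both thresholds are hypothetical names of my own, not
anything that exists in couch_util or couch_rep.

```erlang
-module(flush_sketch).
-export([should_flush/4]).

%% Hypothetical helper, NOT part of couch_util or couch_rep.
%% BufferedBytes   - document bytes buffered since the last checkpoint
%% PendingPromises - attachment promises created but not yet forced
%% MaxBytes and MaxPromises are illustrative thresholds chosen by the caller.
%% Returns true as soon as either threshold is reached.
should_flush(BufferedBytes, PendingPromises, MaxBytes, MaxPromises)
  when is_integer(BufferedBytes), is_integer(PendingPromises) ->
    BufferedBytes >= MaxBytes orelse PendingPromises >= MaxPromises.
```

For example, flush_sketch:should_flush(2048, 50, 10485760, 50) returns true
even though only 2K of document data is buffered, because the promise count
has reached its limit. That bounds the number of connections left waiting
for a gimme_data request, whatever the document sizes are.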
To overcome this problem I used couch_util:should_flush(1) to ensure
that each document would be checkpointed, but that simply demonstrated
that this isn't the cause of the 100% repeatable replication hang that
I have. Now I get a log trace like this: http://gist.github.com/112512
(ignoring the crap at the end of each log statement, which is my
incomplete attempt to link each log to the associated URL).
Anyone with any thoughts?
Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
What can be done with fewer [assumptions] is done in vain with more
-- William of Ockham (ca. 1285-1349)