Hi Antony,
On May 16, 2009, at 10:39 AM, Antony Blakey wrote:
I can confirm that the target and source of replicated resources
affected by this issue are identical with this fix, and both are
correct, i.e. uncorrupted, although this is only according to the
failures I've seen.
Thanks! Makes me feel better, at least.
Now, on to the checkpointing conditions. I think there's some
confusion about the attachment workflow. Attachments are
downloaded _immediately_ and in their entirety by ibrowse, which
then sends the data as 1MB binary chunks to the attachment receiver
processes.
Are they downloaded to disk by ibrowse?
No, I don't believe so. ibrowse accepts a {stream_to, pid()} option.
It accumulates packets until it reaches a threshold configurable by
{stream_chunk_size, integer()} (default 1MB), then sends the data to
the Pid. I don't think ibrowse is writing to disk at any point in
the process. We do see that when streaming really large attachments,
ibrowse becomes the biggest memory user in the emulator.
ibrowse does offer a {save_response_to_file, boolean()|filename()}
option that we could possibly leverage.
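The accumulate-then-flush behaviour described above can be sketched as follows. This is illustrative Python, not ibrowse itself; the class and names are invented for the example, and only the threshold idea (mirroring ibrowse's {stream_chunk_size, integer()} option, default 1MB) comes from the discussion above.

```python
CHUNK_SIZE = 1024 * 1024  # 1MB, the default threshold mentioned above

class StreamBuffer:
    """Hypothetical receiver: buffer incoming packets and hand data to
    the consumer only once a configurable threshold is reached."""

    def __init__(self, sink, chunk_size=CHUNK_SIZE):
        self.sink = sink              # callable receiving each flushed chunk
        self.chunk_size = chunk_size  # flush threshold in bytes
        self.buf = bytearray()

    def feed(self, packet):
        # Accumulate a network packet; flush whole chunks past the threshold.
        self.buf.extend(packet)
        while len(self.buf) >= self.chunk_size:
            self.sink(bytes(self.buf[:self.chunk_size]))
            del self.buf[:self.chunk_size]

    def close(self):
        # Flush whatever remains at end of stream.
        if self.buf:
            self.sink(bytes(self.buf))
            self.buf.clear()
```

Note that everything stays in memory between flushes, which is consistent with ibrowse showing up as the biggest memory user when streaming very large attachments.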
In another thread Matt Goodall suggested checkpointing after a
certain amount of time has passed. So we'd have a checkpointing
algo that considers
* memory utilization
* number of pending writes
* time elapsed
That seems to cover both resource usage and incremental progress. As
far as the couch_util:should_flush mechanism is concerned, I think a
good idea would be to commit 1 document, then 2, then 4, i.e. a
binary increasing window, which adapts well to both unreliable and
reliable connections without requiring configuration. Configuration
is tricky here because you may want to run the system in a variety of
scenarios without knowing the failure characteristics in advance
(and they may change over time).
It sounds like a good idea. I had thought about doing the same for
the process that pulls new docs from the source server, so that we
could do a better job of filling up the pipes when we're dealing with
the common case of small documents without significant attachment data.
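The two ideas above can be sketched together. This is a hedged Python illustration, not CouchDB's actual couch_util:should_flush implementation: the class names and all thresholds are made up, and only the three checkpoint conditions (memory, pending writes, elapsed time) and the doubling commit window that resets on failure come from the thread.

```python
import time

class CheckpointPolicy:
    """Hypothetical trigger combining the three conditions discussed:
    memory utilization, number of pending writes, and time elapsed."""

    def __init__(self, max_memory=64 * 1024 * 1024,
                 max_pending=1000, max_elapsed=60.0):
        self.max_memory = max_memory    # bytes of buffered data (invented cap)
        self.max_pending = max_pending  # unflushed writes (invented cap)
        self.max_elapsed = max_elapsed  # seconds since last checkpoint
        self.last_checkpoint = time.monotonic()

    def should_checkpoint(self, memory_used, pending_writes):
        elapsed = time.monotonic() - self.last_checkpoint
        return (memory_used >= self.max_memory
                or pending_writes >= self.max_pending
                or elapsed >= self.max_elapsed)

    def checkpointed(self):
        self.last_checkpoint = time.monotonic()

class CommitWindow:
    """Binary increasing window: commit 1 doc, then 2, then 4, doubling
    while commits succeed; drop back to 1 after any failure."""

    def __init__(self, ceiling=256):
        self.size = 1
        self.ceiling = ceiling  # invented cap so the window stays bounded

    def on_success(self):
        self.size = min(self.size * 2, self.ceiling)

    def on_failure(self):
        self.size = 1
```

The appeal of the window is exactly what the thread notes: on a flaky link the window stays small and little work is lost per failure, while on a reliable link it grows toward large batches with no tuning required.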
While we're on this - any idea why couchdb is quitting during
replication? It's not giving me any errors.
Errm, no, I'm afraid I don't have any idea there. I remember one or
two other reports in JIRA that sound similar, but I've not been able
to reproduce them. Are you keeping an eye on the memory usage? I
think an out-of-memory error can trigger this sudden death in Erlang.
Sorry, that's the best I've got at the moment.
Adam