Hi Antony,

On May 16, 2009, at 10:39 AM, Antony Blakey wrote:

> I can confirm that the target and source of replicated resources affected by this issue are identical with this fix, and both are correct, i.e. uncorrupted, although this is only according to the failures I've seen.

Thanks!  Makes me feel better, at least.

Now, on to the checkpointing conditions. I think there's some confusion about the attachment workflow. Attachments are downloaded _immediately_ and in their entirety by ibrowse, which then sends the data as 1MB binary chunks to the attachment receiver processes.

> Are they downloaded to disk by ibrowse?

No, I don't believe so. ibrowse accepts a {stream_to, pid()} option. It accumulates packets until it reaches a threshold configurable by {stream_chunk_size, integer()} (default 1MB), then sends the data to the Pid. I don't think ibrowse writes to disk at any point in the process. We do see that when streaming really large attachments, ibrowse becomes the biggest memory user in the emulator.
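For anyone unfamiliar with that interface, here's a rough sketch of how a receiver process consumes an ibrowse async stream. This is illustrative rather than CouchDB's actual attachment receiver; it assumes the ibrowse application is started and just counts bytes as chunks arrive:

```erlang
%% Minimal sketch of ibrowse async streaming: with {stream_to, Pid},
%% send_req returns a request id and the response arrives as messages
%% in chunks of up to stream_chunk_size bytes.
-module(stream_example).
-export([fetch/1]).

fetch(Url) ->
    {ibrowse_req_id, ReqId} =
        ibrowse:send_req(Url, [], get, [],
                         [{stream_to, self()},
                          {stream_chunk_size, 1024 * 1024}]),  % 1MB chunks
    loop(ReqId, 0).

loop(ReqId, Bytes) ->
    receive
        {ibrowse_async_headers, ReqId, _Code, _Headers} ->
            loop(ReqId, Bytes);
        {ibrowse_async_response, ReqId, Chunk} ->
            %% each Chunk is at most stream_chunk_size bytes
            loop(ReqId, Bytes + iolist_size(Chunk));
        {ibrowse_async_response_end, ReqId} ->
            {ok, Bytes}
    end.
```

Note that the chunks accumulate in the receiver's mailbox if it falls behind, which is consistent with ibrowse showing up as the biggest memory user on large attachments.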

ibrowse does offer a {save_response_to_file, boolean()|filename()} option that we could possibly leverage.

> In another thread Matt Goodall suggested checkpointing after a certain amount of time has passed. So we'd have a checkpointing algorithm that considers
>
> * memory utilization
> * number of pending writes
> * time elapsed
>
> That seems to cover both resource usage and incremental progress. As far as the couch_util:should_flush mechanism is concerned, I think a good idea would be to commit 1 document, then 2, then 4, i.e. an exponentially increasing window. That adapts well to both unreliable and reliable connections without requiring configuration, which is tricky because you may want to run the system in a variety of scenarios, you might not know what the failure characteristics are, and they may change over time.

It sounds like a good idea. I had thought about doing the same for the process that pulls new docs from the source server, so that we could do a better job of filling up the pipes when we're dealing with the common case of small documents without significant attachment data.
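For concreteness, a policy combining the three conditions with the doubling window might look something like the sketch below. The module, function names (should_checkpoint/4, next_window/2), and thresholds are all illustrative, not actual CouchDB code:

```erlang
%% Hypothetical checkpoint policy: trigger on pending writes reaching
%% the current window, on buffered memory, or on elapsed time; the
%% window doubles after each successful checkpoint and resets to 1
%% when a request fails.
-module(ckpt_policy).
-export([should_checkpoint/4, next_window/2]).

-define(MEM_LIMIT, 10 * 1024 * 1024).  % illustrative: bytes buffered
-define(TIME_LIMIT, 5000).             % illustrative: ms since last checkpoint
-define(MAX_WINDOW, 1024).             % cap on the doc-count window

%% true when any one of the three conditions trips
should_checkpoint(Pending, Window, MemBytes, ElapsedMs) ->
    Pending >= Window
        orelse MemBytes >= ?MEM_LIMIT
        orelse ElapsedMs >= ?TIME_LIMIT.

%% success: double the window, capped so it can't grow without bound
next_window(Window, ok) ->
    erlang:min(Window * 2, ?MAX_WINDOW);
%% failure: fall back to committing every document again
next_window(_Window, error) ->
    1.
```

The appeal of the reset-on-failure rule is that a flaky connection pays for frequent commits only while it's actually failing, and a healthy one quickly converges to large, cheap batches.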

> While we're on this - any idea why couchdb is quitting during replication? It's not giving me any errors.

Errm, no, I'm afraid I don't have any idea there. I remember one or two other reports in JIRA that sounded similar, but I've not been able to reproduce them. Are you keeping an eye on memory usage? An out-of-memory error can cause that kind of sudden death in the Erlang VM. Sorry, that's the best I've got at the moment.

Adam
