I reconfigured things so that checkpoints are also saved every time
the replicator flushes its internal document buffer. Looks good so
far. Will try to get it committed to SVN tomorrow. Best,
Adam
On Mar 7, 2009, at 10:28 PM, Adam Kocoloski wrote:
And here I thought I was done with replication work for a while ...
When the new replicator streams an attachment, it uses ibrowse
without trying to do any error handling. If the request fails, it
kills the whole replication without giving us a chance to checkpoint
anything. That may not be such a great idea; transient network
failures are a fact of life. A possible solution is to have the
replicator trap exits. When an attachment request fails, the
replicator can catch the exit, roll back and retry.
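The trap-exits idea could look roughly like the following sketch. It is not the actual replicator code; the module and function names (trap_sketch, stream_attachment) are made up for illustration, and the real replicator would handle this inside its gen_server callbacks rather than a bare receive.

```erlang
-module(trap_sketch).
-export([stream_attachment/1]).

%% Hypothetical sketch: run the attachment-streaming loop in a linked
%% process, trap its exit, and signal the caller to retry instead of
%% letting the crash take down the whole replication.
stream_attachment(Fun) ->
    process_flag(trap_exit, true),
    Pid = spawn_link(Fun),
    receive
        %% worker finished cleanly
        {'EXIT', Pid, normal} -> done;
        %% worker died (e.g. transient network failure) -- roll back and retry
        {'EXIT', Pid, _Reason} -> retry
    end.
```

With trap_exit set, a linked process's death arrives as an ordinary {'EXIT', Pid, Reason} message instead of killing the replicator, which is what gives it the chance to checkpoint and retry.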
I committed some updates to my github branch. I haven't had a
chance to do extensive testing yet, but the main ideas are:
* replicator traps exits. Any linked process that dies (usually a
streaming attachment loop) causes the replicator to respawn the
document enumerator, which has the effect of redoing the replication
for the last < 100 updates. There's no limit to the number of times
this loop can occur, but I think that's OK because ...
* document requests are still made by the gen_server process
itself. We had our own manual retry framework for these; that
framework is still in place. After 10 failed attempts for a
particular document, the replicator will terminate with
http_request_failed.
* If an abnormal termination occurs, the replicator will try to save
the current status in the _local docs on source and target. If it's
successful, the next replicator can pick up where this one left off.
* I also tried to clean up the error messages a bit, returning
{"error":"http_request_failed", "reason":Url} instead of dumping the
first line of the Erlang traceback on the client.
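The bounded-retry behavior for document requests might be sketched as below. This is not the replicator's actual retry framework; the module and function names (retry_sketch, with_retries) are hypothetical, and the real code presumably also backs off and logs between attempts.

```erlang
-module(retry_sketch).
-export([with_retries/2]).

%% Hypothetical sketch of a manual retry loop: attempt Fun up to N
%% times; if every attempt dies, terminate with http_request_failed.
with_retries(0, _Fun) ->
    exit(http_request_failed);
with_retries(N, Fun) when N > 0 ->
    %% `catch` also converts exits into {'EXIT', Reason} tuples, so a
    %% Fun that legitimately returns such a tuple would be retried --
    %% acceptable for a sketch, not for production code.
    case catch Fun() of
        {'EXIT', _Reason} -> with_retries(N - 1, Fun);
        Result -> Result
    end.
```

A caller would wrap each per-document HTTP fetch in with_retries(10, ...), matching the ten-attempt limit described above.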
Hopefully I can commit this tomorrow after some further testing.
Cheers, Adam
http://github.com/kocolosk/couchdb/tree/otpify-replication