Dan Creswell wrote:
I've yet to see exactly how Erlang does failure detection of processes.
 I guess there might be some timeout value somewhere in respect of
messages reaching a destination etc but I've not seen a description of
this aspect of Erlang.

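Failure detection in Erlang appears to happen at the process level rather than at the message level. All processes, whether local or remote, are typically treated as unreliable. As I understand it, a process can link to or monitor another process and is then sent a message when that process exits, so you react to the death of a process rather than to a message timing out.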
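To make the idea concrete, here is a rough sketch in Python (not Erlang) of what process-level failure detection looks like. It is only an analogy: threads stand in for Erlang processes, a queue stands in for the watcher's mailbox, and the names spawn_monitored, crashing_worker and the ('DOWN', name, reason) tuple shape are my own inventions for illustration, not Erlang's API.

```python
import queue
import threading


def spawn_monitored(name, fn, mailbox):
    """Run fn in a thread and post ('DOWN', name, reason) to mailbox when it ends.

    Hypothetical helper: loosely imitates monitoring a process and being
    told why it exited, instead of waiting for a message to time out.
    """
    def runner():
        try:
            fn()
            reason = "normal"        # the worker finished without error
        except Exception as exc:
            reason = repr(exc)       # the worker crashed; report why
        mailbox.put(("DOWN", name, reason))

    thread = threading.Thread(target=runner, name=name, daemon=True)
    thread.start()
    return thread


def crashing_worker():
    # Stand-in for a worker that hits an unexpected error.
    raise ValueError("bad input")


mailbox = queue.Queue()
spawn_monitored("worker_1", crashing_worker, mailbox)

# The watcher learns about the crash as an ordinary message in its mailbox.
down_msg = mailbox.get(timeout=5)
print(down_msg)
```

The point of the sketch is that the watcher never inspects the worker's messages; it simply receives a notification that the worker has gone away, along with the reason.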

Further whilst Erlang might do failure detection (of a form) solving the
issues of failure are the difficult bit and I'm less convinced Erlang
offers much here.  For example, one solution to failure is replication
and it appears you are (unsurprisingly) left to do that for yourself
right now.  Putting my high-performance hat on I'd also point out that
replication has recognized limits especially when it's done with
transactions which leads to even more esoteric solutions that are
largely about appropriate architecture/interactions and less about
shared-nothing or message passing.

Failure handling is a *BIG* thing in Erlang/OTP, and is touted as one of its main strengths along with concurrency. Again, I'd reiterate that I don't have much practical experience with Erlang, but it does seem to have a lot to offer in this regard.

Erlang/OTP provides a notion of supervisor processes and supervision trees, where the failure or crash of a process is detected by its supervisor, which can then handle the failure appropriately (usually by logging the error or restarting the process according to a particular restart strategy). Hand in hand with this are Erlang's dynamic code updates, which mean that when part of a system crashes or fails you can fix the error and deploy the fix to the live system, restarting *JUST* the process that failed in a graceful manner.
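As a very rough illustration of the supervisor idea, here is a minimal sketch in Python (again, not Erlang, and not how OTP implements it). The Supervisor class, its max_restarts parameter and flaky_child are all hypothetical names made up for this example. It shows only the simplest case: one child, restarted on a crash, with a cap on the number of restarts. Real OTP supervisors offer several restart strategies (one_for_one, one_for_all, rest_for_one) and limit restarts within a time period, none of which is modelled here.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("supervisor")


class Supervisor:
    """Toy supervisor: restarts a crashed child up to max_restarts times."""

    def __init__(self, max_restarts=3):
        self.max_restarts = max_restarts
        self.restarts = {}   # child name -> number of restarts so far

    def supervise(self, name, start_fn):
        """Run start_fn; on a crash, log the error and restart the child."""
        self.restarts[name] = 0
        while True:
            try:
                return start_fn()
            except Exception as exc:
                log.info("child %s crashed: %r", name, exc)
                if self.restarts[name] >= self.max_restarts:
                    # Too many crashes: give up and let the failure
                    # propagate to whoever supervises this supervisor.
                    log.info("child %s exceeded restart limit", name)
                    raise
                self.restarts[name] += 1


attempts = {"count": 0}


def flaky_child():
    # Hypothetical child that fails twice and then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


sup = Supervisor(max_restarts=3)
result = sup.supervise("flaky", flaky_child)
print(result, sup.restarts["flaky"])
```

Re-raising once the limit is hit is what lets supervisors be stacked into a tree: a failure that one level cannot fix by restarting is passed up to the next level.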

It's claimed that these properties have led to Erlang systems being built with 99.9999999% (nine 9s) reliability.

--
Rick Moynihan
Software Engineer
Calico Jack LTD
http://www.calicojack.co.uk/
