On 25/05/2016 22:40, Dan Carpenter wrote:

So I am curious about using elixir to build a distributed cron system.

...

We currently have thousands of jobs per node held in memory; unfortunately, when a node goes down, so do its jobs. When a cron job is due to run, the payload in memory is sent for processing and execution. We are trying to think through a way to have another node take over a failed node's jobs.


A couple of people are leaning towards a ZooKeeper master/slave system to solve notifications, but we are still faced with how to quickly have another node take over a failed node's jobs.


So, the use case gets in the way a little here and makes this sound more complex than it needs to. Yet the underlying problem actually is incredibly complex, subtle, and hard to get right...

So, competing goals:
- Single point of truth, because you don't want two nodes running the same job (i.e. we want each job to run at least once, preferably at most once; the latter can't be guaranteed, hence make jobs idempotent where possible)
- Distributed truth, in case our single point of knowledge dies...
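As an aside on the idempotency point: "at most once per scheduled slot" usually boils down to deduplicating on a (job, slot) key before executing. A minimal sketch in Python (names like `IdempotentRunner` are my own invention; in a real system the seen-set would live in shared storage, not process memory):

```python
import threading

class IdempotentRunner:
    """Illustrative toy: run each job at most once per scheduled slot
    by recording a dedup key before executing. The 'seen' set is
    in-memory here; a real deployment would keep it in shared storage."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def run(self, job_id, scheduled_at, action):
        key = (job_id, scheduled_at)  # one key per job per slot
        with self._lock:
            if key in self._seen:
                return False          # duplicate delivery: skip re-run
            self._seen.add(key)
        action()
        return True
```

With this shape, a job that gets delivered twice (e.g. after a failover) simply no-ops the second time, which is what makes "at least once" delivery tolerable.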

So, you need a distributed log with ordered events:

- The cheat's way is to slap Redis in and hope no one notices the single point of failure

- A better way is to build a distributed consensus service yourself (hard: it's easy to get the theoretical algorithm right, yet in practice implementations break in the real world)

- The alternative is to grab someone else's already-debugged system, e.g. ZooKeeper
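Whichever option you pick, the interface you're buying is roughly the same thing. A toy in-memory stand-in for it (Python for brevity, names made up; the hard part is replicating this safely, which is exactly what the options above trade off):

```python
class OrderedLog:
    """Toy stand-in for an ordered event log: every append gets a
    monotonically increasing sequence number, and readers replay
    entries in that order. Real versions (Redis, ZooKeeper, a Raft
    log) replicate this so it survives node failure."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        seq = len(self._entries)          # next sequence number
        self._entries.append((seq, event))
        return seq

    def read_from(self, seq=0):
        # replay everything at or after a given sequence number
        return self._entries[seq:]
```

A node taking over just replays the log from its last known sequence number; the ordering is what lets everyone agree on which jobs already fired.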


Now, grab some papers on ZooKeeper, Paxos and Raft. Fascinating stuff, and these algorithms provably allow you to create a distributed consensus on some "truth". However, there are several problems:

1) It's expensive to do this for every decision...

2) It's extremely complicated to get the implementation correct...

One common way to solve 1) is to use consensus ONLY to elect a single point of truth, then stick with that single point until "something happens" (tm), and then elect a new single point of truth
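To make the "elect once, then stick until something happens" pattern concrete, here's a deliberately simplified lease-based sketch (Python, all names invented; a single-process toy with an injectable clock, where a real system would keep the lease in ZooKeeper or similar rather than in memory):

```python
import time

class LeaseLeadership:
    """Toy lease-based leadership: whoever holds an unexpired lease
    is leader; heartbeating renews it, and another node takes over
    only once the lease has lapsed. In production the lease record
    lives in a consensus service, not in one process's memory."""

    def __init__(self, lease_seconds=10.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock       # injectable for testing
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id):
        now = self.clock()
        if self.holder is None or now >= self.expires_at:
            self.holder = node_id            # previous leader lapsed
            self.expires_at = now + self.lease_seconds
        if self.holder == node_id:
            # heartbeat: the current leader renews its lease
            self.expires_at = now + self.lease_seconds
            return True
        return False
```

Note the failure mode this buys you: if the leader stops heartbeating (crash, partition), leadership moves only after the lease expires, so there's a bounded window where no one is running jobs rather than a window where two nodes are.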

As an incremental improvement, to avoid the "single point" becoming a bottleneck, you can use a consistent hash (insert favourite technique here) to split the load across multiple processes. However, this is really just the same thing: it's like saying that server over there handles everything starting with "a", that one handles "b", and so on. Really it's no different from saying "that server does everything"; you've just partitioned the problem so that "everything" has a smaller scope.
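The consistent-hash partitioning above can be sketched like so (Python, purely illustrative, `HashRing` is a name I made up): a ring with virtual nodes, where each job maps to the first node clockwise from its hash, so removing a node only moves the jobs that lived on its arcs:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # stable 64-bit hash; any good hash works, but it must be
    # consistent across nodes (so not Python's builtin hash())
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class HashRing:
    """Toy consistent-hash ring: each node owns several arcs
    (virtual nodes) to smooth the distribution; a job belongs to
    the first ring point at or after its own hash, wrapping around."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, job_id: str) -> str:
        idx = bisect.bisect(self._points, _hash(job_id)) % len(self._ring)
        return self._ring[idx][1]
```

The useful property: when a node dies and you rebuild the ring without it, every job whose owner is still alive keeps the same owner; only the dead node's jobs get reassigned.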


So many people use ZooKeeper (et al.) as a way of electing a leader (a single point of truth) for all decisions (or a partition of them), and that leader then ensures decisions are serialised and acknowledged appropriately.

There are lots of interesting ways to partition the load; "rings", for example, are quite popular at the moment (e.g. Riak).


Of course, if that all sounds complex... I guess it is. Most people cheat and claim all kinds of perfection, but if you really, really care about things working in the presence of failure, then you HAVE to do something like the above. Anything else will work "most of the time", which may be good enough (?), but at least understand why being properly robust is different and hard...

This link should be fascinating if you really want to see whether you are safe in the face of failure!
    https://aphyr.com/tags/Jepsen

Good luck!

Ed W

--
You received this message because you are subscribed to the Google Groups 
"elixir-lang-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elixir-lang-talk/86dd2044-8ca5-487a-b1a7-862a49487461%40wildgooses.com.
For more options, visit https://groups.google.com/d/optout.
