On 25/05/2016 22:40, Dan Carpenter wrote:

So I am curious about using elixir to build a distributed cron system.

...

We currently have thousands of jobs per node held in memory; unfortunately, when a node goes down, so do its jobs. When a cron job is due to run, the payload in memory is sent for processing and execution. We are trying to think through a way to have another node take over a failed node's jobs.


A couple of people are leaning towards a ZooKeeper master/slave system to solve notifications, but we are still faced with how to quickly have another node take over a failed node's jobs.


So, the use case gets in the way a little here and makes this sound more complex than it needs to. Yet the underlying problem actually is incredibly complex, subtle, and hard to get right...

So, competing goals:
- Single point of truth, because you don't want two nodes running the same job (i.e. we want each job to run at least once, preferably at most once; the latter can't be guaranteed, hence make jobs idempotent where possible)
- Distributed truth, in case our single point of knowledge dies...
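As an aside on the idempotency point: "at most once per scheduled slot" usually boils down to deduplicating on a (job, slot) key before executing. A minimal sketch in Python (names like `IdempotentRunner` are my own invention; in a real system the seen-set would live in shared storage, not process memory):

```python
import threading

class IdempotentRunner:
    """Illustrative toy: run each job at most once per scheduled slot
    by recording a dedup key before executing. The 'seen' set is
    in-memory here; a real deployment would keep it in shared storage."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def run(self, job_id, scheduled_at, action):
        key = (job_id, scheduled_at)  # one key per job per slot
        with self._lock:
            if key in self._seen:
                return False          # duplicate delivery: skip re-run
            self._seen.add(key)
        action()
        return True
```

With this shape, a job that gets delivered twice (e.g. after a failover) simply no-ops the second time, which is what makes "at least once" delivery tolerable.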

So, you need a distributed log with ordered events:

- The cheat's way is to slap Redis in and hope no one notices the single point of failure

- A better way is to build a distributed consensus service yourself (hard: it's easy to get the theoretical algorithm right, yet in practice implementations break in the real world)

- The alternative is to grab someone else's already-debugged system, e.g. ZooKeeper
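Whichever option you pick, the interface you're buying is roughly the same thing. A toy in-memory stand-in for it (Python for brevity, names made up; the hard part is replicating this safely, which is exactly what the options above trade off):

```python
class OrderedLog:
    """Toy stand-in for an ordered event log: every append gets a
    monotonically increasing sequence number, and readers replay
    entries in that order. Real versions (Redis, ZooKeeper, a Raft
    log) replicate this so it survives node failure."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        seq = len(self._entries)          # next sequence number
        self._entries.append((seq, event))
        return seq

    def read_from(self, seq=0):
        # replay everything at or after a given sequence number
        return self._entries[seq:]
```

A node taking over just replays the log from its last known sequence number; the ordering is what lets everyone agree on which jobs already fired.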


Now, grab some papers on ZooKeeper, Paxos and Raft. Fascinating stuff, and these algorithms provably allow you to create a distributed consensus on some "truth". However, there are several problems:

1) It's expensive to do this for every decision...

2) It's extremely complicated to get the implementation correct...

One common way to solve 1) is to use consensus ONLY to elect a single point of truth, then stick with that single point until "something happens" (tm), and then elect a new single point of truth
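To make the "elect once, then stick until something happens" pattern concrete, here's a deliberately simplified lease-based sketch (Python, all names invented; a single-process toy with an injectable clock, where a real system would keep the lease in ZooKeeper or similar rather than in memory):

```python
import time

class LeaseLeadership:
    """Toy lease-based leadership: whoever holds an unexpired lease
    is leader; heartbeating renews it, and another node takes over
    only once the lease has lapsed. In production the lease record
    lives in a consensus service, not in one process's memory."""

    def __init__(self, lease_seconds=10.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock       # injectable for testing
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id):
        now = self.clock()
        if self.holder is None or now >= self.expires_at:
            self.holder = node_id            # previous leader lapsed
            self.expires_at = now + self.lease_seconds
        if self.holder == node_id:
            # heartbeat: the current leader renews its lease
            self.expires_at = now + self.lease_seconds
            return True
        return False
```

Note the failure mode this buys you: if the leader stops heartbeating (crash, partition), leadership moves only after the lease expires, so there's a bounded window where no one is running jobs rather than a window where two nodes are.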

As an incremental improvement, to avoid the "single point" becoming a bottleneck, you can use a consistent hash (insert favourite technique here) to split the load across multiple processes. However, this is really just the same thing: it's like saying that server over there handles everything starting with "a", that one handles "b", and so on. Really it's no different from saying "that server does everything"; you've just partitioned the problem so that "everything" has a smaller scope.
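The consistent-hash partitioning above can be sketched like so (Python, purely illustrative, `HashRing` is a name I made up): a ring with virtual nodes, where each job maps to the first node clockwise from its hash, so removing a node only moves the jobs that lived on its arcs:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # stable 64-bit hash; any good hash works, but it must be
    # consistent across nodes (so not Python's builtin hash())
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class HashRing:
    """Toy consistent-hash ring: each node owns several arcs
    (virtual nodes) to smooth the distribution; a job belongs to
    the first ring point at or after its own hash, wrapping around."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, job_id: str) -> str:
        idx = bisect.bisect(self._points, _hash(job_id)) % len(self._ring)
        return self._ring[idx][1]
```

The useful property: when a node dies and you rebuild the ring without it, every job whose owner is still alive keeps the same owner; only the dead node's jobs get reassigned.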


So many people use ZooKeeper (et al.) as a way of electing a leader (a single point of truth) for all decisions (or a partition of them), and that leader then ensures decisions are serialised and acknowledged appropriately.

There are lots of interesting ways to partition the load; "rings", for example, are quite popular at the moment (e.g. Riak).


Of course, if that all sounds complex... I guess it is. Most people cheat and claim all kinds of perfection, but if you really, really care about things working in the presence of failure, then you HAVE to do something like the above. Anything else will work "most of the time", which may be good enough (?), but at least understand why being properly robust is different and hard...

This link should be fascinating if you really want to see whether you are safe in the face of failure!
    https://aphyr.com/tags/Jepsen

Good luck!

Ed W

--
You received this message because you are subscribed to the Google Groups 
"elixir-lang-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elixir-lang-talk/86dd2044-8ca5-487a-b1a7-862a49487461%40wildgooses.com.
For more options, visit https://groups.google.com/d/optout.
