On 25/05/2016 22:40, Dan Carpenter wrote:
So I am curious about using Elixir to build a distributed cron system.
...
We currently have thousands of jobs per node in memory; unfortunately,
when a node goes down, so do its jobs. When a cron job needs to run,
the in-memory payload is sent for processing and execution. We are
trying to think through a way to have another node take over a failed
node's jobs.
A couple of people are leaning towards a ZooKeeper master/slave
system to solve notifications, but we are still faced with how
to quickly have another node take over a failed node's jobs.
So, the use case gets in the way a little here and makes things sound
more complex than they are. Yet the underlying problem really is
incredibly complex, subtle, and hard to get right...
So, competing goals:
- Single point of truth, because you don't want two nodes running the
same job (i.e. we want each job to run at least once, and preferably
also at most once; the second can't be guaranteed, hence make jobs
idempotent where possible)
- Distributed truth, in case our single point of knowledge dies...
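To make that at-least-once goal safe, the classic trick is a dedupe key per scheduled run. Here is a minimal sketch in Python (all names are illustrative; in practice the `completed` set would be durable storage, e.g. a database table with a unique constraint):

```python
completed = set()   # stand-in for durable storage with a unique constraint
side_effects = []   # stand-in for the job's real work

def run_job(job_id, run_at, payload):
    """Run a job at most once per (job_id, scheduled time)."""
    dedupe_key = (job_id, run_at)
    if dedupe_key in completed:
        return "skipped"          # a retry, or another node, already ran it
    side_effects.append(payload)  # the real work happens here
    completed.add(dedupe_key)     # record success *after* the work
    return "ran"

# Two deliveries of the same scheduled run (e.g. a retry after failover):
print(run_job("report", "2016-05-25T22:00", "build report"))  # ran
print(run_job("report", "2016-05-25T22:00", "build report"))  # skipped
```

Note that recording success *after* the work means a crash between those two lines re-runs the job; that is exactly the "at least once, can't guarantee at most once" trade-off above.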
So, you need a distributed log, with ordered events
- Cheat way is to slap Redis in and hope no one notices the single point
of failure
- Better way is to build a distributed consensus service yourself (hard:
it's easy to get the theoretical algorithm right, but in practice
implementations break in the real world)
- Alternative way is to grab someone's already-debugged system, e.g.
ZooKeeper
Now, grab some papers on ZooKeeper, Paxos and Raft. Fascinating stuff,
and provably these algorithms allow you to create a distributed
consensus on some "truth". However, there are several problems:
1) It's expensive to do this for every decision...
2) It's extremely complicated to get the implementation correct...
One common way to solve 1) is to use the consensus ONLY to elect a
single point of truth, then stick with that single point until
"something happens" (tm), and then elect a new single point of truth.
As an incremental improvement, to avoid the "single point" becoming a
bottleneck you can use some consistent hash (insert favourite technique
here) to split the load across multiple processes. However, this is
really just the same thing: it's like saying that server over there does
everything starting with "a" and the one over there does "b", etc.
Really it's no different to saying "that server does everything"; you've
just partitioned the problem so "everything" is a smaller scope.
So many people use Zookeeper (et al) as a way of electing a leader
(single point of failure) for all (or a partition of all) decisions and
then that leader ensures decisions are serialised and acknowledged
appropriately.
There are lots of interesting ways to partition the load; e.g. "rings"
are quite popular at the moment (see Riak).
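The ring idea can be made concrete. Below is an illustrative consistent-hash ring (roughly the technique behind Riak's ring, not anyone's production code): jobs hash onto a circle, each belongs to the next node clockwise, and removing a node only reassigns that node's share.

```python
import bisect
import hashlib

def _hash(key):
    # stable hash onto the ring, independent of Python's per-run seeding
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        # virtual nodes smooth out the load between physical nodes
        self._points = sorted((_hash("%s#%d" % (n, i)), n)
                              for n in nodes for i in range(vnodes))

    def owner(self, job_id):
        keys = [p for p, _ in self._points]
        # next point clockwise from the job's position, wrapping at the top
        idx = bisect.bisect(keys, _hash(job_id)) % len(self._points)
        return self._points[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
smaller = Ring(["node-a", "node-c"])  # the ring after node-b dies

# Only node-b's jobs should be reassigned; node-a and node-c keep theirs.
jobs = ["job-%d" % i for i in range(100)]
moved = [j for j in jobs
         if ring.owner(j) != "node-b" and ring.owner(j) != smaller.owner(j)]
print(len(moved))  # 0 - surviving nodes' jobs stay put
```

This is the "partition the problem so 'everything' is a smaller scope" point from above: each node is still the single point of truth, but only for its arc of the ring.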
Of course, if that all sounds complex... I guess it is. Most people
cheat and claim all kinds of perfection, but if you really, really care
about stuff working in the presence of failure, then you HAVE to do
something like the above. Anything else will work "most of the time",
which may be good enough (?), but at least understand why being properly
robust is different and hard...
This link should be fascinating if you want to really see if you are
safe in the face of failure!
https://aphyr.com/tags/Jepsen
Good luck!
Ed W
--
You received this message because you are subscribed to the Google Groups
"elixir-lang-talk" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elixir-lang-talk/86dd2044-8ca5-487a-b1a7-862a49487461%40wildgooses.com.