On 4 August 2014 10:41:04 CEST, Martin Vaeth <mar...@mvath.de> wrote:
>J. Roeleveld <jo...@antarean.org> wrote:
>>
>> With the kind of schedules I am working with (and I believe Alan will
>> also end up with), restarting the whole process from the start can
>> lead to issues.
>> Finding out how far the process got before the service crashed can
>> become rather complex.
>
>I am not sure whether I understand this correctly:

The schedules I usually deal with easily span 8 to 14 hours, and occasionally run for more than a week. When one of them stops halfway through, it can't simply be restarted from the beginning without risking massive consistency problems in the final data.

On top of that, several of these schedules start at random times, and occasionally a whole batch of runs of the same schedule is put into the queue, each depending on the previous run.

If, during that time, one of the machines has a hardware failure, or the scheduling process crashes on one or more of the servers, the last state needs to be recoverable. If you have to clean up the environment and bring it back to a state where you can restart the schedules, it saves time if you know which commands and tasks were actually running at the time.

For this, the schedules, queues and current state of each node need to be stored on persistent storage.

Hope this clarifies it all a bit.

--
Joost
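
P.S. To illustrate what I mean by keeping the state on persistent storage, here is a minimal sketch in Python. It is not how any particular scheduler does it: the SQLite file, the table layout and the names open_journal, record and interrupted are all made up for this example. The idea is only that every state change is written to disk before the scheduler acts on it, so that after a crash you can ask which tasks were still marked as running.

```python
import sqlite3


def open_journal(path):
    """Open (or create) the on-disk state journal (hypothetical layout)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS task_state ("
        " schedule TEXT, task TEXT, node TEXT, state TEXT,"
        " PRIMARY KEY (schedule, task))"
    )
    db.commit()
    return db


def record(db, schedule, task, node, state):
    """Store the latest state of one task and commit immediately."""
    db.execute(
        "INSERT OR REPLACE INTO task_state VALUES (?, ?, ?, ?)",
        (schedule, task, node, state),
    )
    db.commit()


def interrupted(db):
    """Return (schedule, task, node) for tasks still marked as running."""
    rows = db.execute(
        "SELECT schedule, task, node FROM task_state"
        " WHERE state = 'running' ORDER BY schedule, task"
    )
    return rows.fetchall()
```

After a restart, the scheduler would call open_journal on the same file and then interrupted to get the list of tasks that need cleanup before anything is queued again. A real setup would also have to persist the queues and the dependencies between runs, which this sketch leaves out.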