On 4 August 2014 15:31:40 CEST, Martin Vaeth <mar...@mvath.de> wrote:
>J. Roeleveld <jo...@antarean.org> wrote:
>>>
>>> So you have a command which might break due to hardware error
>>> and cannot be rerun. I cannot see how any general-purpose scheduler
>>> might help you here: You either need to be able to split your command
>>> into several (sequential) commands or you need something adapted
>>> for your particular command.
>>
>> A general-purpose scheduler can work, as they do exist.
>
>I doubt that they can solve your problem.
>Let me repeat: You have a single program which accesses the database
>in a complex way and somewhere in the course of accessing it, the
>machine (or program) crashes.
>No general-purpose program can recover from this: You need
>particular knowledge of the database and the program if you even
>want to have a *chance* to recover from such a situation.
>A program with such particular knowledge can hardly be called
>"general-purpose".
The scheduler needs to be able to show which process failed or didn't
finish. Then humans need to ensure that part finishes or reruns
properly, and they need to be able to mark the failed process as
succeeded. At that point the scheduler continues with the schedule(s).

>> If, during one of these steps, the database or ETL process suffers a
>> crash, the activities of the ETL process need to be rolled back to
>> the point where you can restart it.
>
>I agree, but you need particular knowledge of the database and
>your tasks to do this which is far beyond the job of a scheduler.
>As already mentioned by someone in this thread, your problem needs
>to be solved on the level of the database (using
>snapshot capabilities etc.)

Or by human intervention. That requires a clear indication of where it
went wrong, plus a simple action to continue the schedule from where it
was once these humans have solved the issues and ensured consistency.

>>> In order to deal with case 1., you can regularly (e.g. each minute)
>>> dump the output of "schedule list" (possibly suppressing non-important
>>> data through the options to keep it short).
>>
>> Or all the necessary information is kept in-sync on persistent storage.
>> This would then also allow easy fail-over if the master-schedule-node
>> fails
>
>No, it wouldn't, since jobs just finishing and wanting to report their
>status cannot do this when there is no server. You would need a rather
>involved protocol to deal with such situations dynamically.
>It can certainly be done, but it is not something which can
>easily be "added" as a feature: If this is required, it has to be the
>fundamental concept from the very beginning and everything else has to
>follow this first aim. You need different protocols than TCP sockets,
>to start with; something like "dbus over IP" with servers being able
>to announce their new presence, etc.

I think it's doable with standard networking protocols.
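To make the idea concrete, here is a minimal sketch (all names and the
state-file location are hypothetical, not from any existing scheduler)
of the workflow described above: job states are kept in sync on
persistent storage, a job only runs once its predecessors succeeded,
and a human can mark a failed step as succeeded so the schedule
continues. Because every transition is written to disk, a restarted or
failover scheduler node can pick up where the previous one stopped.

```python
import json
import pathlib

STATE_FILE = pathlib.Path("schedule_state.json")  # hypothetical location

def load_state():
    """Read the persisted job-state table (empty on first run)."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}

def save_state(state):
    # Written after every transition, so a restarted (or failover)
    # scheduler node sees exactly where the last one stopped.
    STATE_FILE.write_text(json.dumps(state))

def record(job, status):
    """Persist a status change ("running", "failed", "succeeded")."""
    state = load_state()
    state[job] = status
    save_state(state)

def runnable(job, deps):
    # A job may start only when all its predecessors succeeded --
    # whether on their own or because a human marked them as such.
    state = load_state()
    return all(state.get(d) == "succeeded" for d in deps)

def mark_succeeded(job):
    # The manual override: after a human has verified or rerun the
    # work, the failed step is marked done and the schedule continues.
    record(job, "succeeded")
```

For example, if an ETL "extract" step crashes, `runnable("transform",
["extract"])` stays false until an operator calls
`mark_succeeded("extract")`; a real implementation would add locking
and atomic writes, but the recovery logic is the same.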
But either you have a single master server which controls everything,
or you have a master process with failover functionality built using
classical distributed-software techniques.

These emails are actually quite useful, as I am getting a clear picture
in my head of how I could approach this properly.

Thanks,

Joost
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.