On 4 August 2014 15:31:40 CEST, Martin Vaeth <mar...@mvath.de> wrote:
>J. Roeleveld <jo...@antarean.org> wrote:
>>>
>>> So you have a command which might break due to hardware error
>>> and cannot be rerun. I cannot see how any general-purpose scheduler
>>> might help you here: You either need to be able to split your
>command
>>> into several (sequential) commands or you need something adapted
>>> for your particular command.
>>
>> A general-purpose scheduler can work, as they do exist.
>
>I doubt that they can solve your problem.
>Let me repeat: You have a single program which accesses the database
>in a complex way and somewhere in the course of accessing it, the
>machine (or program) crashes.
>No general-purpose program can recover from this: You need
>particular knowledge of the database and the program if you even
>want to have a *chance* to recover from such a situation.
>A program with such a particular knowledge can hardly be called
>"general-purpose".

The scheduler needs to be able to show which job failed or didn't finish.
A human then ensures that step completes or is rerun properly, and finally
marks the failed job as succeeded.

At which point the scheduler continues with the schedule(s).
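To make that concrete, here is a minimal sketch of the idea (all names
are hypothetical, and I'm assuming the per-job state would really live
in persistent storage rather than in memory): the scheduler stops at the
first failed job, reports it, and only continues once a human has marked
it as succeeded.

```python
from enum import Enum

class State(Enum):
    PENDING = "pending"
    FAILED = "failed"
    SUCCEEDED = "succeeded"

class Schedule:
    """Tracks per-job state so a failed job blocks the schedule
    until a human marks it as succeeded."""

    def __init__(self, jobs):
        self.order = list(jobs)
        self.state = {name: State.PENDING for name in jobs}

    def run(self, runner):
        """Run jobs in order; stop at and report the first failure."""
        for name in self.order:
            if self.state[name] is State.SUCCEEDED:
                continue  # already done in an earlier pass
            try:
                runner(name)
                self.state[name] = State.SUCCEEDED
            except Exception:
                self.state[name] = State.FAILED
                return name  # tell the human which job failed
        return None  # whole schedule finished

    def mark_succeeded(self, name):
        """Human override: the step was fixed or rerun by hand."""
        self.state[name] = State.SUCCEEDED
```

After `mark_succeeded()` a second `run()` simply skips the already
succeeded jobs and picks up where the schedule stopped.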

>> If, during one of these steps, the database or ETL process suffers a
>> crash, the activities of the ETL process need to be rolled back to
>> the point where you can restart it.
>
>I agree, but you need particular knowledge of the database and
>your tasks to do this which is far beyond the job of a scheduler.
>As already mentioned by someone in this thread, your problem needs
>to be solved on the level of the database (using
>snapshot capabilities etc.)

Or human intervention. That requires a clear indication of where it went wrong,
and a simple action to continue the schedule from that point once these humans
have solved the issues and ensured consistency.

>>> In order to deal with case 1., you can regularly (e.g. each minute)
>>> dump the output of "schedule list" (possibly suppressing
>non-important
>>> data through the options to keep it short).
>>
>> Or all the necessary information is kept in-sync on persistent
>storage.
>> This would then also allow easy fail-over if the master-schedule-node
>> fails
>
>No, it wouldn't, since jobs just finishing and wanting to report their
>status cannot do this when there is no server. You would need a rather
>involved protocol to deal with such situations dynamically.
>It can certainly be done, but it is not something which can
>easily be "added" as a feature: If this is required, it has to be the
>fundamental concept from the very beginning and everything else has to
>follow this first aim. You need different protocols than TCP sockets,
>to start with; something like "dbus over IP" with servers being able
>to announce their new presence, etc.

I think it's doable with standard networking protocols.
But either you have a single master server which controls everything, or you
have a master process with failover functionality built using classical
distributed-software techniques.
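One classical technique I have in mind is a lease on shared persistent
storage: the master periodically renews a lease, and a standby takes over
only when the lease has expired. A rough sketch (names and the in-memory
store are hypothetical; in practice the lease would be a row in the
database both nodes can reach):

```python
import time

LEASE_SECONDS = 5.0  # assumed lease duration; tune to heartbeat interval

class LeaseStore:
    """Stands in for shared persistent storage (e.g. a database row)
    holding the current master's identity and lease expiry."""

    def __init__(self):
        self.owner = None
        self.expires = 0.0

    def try_acquire(self, node, now=None):
        """Become (or remain) master if we already hold the lease,
        nobody holds it, or the current lease has expired."""
        now = time.monotonic() if now is None else now
        if self.owner in (None, node) or now >= self.expires:
            self.owner = node
            self.expires = now + LEASE_SECONDS
            return True
        return False
```

Each node would call `try_acquire()` on a timer; only the node for which
it returns True acts as master, so a crashed master is replaced after at
most one lease period.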

These emails are actually quite useful, as I am getting a clear picture in my
head of how I could approach this properly.

Thanks,

Joost

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
