On 04/08/2014 21:46, J. Roeleveld wrote:
> On 4 August 2014 15:35:41 CEST, Alan McKinnon <alan.mckin...@gmail.com> wrote:
>> On 04/08/2014 15:31, Martin Vaeth wrote:
>>> J. Roeleveld <jo...@antarean.org> wrote:
>>>>>
>>>>> So you have a command which might break due to hardware error
>>>>> and cannot be rerun. I cannot see how any general-purpose scheduler
>>>>> might help you here: You either need to be able to split your
>> command
>>>>> into several (sequential) commands or you need something adapted
>>>>> for your particular command.
>>>>
>>>> A general-purpose scheduler can work, as they do exist.
>>>
>>> I doubt that they can solve your problem.
>>> Let me repeat: You have a single program which accesses the database
>>> in a complex way and somewhere in the course of accessing it, the
>>> machine (or program) crashes.
>>> No general-purpose program can recover from this: You need
>>> particular knowledge of the database and the program if you even
>>> want to have a *chance* to recover from such a situation.
>>> A program with such a particular knowledge can hardly be called
>>> "general-purpose".
>>
>>
>> Joost,
>>
>> Either make the ETL tool pick up where it stopped and continue, as it
>> is the only one that knows what it was doing and how far it got. Or
>> wrap the entire script in a single transaction.
> 
> Alan,
> 
> That would be the ideal solution.

You have the same concern I do - how do you wrap a transaction around
500 million rows? So I asked the in-house expert - Mrs Alan :-)


> However, a single transaction dealing with around 500,000,000 rows will get 
> me shot by the DBAs :)
> (Never mind that the performance of this will be such that having it all done 
> by an office full of secretaries might be quicker.)

She reckons an ETL job *must* be self-contained; if it isn't, then it's
broken by design. It must be idempotent too, which can be as simple as
"Truncate, Load, Commit".

> Having the ETL process clever enough to be able to pick up from any point 
> requires a degree of forward thinking and planning that is never done in real 
> life.
> I would love to design it like that as it isn't too difficult. But I always 
> get brought into these projects when implementing these structures will 
> require a full rewrite and getting the original architects to admit their 
> design can't be made restartable without human intervention.


I actually agree with that design - it's the job of the hardware and OS
guys to provide a reliable layer that the application can depend on. When
a SAN connection goes away, it usually comes back and the app layer just
carries on (never mind that it retried 100 times in the meantime).

Sometimes this doesn't work out. The easiest, cheapest and quickest way
to handle it is to just restart the whole job from the beginning. This
offends the engineer in us sometimes, but it really is the best way and
all of Unix is built on this very idea :-)
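
To make that concrete (a toy wrapper in Python; the job path, retry
count and sleep interval are all made up):

  import subprocess, time

  # Hypothetical job script - the point is that "recovery" is a
  # full rerun of an idempotent job, not a resume-from-checkpoint.
  for attempt in range(3):
      if subprocess.run(["/opt/etl/run_job.sh"]).returncode == 0:
          break             # a clean run from scratch is the fix
      time.sleep(60)        # give the SAN a minute to come back
  else:
      raise SystemExit("job failed 3 times - wake a human up")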

If the SAN goes away too often and causes issues, then maybe the best
approach is to get the SAN and facilities guys to get their act together.

> At which point the business simply says it is acceptable to have people do a 
> manual rollback and restart the schedules from wherever it went wrong.

Exactly. One of the few cases where business has the right idea.
There's only so many pennies to spend and so many dollars to be delivered.


> 
> I'm sure your wife has similar experiences as this is why these projects are 
> always late to deliver and over budget.

She says her projects are subject to the same universal, inviolable rule
as mine:

time and cost are always the best engineering estimate times pi

We learn to deal with it. Which brings us back to Martin's initial
statement: a scheduler cannot deal with any of this; the job itself
must. These are unpredictable events, and schedulers can only deal with
predictable ones.


-- 
Alan McKinnon
alan.mckin...@gmail.com

