Well, given that I work for a financial institution, I can say that in many cases
"stopping everything" is exactly what DOES happen.  "Charging ahead", knowing
you're dealing with potentially corrupted data and not knowing the extent of the
problem, is irresponsible.

Aside from the financial implications, we have responsibilities to the customers
and regulatory agencies.  Fortunately, situations like this are very rare, and
the folks here are VERY good at fixing things quickly.  But 0C7s in batch jobs
do happen, and they still get fixed manually.
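
Purely as an illustration, the "bread crumbs" approach Dale describes below
tends to look something like this minimal sketch (Python rather than actual
batch COBOL, and transactions.csv, quarantine.csv, and process() are all
hypothetical names): validate each record before using it, divert anything
suspect to a quarantine file with enough context to find and fix it later, and
keep the batch moving.

import csv
import logging
from decimal import Decimal, InvalidOperation

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("batch")

def process(record):
    # Stand-in for the real business logic (posting, settlement, etc.).
    return Decimal(record["amount"])

def run_batch(infile="transactions.csv", errfile="quarantine.csv"):
    good = bad = 0
    with open(infile, newline="") as fin, \
         open(errfile, "w", newline="") as ferr:
        reader = csv.DictReader(fin)
        quarantine = csv.writer(ferr)
        quarantine.writerow(["line", "reason", "raw"])
        for lineno, record in enumerate(reader, start=2):  # line 1 is the header
            try:
                # Validate the field that would otherwise blow up mid-run,
                # the moral equivalent of checking packed-decimal data
                # before arithmetic raises a data exception.
                Decimal(record["amount"])
            except (InvalidOperation, TypeError, KeyError) as exc:
                # Bread crumb: where it failed, why, and the raw record.
                quarantine.writerow([lineno, repr(exc), record])
                log.warning("line %d quarantined: %s", lineno, exc)
                bad += 1
                continue
            process(record)
            good += 1
    log.info("processed %d records, quarantined %d", good, bad)

if __name__ == "__main__":
    run_batch()

The manual cleanup then works from the quarantine file rather than from a dump
of an abended job.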

For the REALLY critical stuff, parallel redundant systems are used (Tandem, etc.), on 
the theory that a single failure can't knock down more than part of the application.

In a previous job I worked for a hospital.  Most of the systems we managed were
NOT involved in direct patient care, and it's a good thing.  When we DID start
getting involved in that area, it became VERY scary.


>
> This is good *IF* it is not a critical system.  If the application is moving
> billions of financial transactions around the world and it costs brokers
> millions of dollars for every minute of down time, just "stop everything,
> and let someone fix it" is not a good answer.  The application needs to
> identify the failure point, establish what is likely good or bad data, and
> charge ahead.  (After leaving a solid trail of bread crumbs for someone to
> follow....)
>
> - Dale
>
