Back in the day, we worked on RAS. We put in error detection
hardware (sometimes that was "firmware", or macrocode), and IBM and
all our competitors were doing the same. The idea was to have
redundant power supplies so that a CE could do maintenance without
taking down the system and, where possible, redundant channel
paths to a device controller so that you could pull a channel
cable and replace it.
Today, with IBM, you can add or remove CPUs while the machine
is running. But, at least with the z15s, you could not add RAM
without taking the system down, as in powering it down.
That would be a RAS hit, or cause you to miss your 99.999%
target.
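For scale, here is a rough sketch of how much downtime a target like
that leaves you per year. It assumes a 365-day year and that every
minute of outage, planned or not, counts against the target; the
function name is made up for illustration.

```python
# Downtime budget implied by an availability target.
# Assumption: a 365-day year, and all outage minutes count.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of outage per year allowed by the given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes/year")
```

Under those assumptions, five nines works out to a little over five
minutes a year, so a single power-down longer than that would use up
the whole budget by itself.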
For people who do hardware and, to some degree, software (O/S
stuff), you do all you can to recover from any problem. I like VM
and its ability to see that it is injured and IPL itself. But,
to keep those SLAs, there is SSI: an LPAR can move its
workload to another LPAR (pairs determined in advance) and
keep that work running. We did this at a large health insurer so
that we could do VM upgrades with no outages.
So how you measure uptime depends on the equipment and its
ability to do hot swap and the like, so that you do not take an
outage.
What happens if a Wintel server running MQ buys the farm? The
in-flight transactions going through that server may time out and
have to be re-driven. Is this considered an outage? Not if you
have a second one handling the load and it takes over. But that
one user, or 10(?), may see an error message. Does that count as
an outage if the user only loses a few seconds in getting an
answer? What about a pharmacy waiting on info? Or an OR waiting
on drug interaction info?
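To make the question concrete, here is a small sketch of how the same
incident log gives different downtime numbers depending on what the
SLA counts as an outage. Everything in it is hypothetical: the
Incident record, the thresholds, and the sample numbers are made up
for illustration and do not come from any real system or SLA.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    seconds: float       # how long users were affected
    users_affected: int  # how many users saw an error or a delay


def counted_downtime(incidents, min_seconds=0.0, min_users=1):
    """Sum the seconds of only those incidents that meet the
    (hypothetical) SLA definition of an outage."""
    return sum(
        i.seconds
        for i in incidents
        if i.seconds >= min_seconds and i.users_affected >= min_users
    )


# Made-up sample data.
incidents = [
    Incident(seconds=5, users_affected=10),       # failover, work re-driven
    Incident(seconds=1800, users_affected=5000),  # full outage
]

strict = counted_downtime(incidents)                   # every blip counts
lenient = counted_downtime(incidents, min_seconds=60)  # ignore short blips
print(f"strict: {strict} s, lenient: {lenient} s")
```

The failover blip shows up in the strict count and disappears in the
lenient one, which is exactly the difference in perspective being
asked about.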
Need some perspective.
Steve Thompson
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN