On 8/3/23 12:47 PM, Joel C. Ewing wrote:
The hardware is designed with redundancy to detect failures in
components (processors, memory, I/O subsystems, interconnection cables),
correct any resulting data errors where possible, retry a failed
operation using different hardware components where appropriate, vary a
failing component off line, and in many cases allow concurrent repair of
failing components while production continues. Undetected hardware
errors don't happen.
Save for retrying a failed operation, the rest of those statements
aren't specific to IBM mainframes.
I remember reading about a Unix server being demonstrated at a trade
show that was running applications interactively wherein the
demonstrators removed all but one CPU book from the system, reinserted
the removed CPU books, then removed the one they hadn't removed, and
then reinserted it. At a later demonstration they took a cup of water
and poured it into the top of the system. What was running continued to
run in both demonstrations. The real-time demo programs didn't even
stutter. What was obvious was that other non-real-time programs running
on the system slowed down as the OS reacted to hardware going offline
and rescheduling tasks on the remaining online CPUs. Monitoring agents
lit up like a Christmas tree as they removed CPU books but became
happier as they were re-inserted.
My understanding was that this was a system that was shipping in the mid
to late '90s and people were buying them, so it was not a demonstration
special.
I don't remember if this was an HP Superdome running HP-UX or a Sun
Enterprise 10000 running Solaris.
RAS is not specific to IBM, though I do think that IBM trademarked the
name / phrase.
I'm not aware of any x86_64 servers being anywhere near this level of
reliability.
Aside: I think much of the Unix industry decided to move complexity and
cost out of the hardware and instead put it into software that runs on
more commodity / inexpensive hardware.
Having a super reliable basket with all your eggs in it is still all
your eggs in one basket.
z/OS not only coordinates with the hardware when resources visible to
z/OS are affected by failures and concurrent maintenance, it is also
designed with the philosophy that software failures may occur within
parts of the operating system, either from a hardware failure or a
system software bug. System recovery routines exist to clean up after
such failures, limit what running address spaces are affected, and allow
production to continue in unaffected address spaces.
I can't enumerate specifics, but I believe non-mainframe platforms have
comparable recovery mechanisms that speak to this.
Another important feature of z/OS that requires some hardware
coordination is System Management Facilities (SMF), which gathers
measurements of system activity and resource usage at a level to support
performance tuning or billing based on resource usage.
How much of SMF is hardware vs software?
System accounting -- originally for billing -- has been used for a long
time to provide information for system scaling.
Aside from the fact that z/OS is closed-source and only licensed by IBM to
specific hardware, if you could somehow succeed in running it under
Linux or on non-z hardware, it would lose the reliability, availability,
and serviceability it gets from that hardware/software synergy that
makes it an ideal production platform for critical workloads.
There is an entire hobby genre doing exactly this.
I absolutely agree that it does not have anywhere near the same RAS that
z Series has. But I also realize that not everybody needs, much less is
willing to pay for, such RAS features.
It doesn't matter how reliable the single basket is if the network
connectivity into the facility is cut. -- This is one of the places
that having redundancy higher in the application stack and distributing
load geographically starts to shine.
An IBM mainframe is a very impressive system. A Cadillac is a very
impressive car. But using an IBM mainframe to serve files in a small
office is about as appropriate as using the Cadillac to deliver pizzas.
--
Grant. . . .