On 8/3/23 12:47 PM, Joel C. Ewing wrote:
The hardware is designed with redundancy to detect failures in components (processors, memory, I/O subsystems, interconnection cables), correct any resulting data errors where possible, retry a failed operation using different hardware components where appropriate, vary a failing component off line, and in many cases allow concurrent repair of failing components while production continues.  Undetected hardware errors don't happen.

Save for retrying a failed operation, the rest of those statements aren't specific to IBM mainframes.

I remember reading about a Unix server being demonstrated at a trade show while running applications interactively: the demonstrators removed all but one CPU book from the system, reinserted the removed CPU books, then removed the one they had left in place, and reinserted it. At a later demonstration they took a cup of water and poured it into the top of the system. What was running continued to run in both demonstrations. The real-time demo programs didn't even stutter. What was obvious was that other non-real-time programs running on the system slowed down as the OS reacted to hardware going offline and rescheduled tasks on the remaining online CPUs. Monitoring agents lit up like a Christmas tree as the CPU books were removed but became happier as the books were reinserted.

My understanding was that this was a system that was shipping in the mid to late '90s and people were buying them. Thus not a demonstration special.

I don't remember whether this was an HP Superdome running HP-UX or a Sun Enterprise 10000 running Solaris.

RAS is not specific to IBM. Though I do think that IBM trademarked the name / phrase.

I'm not aware of any x86_64 servers being anywhere near this level of reliability.

Aside: I think much of the Unix industry decided to move complexity and cost out of the hardware and instead put it into software that runs on more commodity / inexpensive hardware.

Having a super reliable basket with all your eggs in it is still all your eggs in one basket.

z/OS not only coordinates with the hardware when resources visible to z/OS are affected by failures and concurrent maintenance, it is also designed with the philosophy that software failures may occur within parts of the operating system, either from a hardware failure or a system software bug. System recovery routines exist to clean up after such failures, limit what running address spaces are affected, and allow production to continue in unaffected address spaces.

I can't enumerate specifics, but I believe non-mainframe operating systems have comparable mechanisms for recovering from, and containing, software failures.

Another important feature of z/OS that requires some hardware coordination is the System Management Facilities (SMF), which gathers measurements of system activity and resource usage at a level to support performance tuning or billing based on resource usage.

How much of SMF is hardware vs software?

System accounting -- originally for billing -- has been used for a long time to provide information for system scaling.

Aside from the fact that z/OS is closed-source and only licensed by IBM to specific hardware, if you could somehow succeed in running it under Linux or on non-z hardware, it would lose the reliability, availability, and serviceability it gets from that hardware/software synergy that makes it an ideal production platform for critical workloads.

There is an entire hobby genre doing exactly this.

I absolutely agree that it does not have anywhere near the same RAS that z Series has. But I also realize that not everybody needs, much less is willing to pay for, such RAS features.

It doesn't matter how reliable the single basket is if the network connectivity into the facility is cut. -- This is one of the places that having redundancy higher in the application stack and distributing load geographically starts to shine.

An IBM mainframe is a very impressive system. A Cadillac is a very impressive car. But using an IBM mainframe to serve files in a small office is about as appropriate as using the Cadillac to deliver pizzas.



--
Grant. . . .
