John Campbell wrote:
>The zSeries is AT LEAST 5 9's hardware.

Philipp Kern wrote:
>Does that matter in today's world? Would you avoid building for failure
>when a lot of the failure comes from software anyway? Do you then host
>multiple Linux VMs on the same iron to account for that? If so, why
>can't that scale horizontally?
That's a great question. I'll attempt an answer, and then if you have some follow-up questions, great.

First of all, I hope we can agree that execution and data integrity are frequently important. If the processor occasionally reports "5" as the answer to 2+2, or if a bit gets flipped within one field in a fund transfer record, that glitch could be catastrophic. There's more technology in IBM Z and LinuxONE systems to prevent integrity errors than in any other computing platform (so far as I'm aware). Keeping electrons properly channeled is getting more difficult with process shrinks, so integrity considerations are likely more relevant now than at any time since transistor technology matured.

Second, there's a huge amount of availability data gathered from real-world failure analysis, and it stretches back decades. There are some clear lessons learned, and the lessons sometimes apply in other contexts beyond computing, such as aviation safety, military campaigns, nuclear power plant operations, and so on. One lesson that is widely understood and recognized is a "defense in depth" strategy. You said it yourself: "a lot of the failure comes from software." "High Availability" clustering software, for example, is also software. There is indeed some excellent clustering software, but it isn't always perfect. If the system's availability is well defended, in depth, then the clustering software is not the only line of defense, and vice versa. It's simply wise practice to push as much of the availability engineering as deep into the computing infrastructure as possible, so it "just works" if called upon. And if one layer fails to do its job, there's at least one other layer of protection.

In aviation safety the experts talk about "failure chains" (or "error chains"): the concept that an aviation failure resulting in loss of life or injury usually only happens when multiple failures combine.
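The "defense in depth" point can be sketched with some simple probability arithmetic: if protection layers fail independently, the chance that every layer fails at once shrinks multiplicatively. This is a minimal illustration with made-up layer failure probabilities, not measurements of any real platform:

```python
# Illustrative "defense in depth" arithmetic. The failure probabilities
# below are invented for the example; real layers are also rarely
# perfectly independent, which is why reducing correlation matters.

def combined_unavailability(layer_failure_probs):
    """Probability that every independent layer fails simultaneously."""
    p = 1.0
    for q in layer_failure_probs:
        p *= q
    return p

# One software layer that fails 1% of the time:
print(combined_unavailability([0.01]))         # roughly 99% available

# Add an independent hardware layer that fails 0.1% of the time:
print(combined_unavailability([0.01, 0.001]))  # roughly "five 9s"
```

The catch, as the failure-chain lesson says, is the independence assumption: correlated failures (one bug, one operator error, one power feed) collapse the layers back toward a single point of failure.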
If only one error or failure results in a catastrophe, then aeronautical safety engineers figure out ways to lengthen that too-short chain. And if multiple errors/failures do occur, they look for ways to break the chains, to reduce or eliminate correlation between errors/failures -- to make them as independent and rare as possible. The same philosophy applies to mission-critical computing, fundamentally.

Third, it's often not acceptable to schedule service downtime. Especially (but not only) for application services that are stateful, and that must be truly continuous (or as near continuous as possible), there's tremendous value in being able to add capacity, upgrade firmware, replace parts, and otherwise change system components while the applications keep running, with point-in-time data consistency, respecting ACID properties, and so on. Even if you cluster, often you cannot afford to bring down the whole cluster in order to upgrade or service it. (IBM Z clustering options are unique in many ways, including the fact that they are not solely or predominantly software-based.)

Fourth, no, it's not always possible to scale horizontally, although you certainly can with IBM Z and LinuxONE machines if the workloads allow it. Programmers and others keep trying to improve parallel processing efficiencies, but some programs can never split well across multiple servers. A somewhat popular analogy here is human gestation. Human pregnancies currently last about 40 weeks. It's not currently possible for four women to reduce that gestation time to 10 weeks, or even to 25 weeks. Adding gestational resources doesn't help reduce that elapsed time. Human gestation is inherently single threaded. And, extending the analogy, if you want to give birth to a baby elephant, you need one bigger mother (an elephant). Anyway, there are such problems in computing: "big server" problems.
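The gestation analogy is essentially Amdahl's law: the speedup from N workers is capped by whatever fraction of the job is inherently serial. A quick sketch, with illustrative serial fractions:

```python
# Amdahl's law: speedup from n workers when a fraction s of the work
# is inherently serial. The fractions below are illustrative only.

def amdahl_speedup(serial_fraction, n_workers):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# A fully serial job (the "gestation" case) gains nothing from extra workers:
print(amdahl_speedup(1.0, 4))    # 1.0 -- four mothers don't help

# Even a job that is only 10% serial tops out well below n:
print(amdahl_speedup(0.1, 4))    # ~3.08x
print(amdahl_speedup(0.1, 1000)) # ~9.91x -- never reaches 10x
```

For jobs with a large serial fraction, the only lever left is a faster single engine -- the "bigger mother" of the analogy.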
Solving many of those computing problems quickly, correctly, reliably, and securely is often extremely valuable to businesses and governments. If anything, there seem to be more such computing problems emerging lately. Many of these problems match up terrifically with IBM Z and LinuxONE platform capabilities, and some don't.

Fifth, securely encrypting everything, at multiple levels, and at scale is an unavoidable requirement if we're ever going to protect civilization from data breaches, privacy invasions, and associated bad outcomes. IBM z14 and LinuxONE Emperor II machines are unique in that respect, too. (Predecessor models allow you to move closer to that ideal.)

Finally, since workloads vary in their needs and characteristics, sometimes a lot, *thank goodness* there are a few different computing platform choices.

--------------------------------------------------------------------------------------------------------
Timothy Sipples
IT Architect Executive, Industry Solutions, IBM Z and LinuxONE, AP/GCG/MEA
E-Mail: sipp...@sg.ibm.com

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to lists...@vm.marist.edu with the message: INFO LINUX-390
or visit http://www.marist.edu/htbin/wlvindex?LINUX-390
----------------------------------------------------------------------
For more information on Linux on System z,
visit http://wiki.linuxvm.org/