John Campbell wrote:
>The zSeries is AT LEAST 5 9's hardware.

Philipp Kern wrote:
>Does that matter in today's world? Would you avoid building for failure
>when a lot of the failure comes from software anyway? Do you then host
>multiple Linux VMs on the same iron to account for that? If so, why
>can't that scale horizontally?

That's a great question. I'll attempt an answer, and if you have follow-up
questions, great.

First of all, I hope we can agree that execution and data integrity are
frequently important. If the processor occasionally reports "5" as the
answer to 2+2, or if a bit gets flipped within one field in a fund transfer
record, that glitch could be catastrophic. There's more technology in IBM Z
and LinuxONE systems to prevent integrity errors than in any other
computing platform (so far as I'm aware). Keeping electrons properly
channeled is getting more difficult with process shrinks, so integrity
considerations are likely more relevant now than at any time since
transistor technology matured.
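To make the bit-flip point concrete, here's a deliberately simplified sketch of the idea behind hardware integrity checking: record a check value when data is written, recompute it on read, and flag any mismatch. (Real mainframe hardware uses far stronger mechanisms such as ECC and instruction retry; a single parity bit, as below, can only detect an odd number of flipped bits. The function name is my own illustration.)

```python
def parity(word: int) -> int:
    """Even-parity bit over an integer's binary representation."""
    return bin(word).count("1") % 2

stored = 0b10110100
check = parity(stored)          # check bit recorded when the word is written

corrupted = stored ^ (1 << 3)   # a single bit flips in storage

# On read, recompute the parity and compare it with the recorded bit.
print("intact word passes:", parity(stored) == check)        # True
print("bit flip detected:", parity(corrupted) != check)      # True
```

Without that recorded check bit, the corrupted word would be returned silently, which is exactly the "2+2=5" scenario above.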

Second, there's a huge amount of availability data gathered from real-world
failure analysis, and it stretches back decades. There are some clear
lessons learned, and the lessons sometimes apply in other contexts beyond
computing, such as in aviation safety, military campaigns, nuclear power
plant operations, and so on. One lesson that is widely understood and
recognized is a "defense in depth" strategy.

You said it yourself: "a lot of the failure comes from software." "High
Availability" clustering software, for example, is also software. There is
indeed some excellent clustering software, but it isn't always perfect. If
system availability is well defended, in depth, then the clustering
software is not the only line of defense, and no other single layer has to
be, either. It's simply wise practice to push availability engineering as
broadly and as deep into the computing infrastructure as possible, so it
"just works" if called upon. And if one layer fails to do its job, there's
at least one other layer of protection.
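The arithmetic behind defense in depth is worth seeing. If the defensive layers fail independently, an outage reaches the application only when every layer fails at once, so the failure probabilities multiply. A back-of-the-envelope sketch (the function name and the 99.9% figures are illustrative, not measurements):

```python
def combined_availability(layer_availabilities):
    """Availability when an outage requires every independent layer to fail."""
    p_all_fail = 1.0
    for a in layer_availabilities:
        p_all_fail *= (1.0 - a)     # probabilities multiply if independent
    return 1.0 - p_all_fail

single = combined_availability([0.999])           # one layer: "three nines"
layered = combined_availability([0.999, 0.999])   # two independent layers

print(f"{single:.6f} -> {layered:.6f}")   # 0.999000 -> 0.999999
```

Two modest, genuinely independent layers compose into "six nines" -- which is why the independence assumption, not any one layer's quality, is usually the hard part.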

In aviation safety the experts talk about "failure chains" (or "error
chains"), the concept that an aviation failure resulting in loss of life or
injury usually happens only when multiple failures combine. If any single
error or failure could result in a catastrophe, then aeronautical safety
engineers figure out ways to lengthen that too-short chain. And if
multiple errors/failures do occur, they look for ways to break the chains
-- to reduce or eliminate correlation between errors/failures, making them
as independent and rare as possible. The same philosophy fundamentally
applies to mission-critical computing.
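Both halves of that strategy -- lengthening the chain and breaking correlation -- show up directly in the numbers. A hedged sketch (the function name and the 0.1% per-link figure are my own illustration):

```python
def chain_failure_probability(p_link: float, links: int) -> float:
    """Probability that every link in a chain of independent safeguards fails."""
    return p_link ** links

# Lengthening the chain multiplies rarity, one factor per independent link:
for links in (1, 2, 3):
    print(links, chain_failure_probability(0.001, links))

# Correlation collapses the chain: if the second safeguard always fails
# whenever the first does (a common-cause failure), the joint probability
# is back to 0.001 -- as if the extra link weren't there at all.
common_cause = 0.001
print("correlated:", common_cause)
```

That's why the aviation analogy stresses independence: a second layer that shares a failure mode with the first adds almost nothing.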

Third, it's often not acceptable to schedule service downtime. Especially
(but not only) for application services that are stateful, and that must be
truly continuous (or as near continuous as possible), there's tremendous
value in being able to add capacity, upgrade firmware, replace parts, and
otherwise change system components while the applications keep running,
with point-in-time data consistency, respecting ACID properties, and so on.
Even if you cluster, often you cannot afford to bring down the whole
cluster in order to upgrade or service it. (IBM Z clustering options are
unique in many ways, including the fact that they are not solely or
predominantly software-based.)

Fourth, no, it's not always possible to scale horizontally, although you
certainly can with IBM Z and LinuxONE machines if the workloads allow it.
Programmers and others keep trying to improve parallel processing
efficiencies, but there are some programs that can never split well across
multiple servers. A somewhat popular analogy here is human gestation. Human
pregnancies currently last about 40 weeks. It's not currently possible for
four women to reduce that gestation time to 10 weeks or even to 25 weeks.
Adding gestational resources doesn't help reduce that elapsed time. Human
gestation is inherently single threaded. And, extending the analogy, if you
want to give birth to a baby elephant, you need one bigger mother (an
elephant).
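The gestation analogy is essentially Amdahl's law: if a fraction of the work is inherently serial, adding workers helps only with the rest, and the serial fraction caps the speedup. A minimal sketch of that bound (function name mine):

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl's law: best-case speedup with n workers when a fraction
    of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# A 100% serial job (like gestation) gains nothing from four workers:
print(amdahl_speedup(1.0, 4))     # 1.0

# Even a job that is only 5% serial caps out far below its worker count:
print(amdahl_speedup(0.05, 100))  # ~16.8, not 100
```

For workloads near the top of that curve, a faster single system (the "bigger mother") beats adding more of them.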

Anyway, there are such problems in computing, "big server" problems.
Solving many of those computing problems quickly, correctly, reliably, and
securely is often extremely valuable to businesses and governments. If
anything, there seem to be more such computing problems emerging lately.
Many of these problems match up terrifically with IBM Z and LinuxONE
platform capabilities, and some don't.

Fifth, securely encrypting everything, at multiple levels, and at scale is
an unavoidable requirement if we're ever going to protect civilization from
data breaches, privacy invasions, and associated bad outcomes. IBM z14 and
LinuxONE Emperor II machines are unique in that respect, too. (Predecessor
models allow you to move closer to that ideal.)

Finally, since workloads vary in their needs and characteristics, sometimes
a lot, *thank goodness* there are a few different computing platform
choices.

--------------------------------------------------------------------------------------------------------
Timothy Sipples
IT Architect Executive, Industry Solutions, IBM Z and LinuxONE, AP/GCG/MEA
E-Mail: sipp...@sg.ibm.com
