[email protected] (Charles Mills) writes:
> It is hard to prepare for unknown unknowns. It is legendary that people
> have had recovery failures because the fallover switch (channel, power,
> network, whatever) failed.
re:
http://www.garlic.com/~lynn/2017c.html#13 Check out Massive Amazon cloud service outage disrupts sites

unk-unks are frequently a matter of not having done detailed end-to-end evaluation of all possible scenarios. we started out claiming no-single-point-of-failure ... but it required doing end-to-end walk-throughs looking for all sorts of critical components (I was also asked to review IBM RAID designs and sometimes would uncover single-points-of-failure in the least anticipated places). This included replicated failover switches and the order of precedence among recovery processes (as part of handling some race conditions).

Also needed an inverse "RESERVE" ... there is a failure case where a processor gets suspended just before a write operation, which then kicks off recovery processes. The processor that is assumed to have failed has to be "fenced off" from proceeding with the write operation when it wakes up (aka RESERVE allows only one processor to write and prevents all others; inverse "RESERVE" blocks one or more identified processors from writing; there also has to be a tie-breaker process for race conditions).

"real" no-single-point-of-failure contributed to having to specify geographically separated operation. we also started defining what was needed to handle multiple points of failure ... and looking at 5-nines availability configurations.
http://www.garlic.com/~lynn/submain.html#available

As undergraduate at the univ ... I was first hired as fulltime person responsible for IBM production mainframe systems. Then before graduation, I was hired fulltime by Boeing to help with creation of Boeing Computer Services (consolidating all dataprocessing in an independent business unit to better monetize the investment, including offering services to non-Boeing entities). I thought the Renton datacenter was possibly the largest in the world, with something like $300M (late-60s dollars) in ibm mainframes (360/65s were arriving faster than they could be installed).
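The inverse-"RESERVE" fencing and tie-breaker described above can be sketched in a few lines. This is a toy model with hypothetical names (SharedDisk, fence, tie_break are illustrative, not any actual channel/controller interface): recovery fences the presumed-failed processor before taking over, so a late wake-up write is refused, and a deterministic tie-breaker resolves races between survivors.

```python
# Toy sketch of "inverse RESERVE" fencing; names are illustrative, not
# an actual hardware/controller interface.

class SharedDisk:
    def __init__(self):
        self.fenced = set()   # processor ids blocked from writing
        self.blocks = {}      # block name -> data

    def fence(self, proc_id):
        """Inverse RESERVE: block one identified processor from writing;
        all others may still write (RESERVE is the opposite: one writer,
        everyone else blocked)."""
        self.fenced.add(proc_id)

    def write(self, proc_id, block, data):
        if proc_id in self.fenced:
            raise PermissionError(f"processor {proc_id} is fenced off")
        self.blocks[block] = data

def tie_break(candidates):
    """Race-condition tie-breaker: a deterministic rule (lowest id wins)
    so survivors that concurrently start recovery agree on who proceeds."""
    return min(candidates)

disk = SharedDisk()
disk.fence(1)                    # processor 1 presumed failed; fence first
disk.write(2, "MFD", "new")      # survivor performs the recovery write
try:
    disk.write(1, "MFD", "stale")  # processor 1 wakes up late ... refused
except PermissionError:
    pass
```

The key ordering is that the fence must be in place before recovery writes begin; otherwise the suspended processor could wake up and complete its pending write over the recovered data.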
747#3 was flying the skies of seattle getting FAA flt. certification. There was also a decision to replicate Renton up at the new 747 plant in Everett ... there was a disaster scenario where Mt. Rainier heats up and the resulting mudslide takes out the Renton datacenter. I finally joined IBM (science center) after graduation ... some past posts
http://www.garlic.com/~lynn/subtopic.html#545tech

One of the things that was exposed in the 70s with respect to IBM dasd ... IBM mainframe channel/dasd had (for some time, up through the 80s) an undetectable power-interruption failure mode in the middle of a write operation ... the controller/dasd had sufficient power to complete the write correctly, but there wasn't sufficient power to transfer data from processor memory ... so the record write was completed with all zeros and valid error-correcting codes. In the CMS case, the MFD is somewhat equivalent to the OS VTOC; a change was made to have pairs of alternating MFD records, with a sequence number appended. A power-interrupted MFD write would zero all or part of the appended sequence ... so it wouldn't appear most current during recovery (and the other MFD would be used). Towards the mid-80s there was controller work to try and handle the case (for operating systems that didn't know how). A later hardware solution was that all the data had to be available for the write to start.

Later they let me play disk engineer in bldgs 14&15 ... some past posts
http://www.garlic.com/~lynn/subtopic.html#disk

they had a bunch of mainframes for dasd engineering testing that were scheduled stand-alone 7x24, around the clock. They had once tried to do testing under MVS ... but in that environment MVS had a 15min MTBF, requiring manual re-ipl. I offered to redo the input/output supervisor to be bullet-proof and never fail ... greatly improving productivity, allowing anytime, on-demand concurrent testing. When I wrote it up in an internal report, I may have made a mistake mentioning the MVS 15min MTBF ...
because I was later told that the MVS RAS group did their best to have me separated from the company.

A couple years later ... field engineering had a 3880 controller error-regression test with 57 "injected" errors (that they considered typical and likely to occur). MVS was failing in all 57 cases (requiring manual re-ipl) ... and in 2/3rds of the cases there was no indication of what was responsible for the failure ... previously posted old email
http://www.garlic.com/~lynn/2007.html#email801015

trivia: I had worked with Jim Gray at IBM SJR ... before he left for Tandem. At Tandem he did a detailed analysis of failure modes, finding that hardware was in the process of becoming significantly more reliable ... and failures were starting to shift to human error, software bugs, and environmental causes (power, acts of nature, etc) ... copy of summary from that study
http://www.garlic.com/~lynn/grayft84.pdf

--
virtualization experience starting Jan1968, online at home since Mar1970

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
