On Mon, 13 May 2019 21:17:32 -0600, Grant Taylor wrote:

>On 5/13/19 9:46 AM, John McKown wrote:
>> Yes, we have had a TCM fail. I was almost called a liar when I told the
>> Windows people that the z simply switches the work transparently (on the
>> hardware level) to another CP. They were shocked and amazed that we
>> could "hot swap" a new TCM into the box without any outage.
>
>TCM as in Thermal Conduction Module?
>
>IMHO that's mildly impressive.  I say /mildly/ because I would /expect/
>a mainframe to be able to survive that without (significantly) impacting
>the workload.

Mildly?
You can leave out the parenthetical "significantly".
z machines can take a hard failure of a CP and a spare is switched in 
dynamically to take over the work. The unit of work that was running on 
that processor is moved to the new processor without interruption. 
There is a brief pause while it is switched, but the workload is not impacted. 
The operating system does not have to do anything to make this happen. 
It is done entirely in hardware. The failed processor can even be running 
critical operating system functions. It makes no difference.

>I also would like to think that some of the more advanced schedulers in
>Linux could detect that a CPU (set of cores) was overheating and needed
>to be taken out of service.  I would hope that if the chassis was
>designed properly, a good CE could replace the TCM without taking the
>machine down.

Sure, detecting a potential failure situation and responding to that should 
be relatively trivial.

>I'm also assuming that the CPU was not actually faulted and would still
>pass sanity checks as long as no actual work was scheduled on it.

That's the big difference, isn't it?

-- 
Tom Marchant

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
