I am sponsoring this case on behalf of Gavin Maltby. This case seeks patch
binding, with the generic MCA FMA support it delivers targeting a Solaris
update release. The case straightforwardly extends PSARC/2006/020 and
PSARC/2006/564 and adds FMA fault and ereport events suitable for generation
and consumption on all x86 platforms.
Cindi
This information is Copyright 2007 Sun Microsystems
1. Introduction
1.1. Project/Component Working Name:
Generic x86 Machine Check Architecture FMA
1.2. Name of Document Author/Supplier:
Author: Gavin Maltby
1.3 Date of This Document:
10 October, 2007
4. Technical Description
4.1. Project Summary
4.1.1 Project Description:
CPU, and in some cases memory, error report telemetry falls
under the Machine Check Architecture which all recent AMD and
Intel x86 chip offerings conform to. Existing FMA support for x86
chips is limited to AMD family 0xf alone - Opteron/Athlon64/Turion64
revisions B to F that have shipped in Sun x64 systems to date.
In expanding this support to include the upcoming AMD family 0x10
and new Sun Intel-based x64 products we have implemented a generic
machine check architecture FMA implementation that supports all
chips that support the MCA, and have layered the existing AMD
support and additional model-specific support on top of that.
This case presents the new FMA events that are defined for the
new generic MCA implementation, and modifies the definitions
of existing AMD events. The associated FMA portfolio referenced
below has already been approved at FMA portfolio review.
4.1.2 Details:
The existing implementation for AMD family 0xf delivered
a "cpu module" cpu.AuthenticAMD.15 which contains all the smarts
for error handling, error classification, error logging etc.
The telemetry arising from this module is consumed by the
eversholt diagnosis engine, applying a set of AMD-specific rules.
For any other chip type (i.e., not AMD family 0xf) a dumb
cpu.generic module provided token support for machine check traps
and raised no telemetry for diagnosis.
In the new implementation most of the existing AMD code is
refactored into an improved and fully FMA-aware cpu.generic.
Model-specific aspects, such as the more-detailed error classification
possible with AMD chips (i.e., classification beyond what is
considered "architectural" in the generic MCA), are handled
in a relatively lightweight "model-specific cpu module" that is
layered on top of cpu.generic. The following model-specific
modules will be delivered:
Module                    Description
------------------------  ---------------------------------------
cpu_ms.AuthenticAMD.15    AMD family 0xf model-specific support
cpu_ms.AuthenticAMD       "Generic AMD" model-specific support
cpu_ms.GenuineIntel       "Generic Intel" model-specific support
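The layering in the table can be illustrated with a short sketch. It is written in Python for brevity and is not the kernel implementation: the function name, the DELIVERED set, and the most-specific-first lookup order are assumptions made for illustration. It also does not model a module declining a chip it does not recognise.

```python
# Hypothetical sketch: choose the model-specific module to layer on
# cpu.generic. Names and lookup order are illustrative assumptions.

# Modules listed in the table above.
DELIVERED = {
    "cpu_ms.AuthenticAMD.15",
    "cpu_ms.AuthenticAMD",
    "cpu_ms.GenuineIntel",
}


def select_model_specific(vendor, family, available=DELIVERED):
    """Return the most specific module name present in `available`,
    or None if cpu.generic must run without model-specific support."""
    candidates = [
        "cpu_ms.%s.%d" % (vendor, family),  # family-specific module
        "cpu_ms.%s" % vendor,               # vendor-generic module
    ]
    for name in candidates:
        if name in available:
            return name
    return None
```

Under these assumptions an AMD family 0xf chip resolves to cpu_ms.AuthenticAMD.15, any other AMD family resolves to cpu_ms.AuthenticAMD, and a chip from another vendor gets no model-specific module.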
The following support combinations are possible:
a) cpu.generic absent
Ereport classes: None
Fault classes: None
The system will have zero MCA capability and no FMA event
will be raised for cpu or memory.
b) cpu.generic with no model-specific module
Ereport classes: ereport.cpu.generic-x86.*
Fault classes: fault.cpu.generic-x86.*
This applies on chips that are not from AMD or Intel. It also
applies if no AMD or Intel model-specific support initializes,
which could happen if initialization fails or the support is
disabled, and would also happen for new chip families from AMD
and Intel that not even the "generic AMD" and "generic Intel"
model-specific modules are able to claim support for.
Telemetry will be raised entirely by cpu.generic in the
ereport.cpu.generic-x86.* ereport class, with no model-specific
augmentation. It will be diagnosed by a corresponding set
of eversholt rules to produce faults in the class
fault.cpu.generic-x86.*.
c) cpu.generic with cpu_ms.AuthenticAMD.15
Ereport classes: ereport.cpu.amd.*
Fault classes: fault.cpu.amd.*, fault.memory.*
This applies to AMD family 0xf systems, i.e. it is the
combination that replaces cpu.AuthenticAMD.15, which
is the only current Solaris MCA support for x86 systems.
To maintain compatibility, this combination will produce
ereports in precisely the same classes as before -
ereport.cpu.amd.* - and the existing AMD eversholt rules
will diagnose these to fault.cpu.amd.* and fault.memory.amd.*.
This is achieved by cpu.generic allowing model-specific
support to classify the error and provide the ereport class
to use. The ereport payload *is* changed since most of it
is now generated by cpu.generic with only a limited degree
of augmentation from the model-specific support; this change
is limited to the renaming of a number of ereport payload
members, and the addition of a number of new members. The
few AMD diagnosis rules that access payload members are
changed to use the new names. In the event definitions,
existing AMD ereports are retained as payload version 0
while the revised (renamed, extended) payload is introduced
as payload version 1.
d) cpu.generic with cpu_ms.AuthenticAMD
Ereport classes: ereport.cpu.generic-x86.*
Fault classes: fault.cpu.generic-x86.*,
fault.memory.generic-x86.*
This will apply on AMD systems where no more-specific
model-specific support exists or initializes; this
will include AMD family 0x10 systems.
Most telemetry is provided entirely by cpu.generic.
Since these AMD systems include an on-chip memory-controller
which falls under the MCA umbrella we can also
recognise and diagnose memory errors, and this is
done in the model-specific module.
e) cpu.generic with cpu_ms.GenuineIntel
Ereport classes: ereport.cpu.intel.*
Fault classes: fault.cpu.intel.*,
This case presents the FMA events for a) - d). A separate
case will present the events for e) along with additional
Intel memory-controller-hub events.
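The way combinations a) to d) arrive at an ereport class can be sketched as follows. This is an illustration in Python, not the kernel code: the ModelSpecific class, its classify() hook, and the error leaf name used in the usage note are hypothetical. The only points taken from the text above are that cpu.generic lets model-specific support supply the ereport class, and that it otherwise raises events in ereport.cpu.generic-x86.*.

```python
# Hypothetical sketch of ereport class selection for a) - d).

GENERIC_EREPORT = "ereport.cpu.generic-x86"


class ModelSpecific:
    """Illustrative stand-in for a model-specific cpu module."""

    def __init__(self, name, ereport_prefix=None):
        self.name = name
        # None means: leave classification to cpu.generic.
        self.ereport_prefix = ereport_prefix

    def classify(self, error):
        """Return a full ereport class, or None to defer to cpu.generic."""
        if self.ereport_prefix is None:
            return None
        return "%s.%s" % (self.ereport_prefix, error)


def ereport_class(cpu_generic_loaded, ms, error):
    """Return the ereport class to raise, or None if no event is raised."""
    if not cpu_generic_loaded:
        return None  # a) no MCA capability, no FMA events
    if ms is not None:
        cls = ms.classify(error)
        if cls is not None:
            return cls  # c) model-specific support provides the class
    # b) no model-specific module, or d) one that defers to cpu.generic
    return "%s.%s" % (GENERIC_EREPORT, error)
```

For example, with ModelSpecific("cpu_ms.AuthenticAMD.15", "ereport.cpu.amd") a hypothetical leaf "example" yields ereport.cpu.amd.example, while ModelSpecific("cpu_ms.AuthenticAMD") yields ereport.cpu.generic-x86.example.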
The accompanying erecheck.html details all changes to the
SMI Events Registry.
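The relationship between payload version 0 and payload version 1 described under combination c) amounts to renaming some members and adding others. A consumer that must accept both versions could normalise them along these lines. This is a sketch only: the member names in the rename table are placeholders, not names from the Events Registry, and the use of a "version" key to tell the two payloads apart is an assumption.

```python
# Hypothetical sketch: normalise a version 0 payload to version 1 names.

# Placeholder rename table (version 0 name -> version 1 name).
# The real renames are listed in erecheck.html, not here.
V0_TO_V1 = {
    "old_member": "new_member",
}


def upgrade_payload(payload):
    """Return a copy of `payload` using version 1 member names.

    Members added in version 1 cannot be synthesised from a version 0
    payload, so they are simply absent from the result.
    """
    if payload.get("version", 0) >= 1:
        return dict(payload)  # already version 1, copy unchanged
    upgraded = {V0_TO_V1.get(key, key): value
                for key, value in payload.items()}
    upgraded["version"] = 1
    return upgraded
```

Keeping the renames in one table means that rules written against the new names need no per-member special cases.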
The associated FMA portfolio is here:
http://wikihome.sfbay.sun.com/fma-portfolio/Wiki.jsp?page=2007.025.x86MCA
The portfolio includes full philosophy and diagnosis documents.
4.5. Interfaces:
The "cpu module interface" was introduced by the AMD FMA work.
It is a project-private interface. The current work substantially
revises this interface and, in particular, prepares it to work
in the presence of a hypervisor such as Solaris xVM (changing it
from a "cpu_t interface" to a "chip/core/strand interface").
The interface remains project-private.
4.9. I18N/L10N Impact:
A new FMA dictionary and .po/.mo "GMCA" are delivered.
5. Reference Documents:
PSARC/2006/564 FMA for Athlon 64 and Opteron Rev F/G Processors
PSARC/2006/020 FMA for Athlon 64 and Opteron Processors
6. Resources and Schedule
6.4. Steering Committee requested information
6.4.1. Consolidation C-team Name:
ON
6.5. ARC review type: Automatic
6.6. ARC Exposure: open